Ideal ECS: Why Startups Accidentally Reverse-Engineer Kubernetes
- Peter Stukalov
- Nov 3
- 9 min read

When you have a production incident, the first question you ask is: "What changed?" Everything was working perfectly, then something happened, and the "all good" state turned into an "on fire" state. AWS is incredibly reliable. In 99.99999% of cases, the problem is human. Someone made a change that created the problem.
Perhaps each individual change was correct on its own. But they weren't made at the same time, or they weren't applied at the same time.
Part 1. The Pain: Typical Startup Chaos
Let's look at realistic examples of where things start to fall apart.
The Consistency Problem: The "State Snapshot"
This is the central problem from which all others grow.
AWS Parameter Store is versioned. But it's versioned at the individual parameter level. AWS Secrets Manager is also versioned at the individual secret level.
You have a choice:
Store your entire configuration (for all services) in one giant parameter/secret. This gives you "snapshot" consistency, but it's unwieldy and completely breaks granular access control.
Store parameters/secrets separately. This is convenient, but you lose the guarantee of consistency. You can't say, "Give me the state of all parameters as of 10:00 AM."
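To make this concrete: fetching one parameter at an exact version is easy, but there is no call for "the whole configuration as of 10:00 AM." A minimal boto3 sketch (the parameter names are hypothetical):

import boto3

ssm = boto3.client("ssm")

# Fetching a single parameter at an exact version works fine:
# SSM GetParameter accepts a "name:version" selector.
api_key_v3 = ssm.get_parameter(Name="/my-app/api-key:3")["Parameter"]["Value"]

# But there is no "give me everything under /my-app/ as it looked at 10:00 AM".
# The best you can do is list the current values...
current = ssm.get_parameters_by_path(Path="/my-app/", Recursive=True)

# ...and then walk each parameter's history one by one to reconstruct
# a point-in-time snapshot yourself.
history = ssm.get_parameter_history(Name="/my-app/api-key")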
The Race Condition Problem: Autoscaling vs. Deployments
This is a symptom of the "Consistency Problem."
You change a parameter/secret (v2).
You update your Task Definition (v2), pinning it to the new version.
You start the deployment.
While the deployment is rolling out, autoscaling kicks in. It launches tasks from the OLD Task Definition (v1), but those tasks pull the NEW version (v2) of the parameters, because nothing pins them to v1. Boom. Production is down.
This problem is solvable: you have to strictly version your secrets/parameters and specify the exact ARN with the version in your Task Definition, not latest. But this is manual discipline, and it's easy to forget.
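To make that discipline concrete, here is roughly what a pinned Task Definition registration looks like with boto3 (all ARNs, account IDs, and role names below are placeholders; the Secrets Manager reference follows the arn:...:secret:NAME:json-key:version-stage:version-id convention):

import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="app-one",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    executionRoleArn="arn:aws:iam::123456789012:role/app-one-exec",  # placeholder
    containerDefinitions=[
        {
            "name": "app-one",
            # Pinned image tag, never :latest
            "image": "ealen/echo-server:v1.2.0",
            "secrets": [
                {
                    "name": "DB_PASS",
                    # Secrets Manager: full ARN + json-key, version-stage left
                    # empty, version-id pinned explicitly
                    "valueFrom": "arn:aws:secretsmanager:us-west-2:123456789012:"
                                 "secret:my-app/db-pass-AbCdEf:db-pass::a1b2c3d4...",
                },
                {
                    "name": "API_KEY",
                    # Parameter Store: ARN with an explicit version suffix
                    "valueFrom": "arn:aws:ssm:us-west-2:123456789012:parameter/my-app/api-key:3",
                },
            ],
        }
    ],
)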
But What About CodeDeploy? (The "80/20" Argument)
We have to address this immediately. An experienced engineer will say, "Dude, your problems are solved by AWS CodeDeploy (Blue/Green or Canary)."
And they would be 80% correct.
If you strictly version your parameters and secrets (as discussed above), CodeDeploy saves you from 80% of the problems related to the application. It will perfectly roll back your Task Definition.
But you are still left with the 20% of problems that come from infrastructure changes.
CodeDeploy does not control your infrastructure. It only rolls back the Task Definition. It will not roll back:
A new SQS queue you created (which the v2 app expects).
New IAM permissions you granted.
A new S3 bucket or changes to your ALB rules.
The problem is that a feature (e.g., "add profile picture uploads") is simultaneously code (v2), config (v2), and infrastructure (a new S3 bucket). CodeDeploy will only roll back the code, leaving you in an inconsistent state.
The Audit Problem: "What Is Actually in Prod Right Now?"
This is a direct result of the "Consistency Problem." To understand the full picture, you have to manually assemble a "snapshot":
The application version (from the Task Definition).
The secret versions/values (from Secrets Manager).
The parameter versions/values (from Parameter Store).
The infrastructure configuration (from Terraform/CloudFormation, assuming it hasn't "drifted" from reality).
This costs you precious minutes during an incident.
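Even for a single ECS service, answering "what is running right now" already takes a chain of API calls. A rough boto3 sketch (cluster and service names are hypothetical):

import boto3

ecs = boto3.client("ecs")

# 1. Which Task Definition revision is the service actually running?
svc = ecs.describe_services(cluster="prod", services=["app-one"])["services"][0]
td_arn = svc["taskDefinition"]

# 2. Which image, secrets, and parameters does that revision reference?
td = ecs.describe_task_definition(taskDefinition=td_arn)["taskDefinition"]
for container in td["containerDefinitions"]:
    print("image:", container["image"])
    for secret in container.get("secrets", []):
        # valueFrom points at Secrets Manager or Parameter Store; if it isn't
        # version-pinned, you still don't know which value the task pulled.
        print("secret:", secret["name"], "->", secret["valueFrom"])

# 3. The infrastructure side (queues, buckets, IAM) lives elsewhere entirely:
#    in Terraform/CloudFormation state, which may have drifted from reality.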
The "Source of Truth" Problem
How do startups usually deploy?
You have a repository for infrastructure (Terraform).
You have repositories for your microservices (CI/CD).
The question is, which of these repositories is responsible for the Task Definition? The CI pipeline? Or Terraform? And who is responsible for the Parameter Store?
In a multi-repo scenario (the most common), you have to conduct an investigation to figure out what changed and what's current. You have to sift through an endless list of "sources of truth": the service's Git repo, the infra Git repo, CI pipelines (both successful and broken), version history in Secrets Manager, and history in Parameter Store.
In a monorepo scenario (which seems like a solution), you only have one repo. But the problem remains: you still have to determine the current state by analyzing deployment pipeline logs and mentally constructing the diff between the desired state (in git) and the actual state (in AWS).
You don't have a single "source of truth." You have several: Git (for code), Git (for infra), Parameter Store (for configs), and Secrets Manager (for secrets). And they all live their own lives.
Part 2. The Path of the Samurai: Building the "Ideal ECS"
Can we simplify this? It would be great to have one simple file that describes a consistent snapshot of the entire system.
Something like this control-center.yaml:
dev:
  services:
    app-one:
      image: "ealen/echo-server:v1.2.0"
      resources:
        cpu: 1024
        memory: 2048
      routing:
        prefix: "/app-one"
      secrets:
        - from: arn:aws:secretsmanager:us-west-2:ACC_ID:secret:my-app/db-pass-123456:a1b2c3d4...
          key: db-pass
      parameters:
        - from: arn:aws:ssm:us-west-2:ACC_ID:parameter/my-app/api-key:3
      vars:
        super_important_future_enable: true
prod:
  services:
    app-one:
      image: "ealen/echo-server:v1.1.0"
      # ... and so on
This file is very simple to understand. We just need to somehow turn it into thousands of AWS resources.
Step 1: Terraform as a Templating Engine
The simplest and most obvious path: let's feed this file into Terraform, and it will create all the necessary resources. Terraform acts as our templating engine.
But how do we turn that "manual discipline" (from Part 1) into an automated invariant?
This explodes into a whole new layer of work:
TaskDefinition Generation: We must generate TaskDefinitions from our control-center.yaml so that the secret/parameter version "pin" is physically impossible to forget.
Pinned-Links Only: Only specific versions must be allowed in the TaskDefinition:
For SSM: arn:...:parameter/...:<version>
For Secrets Manager: arn:...:secret:NAME-...:<json-key>:<version-stage>:<version-id> (pin by version-stage or version-id; unused fields stay empty)
Policy-as-code: A Sentinel/OPA/Conftest check is added to the CI pipeline, using regex to validate valueFrom and image (forbidding :latest) and failing the PR if it finds a violation (a minimal homegrown version is sketched below).
CodeDeploy Hooks: Pre- and post-deploy hooks are added to quickly check that the new Task Definition revision doesn't contain any un-pinned links. If the check fails, the deployment auto-rolls back.
AWS Config: A custom rule is configured: "Any TaskDefinition revision without a secret/parameter version is non-compliant."
Just like that, "manual discipline" turns into an automated, but very complex, system to maintain.
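A minimal homegrown version of that CI guard, written here in Python rather than in Sentinel/OPA/Conftest (the file name and the regexes are illustrative and deliberately simplified):

import json
import re
import sys

# Pinned-reference patterns (simplified for illustration):
#   SSM:             arn:...:parameter/<name>:<numeric version>
#   Secrets Manager: arn:...:secret:<name> plus a json-key/version-stage/version-id suffix
PINNED_SSM = re.compile(r"^arn:aws:ssm:[^:]+:\d+:parameter/.+:\d+$")
PINNED_SECRET = re.compile(r"^arn:aws:secretsmanager:[^:]+:\d+:secret:[^:]+:.*[A-Za-z0-9].*$")

def violations(taskdef: dict) -> list[str]:
    problems = []
    for container in taskdef.get("containerDefinitions", []):
        image = container.get("image", "")
        if image.endswith(":latest") or ":" not in image:
            problems.append(f"unpinned image: {image}")
        for secret in container.get("secrets", []):
            ref = secret["valueFrom"]
            if not (PINNED_SSM.match(ref) or PINNED_SECRET.match(ref)):
                problems.append(f"unpinned reference: {ref}")
    return problems

if __name__ == "__main__":
    # taskdef.json is whatever the pipeline rendered from control-center.yaml
    with open(sys.argv[1] if len(sys.argv) > 1 else "taskdef.json") as f:
        found = violations(json.load(f))
    for problem in found:
        print("POLICY VIOLATION:", problem)
    sys.exit(1 if found else 0)  # a non-zero exit fails the PR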
Step 2: Speeding Up the Plan
The project grows, and... we start waiting a very long time for Terraform to run a plan across thousands of resources (including all those rules). We're wasting time.
The solution: Terragrunt. We slice the Terraform configuration into pieces (infrastructure separate from services). Now the plan for services runs faster.
Step 3: The Frankenstein
But now, instead of one CI pipeline, we have ten more for infrastructure that we also have to monitor. We can slice it even finer (service-by-service). It gets even faster, but...
We have a ton of new pipelines and templates. We have traded complexity for speed and consistency.
We are standing on a mountain of scripts, Terragrunt modules, and CI pipelines, and we still only care about three questions:
What is deployed right now?
How does it differ from the desired state?
What's the difference from the previous, working version?
We achieved consistency at the cost of this "Frankenstein." The problem is the price we pay to maintain it. Each of those 10+ pipelines is a potential point of failure. Its logs have to be analyzed. We have to control exactly which Terragrunt modules were applied and for which changes in the control-center.yaml. To get answers to our three questions, we have to analyze the input data (control-center.yaml), the templates (Terragrunt modules), and the apply logs (from all CI pipelines). This is extremely complex and fragile.
Part 3. The Revelation: The Off-the-Shelf Solution
Let's keep improving our system. How can we reduce this complexity without losing the benefits we've gained?
We need a system where:
There aren't dozens of pipelines.
There is absolute consistency (one source of truth).
If a configuration can't be applied, we see a clear diff between the desired and actual state.
Such a system exists. We are graduating from a homegrown "GitOps-lite" to the real thing. This system is called ArgoCD.
How Does ArgoCD Solve Our Problems?
ArgoCD is an operator that you install in Kubernetes, and it does exactly what we were dreaming of.
The "Dozens of Pipelines" Problem: ArgoCD replaces the key part of them: the push-based deploy pipelines. It uses a Pull-model, not Push. Instead of N pipelines that push changes, you have one operator (ArgoCD) that pulls the state from Git. Your CI pipeline now only does two things: 1) builds the image, and 2) updates the image in your control-center.yaml. That's it. CI no longer deploys.(Important nuance: CI is, of course, still responsible for build, test, and scan. And for complex strategies (Canary, Blue/Green) and managing DB migrations, there's a separate tool in this ecosystem—Argo Rollouts—which works on top of ArgoCD. But for now, we're focused on the core solution.)
The "Three Questions" Problem (Audit and Drift):
"What is deployed right now?" — You open the ArgoCD UI and see a live, visual graph of all your resources in production.
"How does it differ from desired?" — ArgoCD constantly (every 3 minutes) runs a diff between what's in Git (desired) and what's in the cluster (actual).
"What's the diff from the last version?" — Your entire deployment is a single commit in Git (controlled via Pull Request). A rollback is just git revert. You roll back a consistent snapshot: the app version, the config, and the variables.
The "Manual Edits" Problem (Self-Healing): ArgoCD is an infinite reconciliation loop (self-healing). If someone "fixes" production by hand in the console, ArgoCD (if auto-heal is on) will simply automatically revert it to match Git. Drift from manual edits is impossible.
"But, there's a catch..."
...this doesn't work with ECS. For this, you need a full-blown Kubernetes.
Does this increase our system's complexity compared to our "maxed-out" ECS system?
Yes and no. Kubernetes is conceptually simpler than the "Path of the Samurai" we just described. Why?
Because the "Path of the Samurai" on ECS is homegrown, fragmented, and imperative complexity. You are writing the glue, the retry logic, and the drift control yourself.
Kubernetes (EKS+Argo) is standard, integrated, and declarative complexity. It's not "simple," but it is a complete platform, not a box of parts.
You are trading your custom, fragile complexity for standard, documented complexity. Instead of maintaining your 'Frankenstein' that only you understand, you get a ready-made tool that does all the same things, but better.
The entire "Path of the Samurai" we walked in Part 2 wasn't the path to an "ideal ECS." It is the natural evolution of any growing system. You start with something simple ("just ECS"), and then you inevitably add a control-center, a templating engine, pipelines for speed, and pipelines for those pipelines...
And at the end of that path, you discover that you have reverse-engineered all the same components that Kubernetes and GitOps already solve.
Part 4. The Verdict: Fargate, Karpenter, and the Price of "Simple"
"But ECS Fargate is simple! You just press a button and it works!"
Yes, but that simplicity is an illusion you pay for twice.
1. You Pay with Money (Fargate vs. Karpenter)
Let's look at the facts. Let's compare Fargate (on-demand) prices with EKS (on-demand + Spot) for different load profiles.
A) 8 Tasks × (0.25 vCPU, 0.5 GiB)
Fargate: $0.09874/hr → $72.08/mo
EC2+EKS (1×c7g.large, on-demand): compute $0.0725/hr + EKS $0.10/hr = $0.1725/hr → $125.93/mo
Conclusion: At low load, Fargate is cheaper because the $0.10/hr for the EKS control plane eats all the EC2 savings.
B) 20 Tasks × (0.5 vCPU, 1 GiB) → 10 vCPU / 20 GiB Total
Fargate: $0.4937/hr → $360.40/mo
EC2+EKS (3×c7g.xlarge, on-demand): compute $0.435/hr + EKS $0.10 = $0.535/hr → $390.55/mo (slightly more than Fargate).
EC2+EKS (3×c7g.xlarge, ~60% Spot): $0.274/hr → $200.02/mo (44.5% cheaper than Fargate).
Conclusion: Without Spot, it's a tie. With Spot (which Karpenter manages perfectly), EKS+Karpenter is significantly cheaper.
C) 50 Tasks × (1 vCPU, 2 GiB) → 50 vCPU / 100 GiB Total
Fargate: $2.4685/hr → $1,802.01/mo
EC2+EKS (4×c7g.4xlarge, on-demand): compute $2.32/hr + EKS $0.10 = $2.42/hr → $1,766.60/mo (already cheaper).
EC2+EKS (4×c7g.4xlarge, ~60% Spot): $1.028/hr → $750.44/mo (58% cheaper than Fargate).
Conclusion: At high load, EKS+Karpenter beats Fargate on price even on-demand. With Spot, the savings are colossal.
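A quick sanity check on the arithmetic above: the monthly figures are simply the hourly rates multiplied by roughly 730 hours per month. An illustrative snippet using the hourly numbers from these scenarios (actual on-demand and Spot rates vary by region and over time):

HOURS_PER_MONTH = 730  # ~24 * 365 / 12

scenarios = {
    # name: (Fargate $/hr, EKS on-demand $/hr, EKS ~60% Spot $/hr)
    "A: 8 x (0.25 vCPU, 0.5 GiB)": (0.09874, 0.1725, None),
    "B: 20 x (0.5 vCPU, 1 GiB)":   (0.4937,  0.535,  0.274),
    "C: 50 x (1 vCPU, 2 GiB)":     (2.4685,  2.42,   1.028),
}

for name, (fargate, eks_od, eks_spot) in scenarios.items():
    print(name)
    print(f"  Fargate:       ${fargate * HOURS_PER_MONTH:,.2f}/mo")
    print(f"  EKS on-demand: ${eks_od * HOURS_PER_MONTH:,.2f}/mo")
    if eks_spot is not None:
        savings = (fargate - eks_spot) / fargate * 100
        print(f"  EKS ~60% Spot: ${eks_spot * HOURS_PER_MONTH:,.2f}/mo"
              f" ({savings:.1f}% cheaper than Fargate)")
# (Cent-level differences from the figures above are just float rounding.)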
2. You Pay with Speed (Developer Experience)
Developer Experience (DevEx):
ECS: To change a simple feature flag, a developer needs to know about Parameter Store, its versioning, and how it's tied to the Task Definition. Or, they have to ask DevOps.
GitOps (EKS+Argo): A developer changes super_important_future_enable: true in the control-center.yaml (the very same file from Part 2, which is now natively templated by Jsonnet right in ArgoCD) and opens a Pull Request. That's it. They don't need AWS access. They don't need to know about secrets or parameters. They work in Git. This removes 90% of the toil from DevOps and gives the team a "big red button" (the PR) for auditing all changes.
Mean Time to Recovery (MTTR):
ECS (CodeDeploy): As we established in Part 1, CodeDeploy perfectly solves 80% of the application problems. But the 20% of infrastructure problems (SQS, IAM, S3...) are left behind. It won't roll those back.
GitOps (EKS+Argo): git revert rolls back the entire consistent snapshot—the app version, the config, the secrets (SealedSecrets), and even external resources (like S3 or RDS, which GitOps can also manage).
Final Thought
Starting with simple ECS/Fargate is completely normal and is the right call for many startups.
But it's crucial to understand that as soon as you start to "improve" that simple system, you will almost certainly step onto the "Path of the Samurai" we described. You will start building your own custom GitOps framework on top of Terraform, Terragrunt, and CI pipelines.
And that isn't a dead-end path—it's just a very long and expensive one.
You will spend a year, and a lot of pain, building from scratch what the Kubernetes ecosystem calls ArgoCD—a free, open-source, and battle-hardened standard. You don't have to walk that path. You can start with the finished solution that was built to solve these exact problems.


