ECS capacity provider strategies let you control, service by service, how tasks are spread across different pools of capacity, moving beyond simple cluster-wide settings.

Let’s see this in action. Imagine you have two different application workloads on ECS: a low-priority batch job and a high-priority web service. You want to ensure the web service always has instances ready, even if it means paying a bit more, while the batch job can scale down aggressively and use cheaper Spot Instances that AWS can reclaim at short notice.

Here’s a simplified Terraform configuration, built around the aws_ecs_cluster_capacity_providers resource, that sets this up:

resource "aws_ecs_cluster" "my_cluster" {
  name = "my-special-cluster"
}

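resource "aws_ecs_capacity_provider" "fargate_on_demand" {
  # Custom capacity provider names cannot start with "aws", "ecs", or "fargate".
  name = "on-demand-managed"
  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.fargate_asg.arn
    managed_scaling {
      status                    = "ENABLED"
      maximum_scaling_step_size = 10  # add at most 10 instances per scaling step
      target_capacity           = 100 # aim for 100% utilization of the ASG
    }
    managed_termination_protection = "DISABLED"
  }
}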

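resource "aws_ecs_capacity_provider" "spot_managed" {
  name = "spot-managed"
  auto_scaling_group_provider {
    auto_scaling_group_arn = aws_autoscaling_group.spot_asg.arn
    managed_scaling {
      status                    = "ENABLED"
      maximum_scaling_step_size = 5  # add at most 5 instances per scaling step
      target_capacity           = 50 # keep roughly half of the ASG as headroom
    }
    # Requires managed scaling (above) and protect_from_scale_in = true on the ASG.
    managed_termination_protection = "ENABLED"
  }
}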

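resource "aws_ecs_cluster_capacity_providers" "my_cluster_providers" {
  cluster_name = aws_ecs_cluster.my_cluster.name

  # Every provider that a service may reference must be associated with the cluster.
  capacity_providers = [
    aws_ecs_capacity_provider.fargate_on_demand.name,
    aws_ecs_capacity_provider.spot_managed.name,
  ]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.fargate_on_demand.name
    weight            = 1
    base              = 1
  }
  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.spot_managed.name
    weight            = 0
    base              = 0
  }
}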

resource "aws_autoscaling_group" "fargate_asg" {
  # ... details for the On-Demand EC2 ASG (Fargate itself has no ASG) ...
}

resource "aws_autoscaling_group" "spot_asg" {
  # ... details for Spot ASG ...
}

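In this example, fargate_on_demand is configured with managed_termination_protection = "DISABLED", meaning its underlying Auto Scaling group (ASG) can terminate any instance when it scales in, including instances that are still running tasks. The spot_managed provider, however, has managed_termination_protection = "ENABLED", so ECS sets scale-in protection on instances that are running tasks and the ASG only removes idle instances. This setting requires managed scaling on the capacity provider and scale-in protection on the ASG, and it does not prevent Spot interruptions. Note that, despite its name, fargate_on_demand is backed by an EC2 ASG: Fargate capacity comes from the predefined FARGATE and FARGATE_SPOT providers, which have no ASG.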
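The ASG side of managed termination protection can be sketched as follows. This is a minimal, hypothetical example rather than the configuration elided above: the resource label spot_asg_example, the variables subnet_ids and launch_template_id, and the size limits are all placeholders. The parts that matter are protect_from_scale_in, the AmazonECSManaged tag, and ignoring desired_capacity so that Terraform does not fight ECS managed scaling.

```hcl
# Hypothetical inputs; substitute your own networking and launch template.
variable "subnet_ids" {
  type = list(string)
}

variable "launch_template_id" {
  type = string
}

resource "aws_autoscaling_group" "spot_asg_example" {
  name                = "spot-asg-example" # placeholder name
  min_size            = 0
  max_size            = 10 # placeholder ceiling
  vpc_zone_identifier = var.subnet_ids

  # Needed when the capacity provider sets managed_termination_protection = "ENABLED".
  protect_from_scale_in = true

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 0
      on_demand_percentage_above_base_capacity = 0 # everything above the base is Spot
    }
    launch_template {
      launch_template_specification {
        launch_template_id = var.launch_template_id
        version            = "$Latest"
      }
    }
  }

  # Tag expected on ASGs that back an ECS capacity provider.
  tag {
    key                 = "AmazonECSManaged"
    value               = "true"
    propagate_at_launch = true
  }

  # ECS managed scaling changes desired_capacity; let it own that value.
  lifecycle {
    ignore_changes = [desired_capacity]
  }
}
```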

The aws_ecs_cluster_capacity_providers resource ties these together for my-special-cluster. The default_capacity_provider_strategy block defines the strategy ECS applies to any service or standalone task that specifies neither a launch type nor a strategy of its own. Here, fargate_on_demand has a weight of 1 and a base of 1. The base tells ECS to place at least 1 task on fargate_on_demand before weights are considered; it counts tasks, not instances. Because fargate_on_demand is also the only provider with a nonzero weight, every further task lands on it as well.

The second strategy entry, for spot_managed, has a weight of 0 and a base of 0. This means the default strategy never places tasks on this provider.

Now, how do you make ECS use the spot_managed provider for your batch jobs? You specify it directly in your ECS Service definition.

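resource "aws_ecs_service" "web_service" {
  name            = "web-service" # placeholder; a service name is required
  cluster         = aws_ecs_cluster.my_cluster.name
  task_definition = aws_ecs_task_definition.my_task.arn
  desired_count   = 2
  # launch_type is omitted: it cannot be combined with capacity_provider_strategy.
  capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.fargate_on_demand.name
    weight            = 1
    base              = 1
  }
  # Create the service only after the providers are attached to the cluster.
  depends_on = [aws_ecs_cluster_capacity_providers.my_cluster_providers]
}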

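resource "aws_ecs_service" "batch_job_service" {
  name            = "batch-job-service" # placeholder; a service name is required
  cluster         = aws_ecs_cluster.my_cluster.name
  task_definition = aws_ecs_task_definition.batch_task.arn
  desired_count   = 0 # Start with 0, scale up based on queue
  # launch_type is omitted: it cannot be combined with capacity_provider_strategy.
  capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.spot_managed.name
    weight            = 1
    base              = 0
  }
  # Create the service only after the providers are attached to the cluster.
  depends_on = [aws_ecs_cluster_capacity_providers.my_cluster_providers]
}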

For web_service, we explicitly tell it to use fargate_on_demand with a weight of 1 and a base of 1, matching our cluster’s default. For batch_job_service, we override the cluster default and specify spot_managed with a weight of 1 and a base of 0. This means batch tasks are placed only on instances in the spot_managed ASG, and with managed scaling enabled ECS scales that ASG out as batch tasks need room. If the spot_managed ASG can’t launch instances (e.g., no Spot capacity is available), ECS does not fall back to another provider, and those batch tasks wait until capacity appears. Note that a service that sets capacity_provider_strategy must not also set launch_type; the two are mutually exclusive.

The key to this flexibility lies in the weight and base parameters within the capacity_provider_strategy blocks. The base is the minimum number of tasks (not instances) that ECS runs on that provider before weights are considered, and only one provider in a strategy can have a base defined. The weight determines the relative share of the remaining tasks placed on each provider once the base is met. A provider with a weight of 0 receives no tasks beyond its base; ECS does not treat it as a fallback when the other providers run out of capacity.
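To make the arithmetic concrete, here is a hedged sketch of a service that mixes both providers. The service mixed_example and its numbers are hypothetical and not part of the setup above. With desired_count = 10, a base of 2, and weights of 1 and 3, ECS should aim for roughly 4 tasks on the On-Demand provider (2 from the base plus a quarter of the remaining 8) and 6 on Spot.

```hcl
# Hypothetical service that splits tasks between the two providers.
resource "aws_ecs_service" "mixed_example" {
  name            = "mixed-example" # placeholder name
  cluster         = aws_ecs_cluster.my_cluster.name
  task_definition = aws_ecs_task_definition.my_task.arn
  desired_count   = 10

  # The first 2 tasks always go to the On-Demand provider (base = 2).
  # It then receives 1/(1+3) of the remaining 8 tasks, i.e. 2 more.
  capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.fargate_on_demand.name
    base              = 2
    weight            = 1
  }

  # Spot receives 3/(1+3) of the remaining 8 tasks, i.e. 6.
  # Only one provider in a strategy may define a base, so it is left unset here.
  capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.spot_managed.name
    weight            = 3
  }
}
```

The base protects a floor of tasks from Spot interruptions, while the weights decide how much of the remainder is exposed to cheaper, interruptible capacity.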

When ECS needs to launch a task for a service that specifies neither a launch type nor a strategy of its own, it uses the default_capacity_provider_strategy of the cluster. It first satisfies the base on the one provider that defines it. Then, for any remaining tasks, it distributes them across providers based on their weight relative to the sum of all weights. If a service defines its own capacity_provider_strategy, that replaces the cluster’s default for that specific service.
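For contrast, a service that should inherit the cluster default simply leaves both settings out. The sketch below uses a hypothetical service named default_strategy_service; with the cluster configuration shown earlier, its tasks would be placed on fargate_on_demand.

```hcl
# Hypothetical service that relies on the cluster's default strategy.
resource "aws_ecs_service" "default_strategy_service" {
  name            = "default-strategy-service" # placeholder name
  cluster         = aws_ecs_cluster.my_cluster.name
  task_definition = aws_ecs_task_definition.my_task.arn
  desired_count   = 2

  # No launch_type and no capacity_provider_strategy:
  # ECS applies the cluster's default_capacity_provider_strategy.
}
```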

One subtle but critical point is how managed_termination_protection on the auto_scaling_group_provider interacts with your service’s desired_count and ECS’s scaling decisions. When managed_termination_protection is enabled for a capacity provider, ECS keeps scale-in protection on every instance in the associated ASG that is running at least one non-daemon task, so the ASG cannot terminate those instances even when its desired capacity drops. This keeps the tasks of your high-priority services from being killed by scale-in, but it can also leave the ASG with more instances than your service’s desired_count strictly requires: a few tasks scattered across many instances keep all of them protected, effectively "reserving" capacity.

The next concept you’ll grapple with is how to integrate custom scaling logic with these capacity providers, especially when dealing with external event sources like SQS queues.
