Even with the deployment circuit breaker configured, an ECS deployment can fail to roll back automatically, leaving you with an unhealthy service.

Common Causes and Fixes for Auto-Rollback Failures:

  1. minimumHealthyPercent Too Low:

    • Diagnosis: Check the deployment configuration of your ECS service for minimumHealthyPercent.
      aws ecs describe-services --cluster my-cluster --services my-service --query 'services[0].deploymentConfiguration.minimumHealthyPercent'
      
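    • Fix: Raise minimumHealthyPercent to at least 50 through the service’s deployment configuration. For example, to set it to 100 (the shorthand also restates the circuit breaker and maximumPercent settings so that every deployment setting is explicit; adjust them to match your service):
      aws ecs update-service --cluster my-cluster --service my-service --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true},minimumHealthyPercent=100,maximumPercent=200"
      
    • Why it works: minimumHealthyPercent is the lower bound on the number of tasks, as a percentage of the desired count, that ECS must keep running and healthy during a deployment. If it is too low (e.g., 0), ECS may stop every old task before any new task has proven healthy, so the service can be fully down while the circuit breaker is still counting failed launches. Setting it to 100 means ECS stops an old task only after a replacement is healthy, so the previous version keeps serving until the deployment either completes or is rolled back.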
  2. maximumPercent Too High:

    • Diagnosis: Check the deployment configuration of your ECS service for maximumPercent.
      aws ecs describe-services --cluster my-cluster --services my-service --query 'services[0].deploymentConfiguration.maximumPercent'
      
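    • Fix: Lower maximumPercent to a moderate value such as 200, so that a deployment can at most double the number of running tasks. It must stay above minimumHealthyPercent for a rolling update to make progress. The shorthand below also restates the other deployment settings; adjust them to match your service.
      aws ecs update-service --cluster my-cluster --service my-service --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true},minimumHealthyPercent=100,maximumPercent=200"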
      
    • Why it works: maximumPercent defines the upper limit of tasks that can run concurrently during a deployment. If it’s set extremely high, ECS might launch many new, unhealthy tasks without terminating old ones, leading to a prolonged unhealthy state that the circuit breaker doesn’t flag as a failure in time for rollback.
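
To make the two percentages from items 1 and 2 concrete, the sketch below computes the task-count bounds they imply for a given desired count. It assumes the rounding described in the ECS documentation (minimumHealthyPercent rounds up, maximumPercent rounds down). The function names are illustrative and are not part of any AWS tool.

```shell
# Sketch: task-count bounds implied by the deployment configuration.
# Assumption: minimumHealthyPercent rounds up, maximumPercent rounds down.

# Usage: min_healthy_tasks DESIRED_COUNT MINIMUM_HEALTHY_PERCENT
min_healthy_tasks() {
  # Integer ceiling of desired * percent / 100
  echo $(( ($1 * $2 + 99) / 100 ))
}

# Usage: max_running_tasks DESIRED_COUNT MAXIMUM_PERCENT
max_running_tasks() {
  # Integer floor of desired * percent / 100
  echo $(( $1 * $2 / 100 ))
}

min_healthy_tasks 4 100   # prints 4: no old task stops before a replacement is healthy
min_healthy_tasks 3 50    # prints 2: 1.5 rounded up
min_healthy_tasks 4 0     # prints 0: every old task may be stopped at once
max_running_tasks 4 200   # prints 8: up to 4 new tasks alongside the 4 old ones
```

With a desired count of 4, a minimumHealthyPercent of 0 lets the healthy-task floor drop to zero, which is the situation item 1 warns about.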
  3. Task Definition stopTimeout Too Short:

    • Diagnosis: Examine the stopTimeout parameter of each container definition in your task definition.
      aws ecs describe-task-definition --task-definition my-task-definition --query 'taskDefinition.containerDefinitions[].{name:name,stopTimeout:stopTimeout}'
      
      (Note: stopTimeout is a container-level parameter that is set when the task definition is registered. If it was never set, the query returns null and the default applies, which is normally 30 seconds. It is not part of the ECS service definition.)
    • Fix: Set stopTimeout to at least 30 seconds, or ideally 60 seconds, in the container definition of your task definition JSON, then register a new revision. On Fargate the maximum is 120 seconds.
      Example fragment of a task definition JSON (stopTimeout is in seconds; other parameters omitted):
      "containerDefinitions": [
          {
              "name": "my-app",
              "image": "my-repo/my-app:latest",
              "stopTimeout": 60
          }
      ]
      
    • Why it works: When ECS stops a task, it sends a SIGTERM signal to the containers. The stopTimeout is the grace period a container has to shut down cleanly. If the container doesn’t exit within this time, ECS sends SIGKILL. A short stopTimeout can kill containers in the middle of their shutdown, so tasks exit abnormally during the deployment and it becomes harder to tell which task stops are genuine failures.
  4. Health Check Configuration Issues:

    • Diagnosis: Verify your container health check configuration within the task definition and your load balancer (if applicable).
      aws ecs describe-task-definition --task-definition my-task-definition --query 'taskDefinition.containerDefinitions[?name==`my-app`].healthCheck'
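      aws elbv2 describe-target-groups --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-target-group/abcdef1234567890 --query 'TargetGroups[0].{Path:HealthCheckPath,Interval:HealthCheckIntervalSeconds,Timeout:HealthCheckTimeoutSeconds,HealthyThreshold:HealthyThresholdCount,UnhealthyThreshold:UnhealthyThresholdCount}'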
      
    • Fix: Ensure your health check path is correct, the interval is reasonable (e.g., 30 seconds), the timeout is sufficient (e.g., 5 seconds), and the unhealthy threshold is low enough (e.g., 2).
      Example container health check in the task definition (assumes curl is installed in the image):
      "healthCheck": {
          "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
          "interval": 30,
          "timeout": 5,
          "retries": 2
      }
      
    • Why it works: If health checks are misconfigured, ECS might treat unhealthy tasks as healthy, or healthy tasks as unhealthy. A check that passes while the application is broken gives the circuit breaker no failures to count, so no rollback is triggered. A check that can never pass keeps new tasks from becoming healthy and stalls the deployment.
  5. Load Balancer Target Group Unhealthy Threshold Too High:

    • Diagnosis: Check the UnhealthyThresholdCount for your load balancer target group.
      aws elbv2 describe-target-groups --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-target-group/abcdef1234567890 --query 'TargetGroups[0].UnhealthyThresholdCount'
      
    • Fix: Lower the UnhealthyThresholdCount to a sensible value, like 2 or 3. Health check settings are changed with modify-target-group, not modify-target-group-attributes.
      aws elbv2 modify-target-group --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-target-group/abcdef1234567890 --unhealthy-threshold-count 2
      
    • Why it works: The circuit breaker relies on the health status reported by the load balancer (if used). If the unhealthy threshold is too high, a task can remain registered as healthy in the target group for a prolonged period even if it’s actually failing health checks intermittently. This masks the problem from ECS’s rollback mechanism.
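
The effect of the thresholds in items 4 and 5 can be estimated with simple arithmetic. The sketch below is an approximation that assumes a failing task is marked unhealthy after a run of consecutive failed checks spaced one interval apart, and it ignores the per-check timeout. The function name is illustrative.

```shell
# Sketch: rough time before a failing task or target is marked unhealthy.
# Assumption: detection takes about INTERVAL * THRESHOLD seconds.

# Usage: seconds_until_unhealthy INTERVAL_SECONDS UNHEALTHY_THRESHOLD
seconds_until_unhealthy() {
  echo $(( $1 * $2 ))
}

seconds_until_unhealthy 30 2    # prints 60
seconds_until_unhealthy 30 10   # prints 300
```

With a 30-second interval, raising the threshold from 2 to 10 stretches detection from about one minute to about five, during which the failing task still counts as healthy.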
  6. ECS Agent Issues (Less Common):

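    • Diagnosis: On EC2 container instances, check the ECS agent logs. On Fargate, AWS manages the agent and its logs are not exposed, so review the service events instead.
      # For EC2 instances (the agent also writes to /var/log/ecs/ecs-agent.log):
      sudo journalctl -u ecs -f
      # For Fargate, review recent service events:
      aws ecs describe-services --cluster my-cluster --services my-service --query 'services[0].events[:10]'
      
    • Fix: On EC2, ensure your ECS agent is updated to the latest version. If issues persist, consider restarting the agent or the EC2 instance.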
    • Why it works: An outdated or buggy ECS agent might not correctly report task status or communicate with the ECS control plane, leading to misinterpretations of deployment health and failed rollbacks.
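
Whichever cause applies, a rollback starts only once failed task launches reach the circuit breaker’s failure threshold. The sketch below estimates that threshold, assuming the formula given in the ECS documentation: half the desired count, rounded up, with a minimum of 3 and a maximum of 200. Treat it as a planning aid rather than a definitive statement of ECS behavior; the function name is illustrative.

```shell
# Sketch: estimated number of failed task launches before the circuit
# breaker marks a deployment as failed.
# Assumption: threshold = ceil(0.5 * desired count), clamped to [3, 200].

# Usage: circuit_breaker_threshold DESIRED_COUNT
circuit_breaker_threshold() {
  threshold=$(( ($1 + 1) / 2 ))   # integer ceiling of desired / 2
  if [ "$threshold" -lt 3 ]; then
    threshold=3
  fi
  if [ "$threshold" -gt 200 ]; then
    threshold=200
  fi
  echo "$threshold"
}

circuit_breaker_threshold 1     # prints 3
circuit_breaker_threshold 25    # prints 13
circuit_breaker_threshold 800   # prints 200
```

For large services the threshold is high, so slow-failing tasks (for example, ones that take minutes to be marked unhealthy) can delay a rollback considerably.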

The next error you’ll likely encounter after fixing rollback issues is a deployment timeout if the underlying application problem isn’t resolved.
