ECS deployments can fail to roll back automatically even when a deployment circuit breaker is configured, leaving you with an unhealthy service.
Common Causes and Fixes for Auto-Rollback Failures:
- `minimumHealthyPercent` Too Low:
  - Diagnosis: Check the deployment configuration of your ECS service for `minimumHealthyPercent`.

        aws ecs describe-services --cluster my-cluster --service my-service --query 'services[0].deploymentConfiguration.minimumHealthyPercent'

  - Fix: Increase `minimumHealthyPercent` to at least 50. For example, to set it to 100% while keeping the circuit breaker and rollback enabled:

        aws ecs update-service --cluster my-cluster --service my-service --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true},minimumHealthyPercent=100,maximumPercent=200"

  - Why it works: `minimumHealthyPercent` is the lower bound on the number of tasks that must stay running during a deployment. If it is too low (e.g., 0), ECS may stop every old task before any new task has proven healthy, so the service can sit with no healthy tasks while the circuit breaker is still counting failed launches. Setting it to 100 keeps all previous tasks running until their replacements are healthy, so a failed deployment falls back to capacity that never went away.
- `maximumPercent` Too High:
  - Diagnosis: Check the deployment configuration of your ECS service for `maximumPercent`.

        aws ecs describe-services --cluster my-cluster --service my-service --query 'services[0].deploymentConfiguration.maximumPercent'

  - Fix: Lower `maximumPercent` to a moderate value such as 200, so a deployment can at most double the task count (substitute your own `minimumHealthyPercent` in the command).

        aws ecs update-service --cluster my-cluster --service my-service --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true},minimumHealthyPercent=100,maximumPercent=200"

  - Why it works: `maximumPercent` defines the upper limit of tasks that can run concurrently during a deployment. If it’s set extremely high, ECS might launch many new, unhealthy tasks without terminating old ones, leading to a prolonged unhealthy state that the circuit breaker doesn’t flag as a failure in time for rollback.
- Task Definition `stopTimeout` Too Short:
  - Diagnosis: Examine the `stopTimeout` parameter of each container in your task definition.

        aws ecs describe-task-definition --task-definition my-task-definition --query 'taskDefinition.containerDefinitions[].{name:name,stopTimeout:stopTimeout}'

    (Note: `stopTimeout` is a per-container parameter set when the task definition is registered; it is not part of the service definition. If it was never set, the query returns `null` and the default of 30 seconds applies. On EC2 that default comes from the agent’s `ECS_CONTAINER_STOP_TIMEOUT` setting.)
  - Fix: Set `stopTimeout` to at least 30 seconds, or ideally 60 seconds, in the container definition of your task definition JSON, then register a new revision. The value is in seconds (Fargate allows up to 120); other parameters are omitted here.

        "containerDefinitions": [
          {
            "name": "my-app",
            "image": "my-repo/my-app:latest",
            "stopTimeout": 60
          }
        ]

  - Why it works: When ECS stops a task, it sends a `SIGTERM` signal to the containers. The `stopTimeout` is the grace period a container has to shut down cleanly. If the container doesn’t exit within this time, ECS forcefully kills it. A short `stopTimeout` can lead to tasks being killed before they finish in-flight work or deregister cleanly, which muddies the task status the deployment relies on.
- Health Check Configuration Issues:
  - Diagnosis: Verify your container health check configuration within the task definition and your load balancer (if applicable).

        aws ecs describe-task-definition --task-definition my-task-definition --query 'taskDefinition.containerDefinitions[?name==`my-app`].healthCheck'
        aws elbv2 describe-target-groups --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-target-group/abcdef1234567890 --query 'TargetGroups[0].{Path:HealthCheckPath,Interval:HealthCheckIntervalSeconds,Timeout:HealthCheckTimeoutSeconds,Healthy:HealthyThresholdCount,Unhealthy:UnhealthyThresholdCount}'

  - Fix: Ensure your health check path is correct, the interval is reasonable (e.g., 30 seconds), the timeout is sufficient (e.g., 5 seconds), and the unhealthy threshold is low enough (e.g., 2). An example container health check in the task definition:

        "healthCheck": {
          "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
          "interval": 30,
          "timeout": 5,
          "retries": 2
        }

  - Why it works: If health checks are misconfigured, ECS might incorrectly believe tasks are healthy when they are not, or vice versa. A check that passes on a broken task hides the failure from the circuit breaker, so no rollback is triggered; a check that never passes keeps the deployment from progressing at all.
- Load Balancer Target Group Unhealthy Threshold Too High:
  - Diagnosis: Check the `UnhealthyThresholdCount` for your load balancer target group.

        aws elbv2 describe-target-groups --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-target-group/abcdef1234567890 --query 'TargetGroups[0].UnhealthyThresholdCount'

  - Fix: Lower the `UnhealthyThresholdCount` to a sensible value, like 2 or 3.

        aws elbv2 modify-target-group --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-target-group/abcdef1234567890 --unhealthy-threshold-count 2

  - Why it works: When a load balancer is attached, ECS takes the target health it reports into account. If the unhealthy threshold is too high, a task can remain registered as healthy in the target group for a prolonged period even if it’s actually failing health checks intermittently. This masks the problem from ECS’s rollback mechanism.
- ECS Agent Issues (Less Common):
  - Diagnosis: Check the ECS agent logs on your EC2 instances. On Fargate the agent is managed by AWS and its logs are not exposed, so rely on the service events and stopped-task reasons instead.

        # For EC2 instances (the agent also logs to /var/log/ecs/ecs-agent.log):
        sudo journalctl -u ecs -f

  - Fix: Ensure your ECS agent is updated to the latest version. If issues persist, consider restarting the agent or the EC2 instance.
  - Why it works: An outdated or buggy ECS agent might not correctly report task status or communicate with the ECS control plane, leading to misinterpretations of deployment health and failed rollbacks.
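
To make the first two causes concrete, the sketch below works out how many tasks a rolling deployment must keep running and how many it may run at most for a given desired count. `deployment_bounds` is a hypothetical helper written for this article, not an AWS CLI command, and it assumes the documented rounding behavior: `minimumHealthyPercent` rounds up and `maximumPercent` rounds down. Treat it as an illustration and confirm the numbers against your own service.

```shell
#!/bin/sh
# Sketch: task-count bounds for an ECS rolling deployment.
# deployment_bounds is a hypothetical helper, not part of the AWS CLI.
# Usage: deployment_bounds DESIRED_COUNT MIN_HEALTHY_PERCENT MAX_PERCENT
deployment_bounds() {
  desired=$1
  min_pct=$2
  max_pct=$3
  # Lower bound: minimumHealthyPercent of the desired count, rounded up.
  min_running=$(( (desired * min_pct + 99) / 100 ))
  # Upper bound: maximumPercent of the desired count, rounded down.
  max_running=$(( desired * max_pct / 100 ))
  echo "$min_running $max_running"
}

deployment_bounds 4 0 200     # prints "0 8": every old task may be stopped up front
deployment_bounds 4 50 200    # prints "2 8": at least two tasks stay running
deployment_bounds 3 100 150   # prints "3 4": only one extra task at a time
```

A lower bound of 0 is the case to watch for: nothing forces ECS to keep a healthy task around while the new revision is still failing.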
The next error you’ll likely encounter after fixing rollback issues is a deployment timeout if the underlying application problem isn’t resolved.