ECS container health checks are actually designed to prevent new work from being sent to a failing container, not to immediately kill it.
Let’s see this in action. Imagine we have a simple web service running in ECS.
{
  "family": "my-web-service",
  "networkMode": "awsvpc",
  "containerDefinitions": [
    {
      "name": "web-app",
      "image": "nginx:latest",
      "portMappings": [
        {
          "containerPort": 80,
          "protocol": "tcp"
        }
      ],
      "healthCheck": {
        "command": [
          "CMD-SHELL",
          "curl -f http://localhost:80/ || exit 1"
        ],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      }
    }
  ]
}
In this configuration, the healthCheck block tells ECS how to determine if our web-app container is healthy.
command: This is the command ECS runs inside the container. Here, it’s a shell command that tries to curl the local web server on port 80. The -f flag makes curl return a non-zero exit code on HTTP errors (like 404 or 500). The || exit 1 part ensures that if curl fails for any reason, the command exits with status 1, signaling failure.
interval: ECS will run this check every 30 seconds.
timeout: If the command doesn’t complete within 5 seconds, it’s considered a failure.
retries: ECS marks the container as unhealthy once the command has failed 3 times in a row.
startPeriod: This is crucial. For the first 60 seconds after a container starts, failed health checks do not count toward retries. This gives your application time to boot up without being immediately flagged as unhealthy.
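Before registering a task definition, it can help to sanity-check the healthCheck block. The sketch below is a hypothetical helper (validate_health_check is not part of any AWS SDK). The numeric bounds are the ranges I understand the ECS documentation to specify, so treat them as assumptions and verify them against the current API reference.

```python
# Assumed ECS bounds for healthCheck fields (seconds, except retries which is a count).
# These values are taken from memory of the ECS docs; verify before relying on them.
LIMITS = {
    "interval": (5, 300),
    "timeout": (2, 60),
    "retries": (1, 10),
    "startPeriod": (0, 300),
}


def validate_health_check(health_check: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    command = health_check.get("command", [])
    # The command array is expected to start with CMD or CMD-SHELL.
    if not command or command[0] not in ("CMD", "CMD-SHELL"):
        problems.append("command must start with CMD or CMD-SHELL")
    for field, (low, high) in LIMITS.items():
        if field in health_check and not low <= health_check[field] <= high:
            problems.append(f"{field}={health_check[field]} is outside {low}..{high}")
    return problems


example = {
    "command": ["CMD-SHELL", "curl -f http://localhost:80/ || exit 1"],
    "interval": 30,
    "timeout": 5,
    "retries": 3,
    "startPeriod": 60,
}
print(validate_health_check(example))
```

The example dictionary mirrors the healthCheck block from the task definition above, so the helper reports no problems for it.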
When ECS starts a task, it launches the container and begins executing the healthCheck command at the specified interval (30 seconds). Checks also run during the startPeriod (60 seconds in our example), but failures in that window do not count toward retries, while a passing check marks the container HEALTHY right away.
After the grace period, each non-zero exit code increments a failure count for that container. Once the container has failed retries checks in a row (3 in our example), ECS marks it as UNHEALTHY.
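The counting logic can be sketched as a small state machine. This is a simplified model, not ECS source code. It assumes Docker-style semantics: the first check runs one interval after the container starts, failures during the start period are not counted until a check has passed, and any passing check resets the failure counter. The function name simulate_health is made up for illustration.

```python
def simulate_health(results, interval=30, start_period=60, retries=3):
    """Replay check results (True = exit code 0) and return (elapsed, status) per check."""
    status = "UNKNOWN"  # status reported before any check has been decisive
    consecutive_failures = 0
    timeline = []
    for index, passed in enumerate(results):
        elapsed = (index + 1) * interval  # assumed: first check runs after one interval
        if passed:
            consecutive_failures = 0
            status = "HEALTHY"
        elif elapsed <= start_period and status != "HEALTHY":
            # Grace period: the container is still booting, so this failure is not counted.
            pass
        else:
            consecutive_failures += 1
            if consecutive_failures >= retries:
                status = "UNHEALTHY"
        timeline.append((elapsed, status))
    return timeline


# Two failures while booting, one pass, three failures in a row, then a recovery.
checks = [False, False, True, False, False, False, True]
for elapsed, status in simulate_health(checks):
    print(f"t={elapsed:>3}s {status}")
```

In this run the two early failures are absorbed by the start period, the container turns UNHEALTHY only at the third consecutive failure (t=180s), and the final passing check brings it back to HEALTHY.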
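What happens next is the important part. ECS does not restart an UNHEALTHY container in place. The health status is a signal that feeds routing and scheduling decisions, and a task whose essential container is unhealthy is itself reported as unhealthy. If you’re using service discovery, the unhealthy task’s entry stops being returned to clients. If you’re using a load balancer, keep in mind that the target group also runs its own health checks, and those decide when the target is taken out of rotation. The container process keeps running until the task itself is stopped or replaced: by the service scheduler replacing an unhealthy task, by a desired count change, by a deployment update, or by the underlying EC2 instance becoming unhealthy. A standalone task that is not part of a service continues its lifecycle regardless of its health status.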
The problem ECS doesn’t solve out-of-the-box is automatically restarting a container that becomes unhealthy within the same task. If you want that behavior, you’d typically configure your task definition to restart the container if it exits with a non-zero status, or rely on ECS’s task replacement mechanisms.
This distinction is key: health checks are for traffic routing, not immediate container termination.
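To see what ECS currently reports, you can inspect the healthStatus fields in a DescribeTasks response. The sketch below only parses a response-shaped dictionary, so it runs without AWS credentials. The helper name unhealthy_containers and the sample ARN are made up for illustration.

```python
def unhealthy_containers(describe_tasks_response: dict) -> list[tuple[str, str]]:
    """Return (task ARN, container name) pairs whose healthStatus is UNHEALTHY."""
    found = []
    for task in describe_tasks_response.get("tasks", []):
        for container in task.get("containers", []):
            if container.get("healthStatus") == "UNHEALTHY":
                found.append((task["taskArn"], container["name"]))
    return found


# Hand-written sample shaped like a DescribeTasks response; all values are invented.
sample = {
    "tasks": [
        {
            "taskArn": "arn:aws:ecs:us-east-1:123456789012:task/demo/abc123",
            "healthStatus": "UNHEALTHY",
            "containers": [
                {"name": "web-app", "healthStatus": "UNHEALTHY"},
                {"name": "sidecar", "healthStatus": "HEALTHY"},
            ],
        }
    ]
}
print(unhealthy_containers(sample))
```

Against a real cluster, the input would come from something like boto3.client("ecs").describe_tasks(cluster="my-cluster", tasks=[task_arn]), where the cluster name and task ARN are placeholders you would supply.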
Consider a scenario where your application starts fine, passes its initial health checks, but then a background process within the container crashes, causing subsequent requests to fail. The healthCheck command, which likely targets a web endpoint, will now start returning errors. ECS will detect this, stop sending traffic to the UNHEALTHY container, and prevent new users from hitting a broken instance. However, the container process itself is still running, consuming resources, until the task is eventually replaced.
If your healthCheck command is failing intermittently, and you’re seeing the container marked UNHEALTHY and then HEALTHY again on a later check, you might need to increase retries or make the interval and timeout more lenient, depending on how quickly your application recovers. For instance, if your service takes 10 seconds to recover from a brief internal hiccup, an aggressive setting such as interval: 5 and retries: 2 can mark it unhealthy even though it heals itself, whereas interval: 30 and retries: 3 only trips after roughly a minute of sustained failure.
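A quick back-of-the-envelope calculation shows whether a transient outage can trip the check at all. The sketch assumes checks start exactly one interval apart, that each check is effectively instantaneous (timeout is ignored), and that retries counts consecutive failures. Both function names are made up for illustration.

```python
import math


def max_failed_checks(outage_seconds: float, interval: int) -> int:
    """Upper bound on how many checks can start during an outage of this length."""
    return math.floor(outage_seconds / interval) + 1


def can_trip(outage_seconds: float, interval: int, retries: int) -> bool:
    """True if an outage this long could produce `retries` consecutive failures."""
    return max_failed_checks(outage_seconds, interval) >= retries


# A 10-second hiccup against the configuration from the task definition above.
print(can_trip(10, interval=30, retries=3))
# The same hiccup against a much more aggressive configuration.
print(can_trip(10, interval=5, retries=2))
```

Under these assumptions, a 10-second outage can fail at most one check when interval is 30, so it cannot reach three consecutive failures. With interval 5 it can fail up to three checks, which is enough to trip retries: 2.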
The most common mistake is to assume that setting retries: 3 means the container will be killed after three failed checks. What it actually controls is the health status: ECS marks the container UNHEALTHY once retries consecutive checks have failed, and failures during the startPeriod are not counted. A single passing check resets the failure count. This means a container can move between UNHEALTHY and HEALTHY several times within a short period if the failures are not sustained.
This behavior is often misunderstood when debugging. You might see a container flip between HEALTHY and UNHEALTHY rapidly, leading you to believe the health check itself is flaky, when in fact the application is exhibiting transient issues that ECS is correctly identifying for traffic routing purposes.
The next thing you’ll likely encounter is how to integrate this with actual task replacement strategies, such as using ECS Deployments to automatically replace unhealthy tasks.