ECS task failures are almost always caused by the task’s container failing to start or crashing shortly after.
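Here’s how to dig into why a container might be failing, using the `stoppedReason` and `stopCode` fields that the ECS API reports for a stopped task, together with each container’s `exitCode` and `reason`. You can see these in the AWS console on the stopped task’s details page (not on the task definition), or directly via the AWS CLI with `aws ecs describe-tasks`.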
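As a rough sketch of how these fields can be pulled out programmatically, the helper below walks a dictionary shaped like one entry of the `tasks` list in a DescribeTasks response. The function name `summarize_stopped_task` and the sample values are illustrative assumptions, not output captured from a real cluster.

```python
from typing import Any


def summarize_stopped_task(task: dict[str, Any]) -> dict[str, Any]:
    """Collect the fields that usually explain why a task stopped.

    `task` is assumed to look like one element of the "tasks" list in a
    DescribeTasks response.
    """
    containers = [
        {
            "name": c.get("name"),
            # exitCode can be absent if the container never started.
            "exitCode": c.get("exitCode"),
            "reason": c.get("reason"),
        }
        for c in task.get("containers", [])
    ]
    return {
        "stopCode": task.get("stopCode"),
        "stoppedReason": task.get("stoppedReason"),
        "containers": containers,
    }


# Hypothetical sample data, hand-written for illustration.
sample_task = {
    "stopCode": "EssentialContainerExited",
    "stoppedReason": "Essential container in task exited",
    "containers": [
        {
            "name": "web",
            "exitCode": 137,
            "reason": "OutOfMemoryError: Container killed due to memory usage",
        }
    ],
}

summary = summarize_stopped_task(sample_task)
```

With boto3 installed and credentials configured, the input would typically come from `boto3.client("ecs").describe_tasks(cluster=..., tasks=[...])["tasks"]`.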
Common Causes for Task Failures
The most frequent culprits boil down to resource constraints, configuration errors, or the application itself misbehaving.
1. Out of Memory (OOM)
This is the big one. Your application tried to use more RAM than ECS allocated to the task, and the Linux kernel stepped in to kill it.
- Diagnosis: Look for a container `reason` or task `stoppedReason` that mentions an out-of-memory condition (ECS typically reports `OutOfMemoryError: Container killed due to memory usage`), or a container exit code of `137`. That code is 128 + 9, meaning the process received `SIGKILL`, which is what the OOM killer delivers.
- Fix: Increase the `memory` limit in your task definition. If your task definition has `memory: 512` (MiB), try `memory: 1024`. If you’re using `memoryReservation`, remember that it is only a soft limit; ensure your container’s actual memory usage doesn’t exceed the hard `memory` limit.
- Why it works: You’re raising the ceiling on how much RAM the container may use, so the kernel no longer has to forcibly terminate it.
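The relationship between exit code 137 and `SIGKILL` follows the common convention that a process killed by signal N reports 128 + N. The sketch below decodes that convention and doubles a container’s hard memory limit; both helper names are hypothetical, and the signal lookup assumes a Unix-like system.

```python
import signal


def describe_exit_code(exit_code: int) -> str:
    """Explain a container exit code using the 128 + signal convention."""
    if exit_code > 128:
        signum = exit_code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    if exit_code == 0:
        return "exited cleanly"
    return f"exited with error code {exit_code}"


def double_memory(container_def: dict) -> dict:
    """Return a copy of a container definition with its hard memory limit doubled."""
    updated = dict(container_def)
    updated["memory"] = container_def["memory"] * 2
    return updated


oom_explanation = describe_exit_code(137)
bumped = double_memory({"name": "web", "memory": 512})
```

Doubling is only a starting point; the container’s observed peak memory usage is a better guide to the final value.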
2. Resource Contention (CPU)
Less common than OOM, but still a killer. Your container is hitting its CPU limits, causing it to become unresponsive or crash.
- Diagnosis: CPU starvation rarely shows up as an explicit `stoppedReason`. More often you’ll see exit code `137` (again `SIGKILL`, though OOM is the more typical trigger) after the container became unresponsive, preceded by sustained high CPU utilization in the task’s CloudWatch metrics.
- Fix: Increase the `cpu` value in your task definition. If your task definition has `cpu: 256` (CPU units, where 1024 units = 1 vCPU), try `cpu: 512`.
- Why it works: You’re allocating more processing power to the container, allowing it to complete its operations without being throttled to death.
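The CPU unit arithmetic is easy to get wrong, so here is a minimal sketch of the conversion described above (1024 units = 1 vCPU) together with the simple doubling step. The helper names are hypothetical.

```python
CPU_UNITS_PER_VCPU = 1024


def cpu_units_to_vcpus(cpu_units: int) -> float:
    """Convert ECS CPU units to a fraction of a vCPU."""
    return cpu_units / CPU_UNITS_PER_VCPU


def next_cpu_step(cpu_units: int) -> int:
    """Double the allocation, e.g. 256 -> 512."""
    return cpu_units * 2


current_vcpus = cpu_units_to_vcpus(256)  # a quarter of a vCPU
proposed_units = next_cpu_step(256)
```

On Fargate only certain CPU and memory combinations are accepted, so check the proposed value against the supported task sizes before registering a new task definition revision.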
3. Application Exit (Non-Zero Exit Code)
Your application itself decided to stop, but not in a clean way. A non-zero exit code signals an error.
- Diagnosis: Look for a `stoppedReason` such as "Essential container in task exited" and a container exit code that is a small integer (e.g., `1`, `2`, `126`, `127`). These are standard Unix exit codes: `126` usually means the command was found but is not executable, and `127` means the command was not found.
- Fix: This is the trickiest because it requires understanding your application’s exit codes.
  - Command not found (`127`): Double-check your `ENTRYPOINT` or `CMD` in the Dockerfile or task definition. Ensure the executable path is correct and the file exists within the container image.
  - Not executable (`126`): Ensure your script has execute permissions (`chmod +x`). This is often forgotten in Dockerfiles.
  - Other codes: Consult your application’s documentation or logs for specific error codes. You might need to add more logging to your application to pinpoint the issue.
- Why it works: You’re fixing the underlying bug in your application or its startup command that causes it to terminate with an error.
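The 126 and 127 codes can be reproduced locally, which helps confirm what a given failure looks like before touching the image. This sketch assumes a Unix-like machine with `sh` on the PATH; the helper name and the temporary script are illustrative only.

```python
import os
import subprocess
import tempfile


def shell_exit_code(command: str) -> int:
    """Run `command` through sh and return its exit code."""
    return subprocess.run(["sh", "-c", command], capture_output=True).returncode


# 127: the shell cannot find the command at all.
not_found = shell_exit_code("/no/such/binary")

# 126: the file exists but lacks execute permission.
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write("#!/bin/sh\necho hello\n")
    script_path = f.name
os.chmod(script_path, 0o644)  # readable and writable, but not executable
not_executable = shell_exit_code(script_path)
os.remove(script_path)
```

Running the same check inside the image (for example with `docker run --entrypoint sh`) shows whether the path and permissions are what the task definition expects.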
4. Docker Daemon Issues (Less Common for Task Failures, More for Container Start)
The Docker daemon on the EC2 instance or Fargate infrastructure might have had an issue starting or managing the container.
- Diagnosis: Look for a `stoppedReason` mentioning the Docker daemon or a failure to create or start the container, or an exit code such as `125`, which Docker uses when the daemon itself fails to run the container. This is rarer for task failures and more common for container creation failures. Check the ECS agent logs on EC2 instances if applicable.
- Fix: For EC2-backed ECS, ensure your EC2 instances are healthy and the ECS agent is running correctly. Restarting the ECS agent or the EC2 instance can resolve transient daemon issues. For Fargate, this is usually a transient infrastructure issue that AWS resolves.
- Why it works: Restarting the problematic Docker daemon or underlying infrastructure allows it to manage containers correctly again.
5. Incorrect Command or Entrypoint
Similar to application exit codes, but specifically when the ENTRYPOINT or CMD in your Dockerfile or task definition is just wrong.
- Diagnosis: A `stoppedReason` such as "Essential container in task exited" with exit code `127` (command not found) or `126` (command not executable).
- Fix: Review the `command` and `entryPoint` fields in the task definition JSON. `entryPoint` overrides the Dockerfile’s `ENTRYPOINT`, and `command` overrides `CMD`. For example, if your Dockerfile’s `ENTRYPOINT ["/app/run.sh"]` is correct but the task definition also sets `command: ["/app/run.sh"]`, the script receives its own path as an argument, and if `/app/run.sh` is not executable the container fails with `126` either way. Ensure the path is correct and the file has execute permissions within the container image.
- Why it works: The task definition’s command now correctly points to an executable script or binary within the container.
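To reason about what actually gets executed, it can help to model how the image defaults and the task definition overrides combine. The sketch below is a simplified approximation of Docker’s behaviour under two assumptions: an `entryPoint` override replaces the image `ENTRYPOINT` and discards the image `CMD`, and a `command` override replaces the image `CMD`. The function name is hypothetical.

```python
from typing import Optional, Sequence


def effective_argv(
    image_entrypoint: Sequence[str],
    image_cmd: Sequence[str],
    task_entry_point: Optional[Sequence[str]] = None,
    task_command: Optional[Sequence[str]] = None,
) -> list[str]:
    """Approximate the argv built from image defaults plus task overrides."""
    if task_entry_point:
        # Overriding the entrypoint drops the image's default CMD.
        entrypoint = list(task_entry_point)
        cmd = list(task_command or [])
    else:
        entrypoint = list(image_entrypoint)
        cmd = list(task_command) if task_command else list(image_cmd)
    return entrypoint + cmd


# The situation from the example above: the script is passed to itself.
duplicated = effective_argv(
    image_entrypoint=["/app/run.sh"],
    image_cmd=[],
    task_command=["/app/run.sh"],
)
```

Here `duplicated` is `["/app/run.sh", "/app/run.sh"]`, which makes it obvious that the override belongs in `entryPoint` or should be removed altogether.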
6. Essential Files Missing or Corrupt in Image
If your application relies on specific files or directories that aren’t present or are corrupted in the Docker image, it can fail immediately.
- Diagnosis: A `stoppedReason` such as "Essential container in task exited" with a custom application exit code (e.g., `1`, `2`, `10`). Examine your application’s logs (if you can capture them) or the Dockerfile’s build output for clues.
- Fix: Rebuild your Docker image, ensuring all necessary application binaries, configuration files, and dependencies are correctly copied and installed. Verify file integrity if possible.
- Why it works: The container now has all the necessary components for the application to start and run.
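One way to make this failure mode easier to diagnose is a small preflight check that runs before the application and logs exactly which files are missing. This is a minimal sketch; the paths in `REQUIRED_PATHS` and the choice of exit code 10 are hypothetical and should be replaced with whatever your application needs.

```python
import os
import sys

# Hypothetical paths; replace with the files your application requires.
REQUIRED_PATHS = ["/app/config.yaml", "/app/bin/server"]


def missing_paths(paths: list[str]) -> list[str]:
    """Return the subset of `paths` that does not exist."""
    return [p for p in paths if not os.path.exists(p)]


def preflight(paths: list[str]) -> int:
    """Return 0 if every path exists, otherwise log the gaps and return 10."""
    missing = missing_paths(paths)
    for path in missing:
        print(f"preflight: missing required path {path}", file=sys.stderr)
    return 10 if missing else 0
```

Calling `sys.exit(preflight(REQUIRED_PATHS))` at the very start of the entrypoint, only when the result is non-zero, gives the task a distinctive exit code and a log line naming the missing file.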
7. Health Check Failures
If you’ve configured container health checks, and they fail repeatedly, ECS will stop the task.
- Diagnosis: Look for a `stoppedReason` such as "Task failed container health checks". The container’s own exit code usually reflects how it reacted to being stopped (often `143` or `137`) rather than the value your health check returned. For tasks that belong to a service, the service’s events also record the health check failures.
- Fix: Debug your application’s health check endpoint or command. Ensure it returns a `200 OK` status code (for HTTP checks) or exits with `0` (for command checks) when the application is healthy.
- Why it works: The application now signals its health correctly, allowing ECS to consider it running.
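For command-based checks, the contract is simply "exit 0 when healthy, non-zero otherwise". The sketch below shows one way to wrap an HTTP probe in that contract using only the standard library. The function names, the two-second timeout, and any URL you pass in are assumptions for illustration.

```python
import urllib.error
import urllib.request


def exit_code_for_status(status: int) -> int:
    """Map an HTTP status to a health check exit code: 0 only for 200."""
    return 0 if status == 200 else 1


def check_health(url: str, timeout: float = 2.0) -> int:
    """Probe `url` and return 0 if it answers 200 OK, otherwise 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return exit_code_for_status(response.status)
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP error statuses all count
        # as unhealthy.
        return 1
```

A script that ends with `sys.exit(check_health("http://localhost:8080/health"))` (a hypothetical endpoint) could then be referenced from the container’s `healthCheck` command in the task definition.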
The Next Hurdle: Service Scaling Issues
After fixing task failures, you might find your service isn’t scaling up as expected, or tasks are being replaced too quickly. This often points to issues with the service scheduler and its desired count, or problems with service discovery or load balancer registration.