ECS task failures are almost always caused by the task’s container failing to start or crashing shortly after.

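Here’s how to dig into why a container might be failing, using the fields ECS records for a stopped task: stoppedReason and stopCode on the task, plus exitCode and reason on each container. You can see them in the AWS console on the stopped task’s details page (not on the task definition) or via the AWS CLI with aws ecs describe-tasks. In this article, "StoppedReason" means the task’s stoppedReason, and the numeric "StoppedCode" values (137, 127, and so on) are container exit codes, which the API reports as exitCode.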
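
As a minimal sketch of where these fields live, the function below pulls them out of one task entry shaped like the "tasks" list returned by aws ecs describe-tasks. The sample data is hypothetical and only illustrates the layout; a real response carries many more fields.

```python
from typing import Any


def summarize_stopped_task(task: dict[str, Any]) -> dict[str, Any]:
    """Collect the failure-related fields from one describe-tasks task entry."""
    return {
        "stopCode": task.get("stopCode"),
        "stoppedReason": task.get("stoppedReason"),
        "containers": [
            {
                "name": c.get("name"),
                # exitCode is absent when the container never started.
                "exitCode": c.get("exitCode"),
                "reason": c.get("reason"),
            }
            for c in task.get("containers", [])
        ],
    }


# Hypothetical sample, not taken from a real cluster.
sample_task = {
    "stopCode": "EssentialContainerExited",
    "stoppedReason": "Essential container in task exited",
    "containers": [
        {
            "name": "web",
            "exitCode": 137,
            "reason": "OutOfMemoryError: Container killed due to memory usage",
        }
    ],
}

summary = summarize_stopped_task(sample_task)
print(summary)
```

In practice the task dictionary would come from the CLI’s JSON output or from an SDK call; the parsing stays the same.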

Common Causes for Task Failures

The most frequent culprits boil down to resource constraints, configuration errors, or the application itself misbehaving.

1. Out of Memory (OOM)

This is the big one. Your application tried to use more RAM than ECS allocated to the task, and the Linux kernel stepped in to kill it.

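  • Diagnosis: Look for a container reason beginning with "OutOfMemoryError: Container killed due to memory usage", usually alongside exit code 137 (128 + 9, meaning the process received SIGKILL, which is what the OOM killer sends).
  • Fix: Increase the memory limit in your task definition. If your task definition has memory: 512 (MiB), try memory: 1024. If you’re using memoryReservation (the soft limit), remember that the container is still killed once its actual usage exceeds the hard memory limit. If usage keeps growing no matter how much you allocate, suspect a memory leak rather than an undersized limit.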
  • Why it works: You’re giving the container more physical RAM to work with, preventing the kernel from forcibly terminating it.
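
Exit code 137 on its own only says the process received SIGKILL, so it helps to check the reason text as well. A small sketch, assuming a container entry shaped like the describe-tasks output; the function name and the three labels are made up for illustration:

```python
def classify_kill(container: dict) -> str:
    """Classify one container entry from a describe-tasks response."""
    reason = container.get("reason") or ""
    exit_code = container.get("exitCode")
    if "OutOfMemory" in reason:
        return "oom-confirmed"
    if exit_code == 137:
        # SIGKILL without an OOM reason: possibly a stop timeout or a manual kill.
        return "sigkill-cause-unclear"
    return "not-oom"


# Hypothetical container entries.
oom = {"exitCode": 137, "reason": "OutOfMemoryError: Container killed due to memory usage"}
killed = {"exitCode": 137}
crashed = {"exitCode": 1}

print(classify_kill(oom), classify_kill(killed), classify_kill(crashed))
```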

2. Resource Contention (CPU)

Less common than OOM, but still a killer. A container that hits its CPU limit is throttled rather than killed outright, but it can become so slow that it misses health checks or times out, and then ECS stops it.

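  • Diagnosis: ECS has no dedicated stopped reason for CPU exhaustion. Look instead for sustained high CPU utilization in the task’s CloudWatch metrics before it stops, typically followed by a health check failure or an application timeout. An exit code of 137 here means the container received SIGKILL, for example because it did not shut down within the stop timeout, not that a CPU limit killed it directly.
  • Fix: Increase the cpu limit in your task definition. If your task definition has cpu: 256 (CPU units, where 1024 = 1 vCPU), try cpu: 512. On Fargate, cpu and memory must form a supported combination, so raising one may require raising the other.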
  • Why it works: You’re allocating more processing power to the container, allowing it to complete its operations without being throttled to death.
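
The unit arithmetic is easy to get wrong, so here is a short sketch of the conversion. The only fact it relies on is the 1024 units per vCPU ratio mentioned above; the function name is illustrative.

```python
CPU_UNITS_PER_VCPU = 1024


def cpu_units_to_vcpus(units: int) -> float:
    """Convert ECS CPU units to a vCPU count."""
    return units / CPU_UNITS_PER_VCPU


for units in (256, 512, 1024):
    print(f"cpu: {units} -> {cpu_units_to_vcpus(units):g} vCPU")
```

So moving from cpu: 256 to cpu: 512 doubles the allocation from a quarter of a vCPU to half of one.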

3. Application Exit (Non-Zero Exit Code)

Your application itself decided to stop, but not in a clean way. A non-zero exit code signals an error.

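  • Diagnosis: Look for a stoppedReason such as "Essential container in task exited" and a container exit code that is a small integer (e.g., 1, 2, 126, 127). These follow standard Unix conventions: 126 means the command was found but could not be executed, and 127 means the command was not found. Values above 128 usually mean the process was killed by a signal (exit code minus 128).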
  • Fix: This is the trickiest because it requires understanding your application’s exit codes.
    • Command not found (127): Double-check your ENTRYPOINT or CMD in the Dockerfile or task definition. Ensure the executable path is correct and the file exists within the container image.
    • Not executable (126): Ensure your script has execute permissions (chmod +x). This is often forgotten in Dockerfiles.
    • Other codes: Consult your application’s documentation or logs for specific error codes. You might need to add more logging to your application to pinpoint the issue.
  • Why it works: You’re fixing the underlying bug in your application or its startup command that causes it to terminate with an error.
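
The conventions above can be captured in a small lookup. This is a sketch of the common shell conventions, not an ECS API; applications are free to assign their own meanings to any code.

```python
import signal


def describe_exit_code(code):
    """Translate a container exit code into a rough, conventional meaning."""
    if code is None:
        return "no exit code recorded (the container may never have started)"
    if code == 0:
        return "clean exit"
    if code == 126:
        return "command found but not executable"
    if code == 127:
        return "command not found"
    if 128 < code < 128 + signal.NSIG:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not every number in the range is a named signal on every platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return "application-defined error"


for code in (0, 1, 126, 127, 137, 143):
    print(code, "->", describe_exit_code(code))
```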

4. Docker Daemon Issues (Less Common for Task Failures, More for Container Start)

The Docker daemon on the EC2 instance or Fargate infrastructure might have had an issue starting or managing the container.

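  • Diagnosis: Exit code 125 is what docker run returns when the daemon itself fails, but in ECS this class of problem usually shows up as a task that never started: a stopCode of TaskFailedToStart and a reason containing an error such as CannotStartContainerError, CannotCreateContainerError, or CannotPullContainerError, often with no container exit code at all. Check the ECS agent logs on EC2 instances if applicable.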
  • Fix: For EC2-backed ECS, ensure your EC2 instances are healthy and the ECS agent is running correctly. Restarting the ECS agent or the EC2 instance can resolve transient daemon issues. For Fargate, this is usually a transient infrastructure issue that AWS resolves.
  • Why it works: Restarts the problematic Docker daemon or underlying infrastructure, allowing it to manage containers correctly again.
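
A rough way to triage these start failures is to search the task and container reasons for the error names ECS uses. The error names below are real ECS error types; the hint texts are rules of thumb written for this sketch, not official guidance.

```python
# Hints are informal rules of thumb, not AWS documentation.
START_FAILURE_HINTS = {
    "CannotPullContainerError": "image pull failed: check image URI, registry auth, network path",
    "CannotCreateContainerError": "container creation failed: check host disk space and container config",
    "CannotStartContainerError": "container start failed: check entrypoint, resources, daemon health",
    "ResourceInitializationError": "task setup failed: check secrets, log config, networking",
}


def start_failure_hint(task: dict):
    """Return the first matching hint for a describe-tasks task entry, or None."""
    texts = [task.get("stoppedReason") or ""]
    texts += [c.get("reason") or "" for c in task.get("containers", [])]
    for error_name, hint in START_FAILURE_HINTS.items():
        if any(error_name in text for text in texts):
            return hint
    return None


# Hypothetical task entry.
failed_task = {
    "stopCode": "TaskFailedToStart",
    "stoppedReason": "Task failed to start",
    "containers": [{"name": "web", "reason": "CannotPullContainerError: pull access denied"}],
}
print(start_failure_hint(failed_task))
```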

5. Incorrect Command or Entrypoint

Similar to application exit codes, but specifically when the ENTRYPOINT or CMD in your Dockerfile or task definition is just wrong.

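  • Diagnosis: A stoppedReason such as "Essential container in task exited" with container exit code 127 (command not found) or 126 (command not executable).
  • Fix: Review your command and entryPoint fields in the task definition JSON. entryPoint overrides the image’s ENTRYPOINT and command overrides its CMD; the container runs the entrypoint with the command appended as arguments. For example, if your Dockerfile has ENTRYPOINT ["/app/run.sh"] and the task definition also sets command: ["/app/run.sh"], the container runs /app/run.sh with /app/run.sh as its argument, which is rarely what you intended. And if /app/run.sh is missing or not executable, the container fails with 127 or 126. Ensure the path is correct and the file has execute permissions within the container image.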
  • Why it works: The task definition’s command is now correctly pointing to an executable script or binary within the container.
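
To see what a container will actually run, it can help to model how the image and task definition values combine. The sketch below follows Docker’s rule that the process is the entrypoint followed by the command, and that overriding the entrypoint discards the image’s default CMD; treat it as a simplified model, and the paths as examples only.

```python
def effective_argv(image_entrypoint, image_cmd, task_entrypoint=None, task_command=None):
    """Approximate the argv a container starts with, given image and task values."""
    if task_entrypoint:
        # Overriding the entrypoint drops the image's default CMD.
        entrypoint = list(task_entrypoint)
        cmd = list(task_command or [])
    else:
        entrypoint = list(image_entrypoint or [])
        cmd = list(task_command) if task_command else list(image_cmd or [])
    return entrypoint + cmd


# Image ENTRYPOINT plus a task-level command that repeats it.
doubled = effective_argv(["/app/run.sh"], [], task_command=["/app/run.sh"])
print(doubled)

# Image ENTRYPOINT and CMD with no task-level overrides.
defaults = effective_argv(["/app/run.sh"], ["--port", "8080"])
print(defaults)
```

The first call shows the script being passed to itself as an argument, the mistake described in the Fix above.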

6. Essential Files Missing or Corrupt in Image

If your application relies on specific files or directories that aren’t present or are corrupted in the Docker image, it can fail immediately.

  • Diagnosis: A stoppedReason such as "Essential container in task exited" with an application-defined exit code (e.g., 1, 2, 10). Examine your application’s logs (if you can capture them) and the image build output for clues.
  • Fix: Rebuild your Docker image, ensuring all necessary application binaries, configuration files, and dependencies are correctly copied and installed. Verify file integrity if possible.
  • Why it works: The container now has all the necessary components for the application to start and run.
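
One way to make this failure mode obvious is a preflight check at startup that names the missing files before the application tries to use them. A minimal sketch: the paths and the exit code 78 are hypothetical choices, not something ECS requires.

```python
import os
import sys

# Hypothetical files this application needs at startup.
REQUIRED_PATHS = ["/app/config.yaml", "/app/bin/server"]


def missing_paths(paths):
    """Return the subset of paths that do not exist in the container."""
    return [p for p in paths if not os.path.exists(p)]


def preflight(paths, failure_code=78):
    """Return 0 if every path exists, otherwise log the gaps and return failure_code."""
    missing = missing_paths(paths)
    if missing:
        print("missing required files: " + ", ".join(missing), file=sys.stderr)
        return failure_code
    return 0
```

An entrypoint script could call sys.exit(preflight(REQUIRED_PATHS)) on failure, so the container’s exit code and log line point straight at the missing file.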

7. Health Check Failures

If you’ve configured container health checks, and they fail repeatedly, ECS will stop the task.

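  • Diagnosis: Look for a stoppedReason such as "Task failed container health checks" or, for load balancer checks, "Task failed ELB health checks in (target-group <ARN>)". The container’s exit code here reflects how it shut down when ECS stopped it (commonly 143 for SIGTERM or 137 for SIGKILL), not the result of the health check itself. The container’s healthStatus in the describe-tasks output shows UNHEALTHY, and your application logs in CloudWatch Logs often show why the check failed.
  • Fix: Debug your application’s health check endpoint or command. Ensure it returns a success status code (200 OK by default for load balancer HTTP checks) or exits with 0 (for container command checks) when the application is healthy. If the application is simply slow to start, give it a grace period (startPeriod for container checks, healthCheckGracePeriodSeconds on the service for load balancer checks).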
  • Why it works: The application is now correctly signaling its health, allowing ECS to consider it running.
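
For a container command check, the contract is simply "exit 0 when healthy, non-zero otherwise". A minimal sketch of such a command using only the standard library; the URL, port, and timeout are assumptions for illustration, and the image would need Python installed.

```python
import urllib.error
import urllib.request


def health_exit_code(url: str, timeout: float = 2.0) -> int:
    """Return 0 if the URL answers with a 2xx status, otherwise 1."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 0 if 200 <= resp.status < 300 else 1
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, refused connections, timeouts, and malformed URLs.
        return 1


# Hypothetical endpoint; nothing is expected to listen on port 1 here.
print(health_exit_code("http://127.0.0.1:1/health", timeout=0.5))
```

A script wrapping this would end with sys.exit(health_exit_code(...)) and be referenced from the task definition’s healthCheck command. Keep the timeout shorter than the health check’s own timeout so the command reports failure instead of being cut off.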

The Next Hurdle: Service Scaling Issues

After fixing task failures, you might find your service isn’t scaling up as expected, or tasks are being replaced too quickly. This often points to issues with the service scheduler and its desired count, or problems with service discovery or load balancer registration.
