ECS tasks often fail to start or get OOMKilled because CPU and memory limits are set too low, leading to unpredictable application behavior and service disruptions.
Let’s dive into how to set these limits correctly.
The Problem: Underestimating Resource Needs
The most common pitfall is guessing CPU and memory values from a vague idea of the application’s "size." Tasks sized this way either fail to start, because the container runs out of memory before it finishes initializing, or, more insidiously, get killed later by the Linux kernel’s Out-Of-Memory (OOM) killer when they exceed their allocated memory.
How ECS Manages Resources
ECS uses Linux control groups (cgroups) to enforce CPU and memory limits for your tasks. When you define `cpu` and `memory` in your task definition, ECS translates these into cgroup settings.
- CPU: Specified in CPU units, where 1024 CPU units equal 1 vCPU. A value of 512 means 0.5 vCPU, and 2048 means 2 vCPUs. A task-level `cpu` value is a hard ceiling (and is required on Fargate). A container-level `cpu` value on the EC2 launch type is a relative share: it guarantees a minimum under contention, but the container can use idle host CPU beyond it.
- Memory: Specified in MiB (mebibytes). `memory` is a hard limit: a container that tries to exceed it is killed by the kernel’s OOM killer. The optional container-level `memoryReservation` is a soft limit used mainly for task placement.
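The unit arithmetic is easy to get wrong when reading task definitions, so here is a minimal Python sketch of the conversion. The function names are illustrative only and are not part of any AWS SDK.

```python
CPU_UNITS_PER_VCPU = 1024  # ECS convention: 1024 CPU units = 1 vCPU


def cpu_units_to_vcpus(units: int) -> float:
    """Convert ECS CPU units to vCPUs."""
    return units / CPU_UNITS_PER_VCPU


def vcpus_to_cpu_units(vcpus: float) -> int:
    """Convert vCPUs to ECS CPU units, rounded to the nearest whole unit."""
    return round(vcpus * CPU_UNITS_PER_VCPU)


print(cpu_units_to_vcpus(512))   # 0.5
print(vcpus_to_cpu_units(2))     # 2048
```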
Diagnosing Resource Issues
Before setting limits, you need to understand your application’s actual needs.
- Monitor Existing Tasks: Use CloudWatch Container Insights or custom CloudWatch metrics to observe your application’s CPU and memory utilization during peak load. Look for metrics like `CPUUtilization` and `MemoryUtilization` at the task level. Pay close attention to the 95th or 99th percentile values.
- Simulate Load: If you don’t have production traffic, use load testing tools (e.g., Artillery, k6, Locust) to simulate realistic user traffic against your application. Monitor resource usage during these tests.
- Check ECS Task Events: In the ECS console, navigate to your cluster, then to "Tasks." Select a task that failed to start or was stopped and read its stopped reason. A memory kill is typically reported as "OutOfMemoryError: Container killed due to memory usage." Placement failures caused by insufficient CPU or memory on the container instances show up in the service’s "Events" tab instead.
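If you export utilization samples from CloudWatch or a load test, the 95th or 99th percentile can be computed with a few lines of Python. This is a sketch using the nearest-rank method; CloudWatch's own percentile statistics may interpolate differently, and the sample values below are invented for illustration.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Return the nearest-rank percentile of samples, for pct in (0, 100]."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical memory samples in MiB, one per minute during a load test.
memory_mib = [910, 940, 1020, 1480, 990, 1005, 1500, 970, 1010, 1100]
print(percentile(memory_mib, 95))
```

Sizing from a high percentile rather than the mean keeps short spikes from being averaged away.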
Setting CPU Limits
CPU is often about ensuring your application has enough processing power to handle requests without becoming unresponsive.
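One way to tell a throttled task from one that is merely busy is the `cpu.stat` file of its cgroup, which counts scheduler periods (`nr_periods`) and how many of them were throttled (`nr_throttled`). The sketch below only parses text you have already read from that file. Where the file lives depends on the host's cgroup version and layout, so no path is assumed here, and the counters typically stay at zero when no CPU quota is applied.

```python
def throttle_ratio(cpu_stat_text: str) -> float:
    """Return the fraction of scheduler periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        # Each line of cpu.stat is "<name> <integer value>".
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no quota in effect, or no periods elapsed yet
    return stats.get("nr_throttled", 0) / periods


# Illustrative contents, not captured from a real host.
sample = "nr_periods 200\nnr_throttled 50\nthrottled_time 123456789\n"
print(throttle_ratio(sample))  # 0.25
```

A ratio that keeps rising between two readings suggests the task is hitting its CPU ceiling.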
- Diagnosis:
  - Observe `CPUUtilization` in CloudWatch Container Insights for your tasks.
  - Look at the `cpu_usage.total_usage` and `cpu_limit` metrics exposed by the cAdvisor agent (often available via Container Insights or by querying the ECS agent API).
  - Command example (if you have direct access to the EC2 instance):

        docker inspect <container_id> | grep -i cpu
        docker stats --no-stream <container_id>

    (Note: `docker inspect` shows limits, `docker stats` shows usage.)
- Common Causes & Fixes:
  - Underestimation of Peak Load: Your application needs more CPU than you’ve allocated to handle bursts of traffic.
    - Fix: Increase the `cpu` value in your task definition. If you observed 70% utilization on a 512 CPU unit task during peak, consider increasing to 1024 CPU units (1 vCPU).
    - Why it works: Gives the task more CPU time per scheduling period, so bursts are absorbed instead of queued.
  - CPU Throttling: The task is hitting its CPU limit, causing requests to be delayed or dropped.
    - Diagnosis: Monitor `CPUUtilization` in CloudWatch. If it sits at or near 100% of your allocated limit for sustained periods, the task is very likely being throttled.
    - Fix: Increase the `cpu` value, for example from 512 to 1024.
    - Why it works: Allocates more CPU resources, reducing or eliminating throttling.
  - Shared Instance CPU Contention: On the EC2 launch type, other tasks on the same EC2 instance might be consuming CPU, starving your task.
    - Fix: Size your EC2 instances for the aggregate CPU needs of all tasks running on them. Use `top` or `htop` on the EC2 instance to check overall CPU usage.
    - Why it works: Prevents resource starvation by ensuring the underlying host has enough CPU.
  - Incorrect CPU Reservation vs. Limit: On Fargate, the task-level `cpu` value is a hard limit. On EC2, a task-level `cpu` value is also a hard ceiling, while a container-level `cpu` value is a relative share that the container can exceed when the host has idle CPU. Set the value somewhat above your peak observed usage to avoid throttling, but not so high that you waste resources.
    - Fix: Set the `cpu` value in the task definition to 1.5x to 2x your 95th percentile observed CPU usage during peak load. For example, if usage at peak is consistently 700 units, 1.5x is 1050 units: choose a value such as 1536 on EC2, or 2048 on Fargate, which only accepts fixed sizes (256, 512, 1024, 2048, and so on).
    - Why it works: Provides headroom for spiky workloads without over-provisioning.
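The headroom rule above can be sketched as a small helper. It multiplies the 95th percentile usage by a headroom factor and, for Fargate, rounds up to the next fixed task size. The function name is hypothetical, and the size list reflects the Fargate task sizes known at the time of writing, so check the current AWS documentation before relying on it.

```python
import math

# Task-level CPU sizes accepted by Fargate, in CPU units (1024 = 1 vCPU).
# Assumption: this list may change as AWS adds sizes.
FARGATE_CPU_SIZES = (256, 512, 1024, 2048, 4096, 8192, 16384)


def recommend_cpu_units(p95_units: float, headroom: float = 1.5,
                        fargate: bool = True) -> int:
    """Return a cpu value covering p95 usage multiplied by a headroom factor."""
    if p95_units <= 0 or headroom < 1:
        raise ValueError("p95_units must be positive and headroom must be >= 1")
    target = math.ceil(p95_units * headroom)
    if not fargate:
        # The EC2 launch type is not restricted to the fixed Fargate sizes.
        return target
    for size in FARGATE_CPU_SIZES:
        if size >= target:
            return size
    raise ValueError(f"{target} CPU units exceeds the largest Fargate size")


print(recommend_cpu_units(700))                 # 2048
print(recommend_cpu_units(700, fargate=False))  # 1050
```

On Fargate the jump from 1050 to 2048 looks wasteful, which is a reason to check whether a 1.3x factor (landing on 1024) is acceptable for workloads with mild spikes.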
Setting Memory Limits
Memory is often more critical for stability due to the OOM killer.
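An OOM kill usually surfaces as a container exit code above 128, which by shell convention means 128 plus the number of the signal that ended the process. The following sketch decodes that convention with Python's standard `signal` module. It assumes a Linux host, and the function name is illustrative.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a container exit code using the 128 + signal-number convention."""
    if code == 0:
        return "exited normally"
    if code > 128:
        try:
            name = signal.Signals(code - 128).name
        except ValueError:
            return f"exit code {code} (unknown signal {code - 128})"
        return f"killed by {name}"
    return f"application error (exit code {code})"


print(describe_exit_code(137))  # killed by SIGKILL
```

SIGKILL alone does not prove an OOM kill, since a forced stop also sends it, so confirm against the task's stopped reason or the kernel log.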
- Diagnosis:
  - Monitor `MemoryUtilization` and `MemoryReservation` in CloudWatch Container Insights.
  - Crucially, look for OOM events: If tasks are being terminated unexpectedly with exit code `137` (128 + 9, meaning the process received SIGKILL), it’s a strong indicator of OOM kills. Exit code `134` (SIGABRT) more often means the runtime aborted itself, which can also follow a failed allocation. Check ECS task logs and system logs on EC2 instances (if applicable) for "Out of memory: Kill process" or "Out of memory: Killed process" messages.
  - Command example (if you have direct access to the EC2 instance):

        sudo journalctl -k | grep -i "out of memory"
        sudo dmesg | grep -i "out of memory"
        docker inspect <container_id> | grep -i memory
- Common Causes & Fixes:
  - Underestimating Peak Memory Usage: Your application’s memory footprint grows over time or spikes under load.
    - Diagnosis: Observe `MemoryUtilization` in CloudWatch. If it’s consistently high (e.g., 80-90%), you’re at risk.
    - Fix: Increase the `memory` value in your task definition. If your 95th percentile usage is 1500 MiB, set the task memory limit to at least 2048 MiB.
    - Why it works: Provides more RAM for the application to use, preventing it from hitting the OOM killer threshold.
  - Memory Leaks: A bug in your application causes it to continuously consume more memory without releasing it.
    - Diagnosis: Observe `MemoryUtilization` climbing steadily over time in CloudWatch, even under constant load.
    - Fix: This requires application code changes to fix the leak. In the interim, you might need to set a memory limit that’s high enough to prevent OOM kills for a reasonable duration, or implement application-level restarts.
    - Why it works: Fixing the leak addresses the root cause of unbounded memory growth; the higher limit and restarts only buy time.
  - JVM Heap Size Configuration: For Java applications, the JVM’s maximum heap size (`-Xmx`) defaults to a fraction of the memory it detects. Container-aware JVMs (JDK 10 and later, and 8u191 and later) use one quarter of the container’s limit by default; older JVMs size the heap from the host’s memory and can try to allocate more than the container allows.
    - Fix: Explicitly set the JVM heap size using `-Xmx` and `-Xms` in your application’s startup command. A common pattern is `java -Xmx1024m -Xms512m -jar myapp.jar`. Ensure `-Xmx` is less than your container’s memory limit, leaving room for metaspace, thread stacks, and other non-heap memory. For example, if your task has 2048 MiB memory, set `-Xmx1536m`.
    - Why it works: An explicit `-Xmx` caps the heap at a known value below the container’s memory limit, so heap growth alone cannot trigger an OOM kill.
  - Application Dependencies: Libraries or frameworks might have higher memory overhead than anticipated.
    - Fix: Profile your application’s memory usage with tools like YourKit, JProfiler, or `pprof` (for Go) to identify which components consume the most memory. Adjust task memory limits or optimize those components.
    - Why it works: Identifies and allows optimization of memory-hungry parts of your application.
  - Buffer/Cache Usage: Some processes (like databases or file system caches) might use memory that the kernel can reclaim if needed. However, the OOM killer doesn’t always distinguish well between actively used memory and cache.
    - Fix: Ensure your task memory limit is sufficiently higher than the application’s working set to accommodate temporary spikes and OS caching. A common rule of thumb is to set the task memory limit to 1.5x to 2x your observed peak application memory usage.
    - Why it works: Provides a larger buffer, making it less likely for the OOM killer to be invoked by legitimate memory usage or caching.
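The two numeric examples in this section (1500 MiB at the 95th percentile leading to a 2048 MiB limit, and a 2048 MiB task running with `-Xmx1536m`) can be reproduced with a short sketch. The 1.25 headroom factor, the 256 MiB rounding step, and the 75% heap fraction are assumptions chosen to match those examples, not ECS or JVM requirements. Note also that Fargate only accepts certain memory values for each CPU size.

```python
import math


def recommend_memory_mib(p95_mib: float, headroom: float = 1.25,
                         step_mib: int = 256) -> int:
    """Return a memory limit covering p95 usage times headroom, rounded up to step_mib."""
    if p95_mib <= 0 or headroom < 1 or step_mib <= 0:
        raise ValueError("p95_mib and step_mib must be positive, headroom >= 1")
    target = p95_mib * headroom
    return math.ceil(target / step_mib) * step_mib


def jvm_max_heap_flag(task_memory_mib: int, heap_fraction: float = 0.75) -> str:
    """Return an -Xmx flag that leaves part of the task memory for non-heap use."""
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    heap_mib = int(task_memory_mib * heap_fraction)
    return f"-Xmx{heap_mib}m"


limit = recommend_memory_mib(1500)
print(limit)                     # 2048
print(jvm_max_heap_flag(limit))  # -Xmx1536m
```

Applications with large off-heap buffers or many threads may need a smaller heap fraction than 75%.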
Soft vs. Hard Limits (Memory)
ECS container definitions accept both `memory` (hard limit) and `memoryReservation` (soft limit).
- `memory`: The hard limit for the container’s memory. If the container exceeds this, it will be OOM killed.
- `memoryReservation`: A soft limit. ECS reserves this amount on the container instance when placing the task, and the container may use more, up to `memory`, when the host has memory to spare. This is primarily for scheduling optimization, not enforcement.
Recommendation: Set `memory` to your actual observed peak usage plus a buffer (e.g., 20-50%). You can often leave `memoryReservation` unset, or set it lower than `memory` (it must be lower when both are given) if you want ECS to place tasks based on typical rather than peak usage. Either way, the `memory` value is the one that triggers the OOM killer, so for critical applications, setting `memory` is paramount.
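As an illustration, a container definition fragment can be checked for the relationship between the two values before it is registered. This is a sketch: the container name and image are placeholders, and the check covers only the rule that `memoryReservation` must be lower than `memory` when both are set.

```python
def check_memory_settings(container: dict) -> list[str]:
    """Return a list of problems found in a container definition's memory settings."""
    problems = []
    hard = container.get("memory")              # hard limit, MiB
    soft = container.get("memoryReservation")   # soft limit, MiB
    if hard is None and soft is None:
        # Acceptable only if a task-level memory value bounds the container.
        problems.append("no container-level memory or memoryReservation is set")
    if hard is not None and soft is not None and soft >= hard:
        problems.append("memoryReservation must be lower than memory")
    return problems


# Hypothetical container definition fragment.
web = {
    "name": "web",
    "image": "example/web:latest",
    "memory": 2048,
    "memoryReservation": 1536,
}
print(check_memory_settings(web))  # []
```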
The next error you’ll likely encounter after fixing CPU and memory limits is a Port 80 is already in use error if you’re not correctly managing port mappings or using dynamic port allocation.