Your Docker Compose healthchecks are probably not working because they’re not testing the right thing, or they’re testing it too aggressively.
Here’s a simple docker-compose.yml that uses a basic ping healthcheck for a service called my-app.
    version: '3.8'
    services:
      my-app:
        image: alpine:latest
        command: ["sleep", "infinity"]
        healthcheck:
          test: ["CMD-SHELL", "ping -c 1 localhost"]
          interval: 10s
          timeout: 5s
          retries: 3
          start_period: 10s
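When you run docker compose up -d, Docker runs ping -c 1 localhost inside the container every 10 seconds. A single check fails if the command exits non-zero or takes longer than the 5-second timeout, and after 3 consecutive failures the container is marked as unhealthy. The start_period is a 10-second grace period after startup: checks still run during it, but failures don’t count toward retries, while the first success marks the container as healthy right away.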
But this is rarely what you want. A ping to localhost in a container just tells you if the container’s networking stack is up, not if your actual application inside is ready to serve requests.
The Real Problem: Testing the Application, Not the Container
The core issue is that healthchecks should verify the application’s ability to respond to its intended workload, not just the container’s basic operational status. For web applications, this typically means checking if the HTTP server is running and returning a successful status code.
Let’s look at a more realistic scenario for a web service. Imagine my-web-app is a Node.js application listening on port 3000.
    version: '3.8'
    services:
      my-web-app:
        image: node:18-alpine
        working_dir: /app
        volumes:
          - ./app:/app
        command: ["node", "server.js"]
        ports:
          - "3000:3000"
        healthcheck:
          test: ["CMD-SHELL", "wget --quiet --spider http://localhost:3000/health || exit 1"]
          interval: 15s
          timeout: 10s
          retries: 5
          start_period: 30s
Here, wget --quiet --spider http://localhost:3000/health || exit 1 is the key.
- wget --quiet --spider: This tells wget to be silent and not download the page, just check if it’s accessible.
- http://localhost:3000/health: This is the endpoint your application should expose to indicate its health. It’s common practice to have a dedicated /health or /status endpoint.
- || exit 1: If wget fails (e.g., connection refused, 404, 500), it already returns a non-zero exit code. The || exit 1 normalizes that code to 1, the exit code Docker documents for an unhealthy check (0 means healthy, and 2 is reserved).
Why start_period is Your Best Friend
The start_period is often the most misunderstood and underutilized parameter. It provides a grace period after the container starts during which healthcheck failures don’t count toward the retries limit. This is vital for applications that take time to initialize, connect to databases, or perform other startup tasks.
In the my-web-app example, start_period: 30s means Docker won’t mark the container as unhealthy for the first 30 seconds, even if the /health endpoint isn’t ready yet. You need to set this to be longer than your application’s typical startup time.
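To make the timing concrete, here is a rough sketch of how the four parameters interact. It assumes a hypothetical app that needs about 45 seconds to start; the values are illustrative, not recommendations.

```yaml
# Fragment only: place under the service definition in docker-compose.yml.
healthcheck:
  test: ["CMD-SHELL", "wget --quiet --spider http://localhost:3000/health || exit 1"]
  interval: 15s      # wait between the end of one check and the start of the next
  timeout: 10s       # a check that runs longer than this counts as a failure
  retries: 5         # consecutive failures needed to mark the container unhealthy
  start_period: 60s  # assumed ~45s startup plus margin; failures here don't count
```

With these values, once the grace period is over, a dead application is flagged after at most roughly 5 × (15s + 10s) = 125 seconds, since each failed check can take up to the full timeout before the next interval begins. If that detection delay is too long for your use case, lower interval or retries rather than shortening start_period.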
Common Pitfalls and Their Fixes:
- Testing the wrong port/service:
  - Diagnosis: Your healthcheck command connects to localhost:80 but your app listens on localhost:8080.
  - Fix: Ensure the test command uses the correct internal port your application is bound to. For my-web-app, it’s http://localhost:3000/health.
- Healthcheck endpoint not implemented or returning errors:
  - Diagnosis: Your application’s /health endpoint is returning 500 Internal Server Error, or it’s not implemented at all.
  - Fix: Implement a /health endpoint in your application that returns a 200 OK status code only when the application is fully operational (e.g., database connections are valid, critical services are reachable). If the app has dependencies, the healthcheck should ideally verify those too.
- start_period is too short:
  - Diagnosis: Your app takes 45 seconds to start, but start_period is set to 10 seconds, leading to premature "unhealthy" marks.
  - Fix: Increase start_period to be comfortably longer than your application’s maximum expected startup time. For complex apps, this might be 60 seconds or more.
- Network issues within the container (less common for localhost):
  - Diagnosis: While ping localhost might pass, an application trying to connect to another service within the same Docker network might fail due to misconfigured DNS or network overlays.
  - Fix: If your healthcheck involves inter-service communication, use the service name instead of localhost (e.g., wget --quiet --spider http://my-api:8080/health || exit 1 for a hypothetical HTTP service named my-api). Match the check to the protocol: a database port such as PostgreSQL’s 5432 does not speak HTTP, so use a native client like pg_isready there. Ensure DNS resolution is working for service names.
- Overly aggressive timeout or retries:
  - Diagnosis: Your healthcheck command is slow to execute (e.g., a complex database query) and times out, or network latency causes intermittent failures that exceed retries.
  - Fix: Increase timeout to give the command sufficient time. Increase retries if you expect transient network issues or slow responses, but be careful not to mask genuine problems. A common pattern is timeout = 2-3x the expected command execution time, and retries = 3-5.
- Using CMD instead of CMD-SHELL when a shell is needed:
  - Diagnosis: Your test is ["CMD", "curl", "http://localhost:3000/health"]. The exec form runs curl directly, so it works only if curl is in the image (Alpine-based images don’t ship it by default), and without --fail curl exits 0 even on HTTP errors. If you need shell logic like || exit 1, you need CMD-SHELL.
  - Fix: Use test: ["CMD-SHELL", "curl --fail http://localhost:3000/health"]. The --fail flag for curl makes it return a non-zero exit code on HTTP errors (status 400 and above), achieving the same as wget ... || exit 1.
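As a sketch of the difference behind that last pitfall, here are the two forms of test side by side for my-web-app. It uses wget rather than curl because the BusyBox wget is already present in Alpine-based images, and it is a fragment rather than a complete service definition.

```yaml
services:
  my-web-app:
    image: node:18-alpine
    healthcheck:
      # Exec form: the command runs directly with no shell, so operators such as
      # ||, &&, pipes, and variable expansion are not available.
      test: ["CMD", "wget", "--quiet", "--spider", "http://localhost:3000/health"]
      # Shell form: the string is passed to the container's default shell
      # (/bin/sh -c on Linux), so shell operators work.
      # test: ["CMD-SHELL", "wget --quiet --spider http://localhost:3000/health || exit 1"]
      interval: 15s
      timeout: 10s
      retries: 5
      start_period: 30s
```

Either form fails the check when wget exits non-zero. The shell form is only needed when the check relies on shell features, and it requires a shell to exist in the image, which is not the case for some minimal or distroless images.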
The Next Step: Orchestrating Healthchecks
Once your individual service healthchecks are robust, you’ll want to understand how Docker Compose uses these statuses. For example, you might want to deploy a new version of your application and only switch traffic over once the new instances are healthy. This is where orchestrators come in: Swarm acts on these same container healthchecks, while Kubernetes uses its own liveness and readiness probes instead. Either way, understanding your docker-compose healthchecks is the foundational step.
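Within Compose itself, one place the health status is consumed is depends_on with condition: service_healthy, which delays a dependent service until the dependency reports healthy. The sketch below assumes a hypothetical my-proxy service running nginx:alpine in front of my-web-app; both the name and the image are illustrative.

```yaml
services:
  my-web-app:
    image: node:18-alpine
    working_dir: /app
    volumes:
      - ./app:/app
    command: ["node", "server.js"]
    healthcheck:
      test: ["CMD-SHELL", "wget --quiet --spider http://localhost:3000/health || exit 1"]
      interval: 15s
      timeout: 10s
      retries: 5
      start_period: 30s

  # Hypothetical dependent service, added for illustration only.
  my-proxy:
    image: nginx:alpine
    depends_on:
      my-web-app:
        # Start my-proxy only after my-web-app's healthcheck has passed.
        condition: service_healthy
```

With this in place, docker compose up -d starts my-web-app first and holds back my-proxy until the healthcheck succeeds. If my-web-app becomes unhealthy before that happens, Compose reports an error for the dependent service instead of starting it.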