Your Docker Compose healthchecks are probably not working because they’re not testing the right thing, or they’re testing it too aggressively.

Here’s a simple docker-compose.yml that uses a basic ping healthcheck for a service called my-app.

version: '3.8'

services:
  my-app:
    image: alpine:latest
    command: ["sleep", "infinity"]
    healthcheck:
      test: ["CMD-SHELL", "ping -c 1 localhost"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 10s

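When you run docker compose up -d, Docker executes ping -c 1 localhost inside the container every 10 seconds. Each run has 5 seconds to finish before it counts as a failure, and after 3 consecutive failures the container is marked as unhealthy. The start_period does not delay the checks: they still run during the first 10 seconds, but failures in that window do not count toward retries, while a single success marks the container healthy right away.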

But this is rarely what you want. A ping to localhost in a container just tells you if the container’s networking stack is up, not if your actual application inside is ready to serve requests.

The Real Problem: Testing the Application, Not the Container

The core issue is that healthchecks should verify the application’s ability to respond to its intended workload, not just the container’s basic operational status. For web applications, this typically means checking if the HTTP server is running and returning a successful status code.

Let’s look at a more realistic scenario for a web service. Imagine my-web-app is a Node.js application listening on port 3000.

version: '3.8'

services:
  my-web-app:
    image: node:18-alpine
    working_dir: /app
    volumes:
      - ./app:/app
    command: ["node", "server.js"]
    ports:
      - "3000:3000"
    healthcheck:
      test: ["CMD-SHELL", "wget --quiet --spider http://localhost:3000/health || exit 1"]
      interval: 15s
      timeout: 10s
      retries: 5
      start_period: 30s

Here, wget --quiet --spider http://localhost:3000/health || exit 1 is the key.

  • wget --quiet --spider: This tells wget to be silent and not download the page, just check if it’s accessible.
  • http://localhost:3000/health: This is the endpoint your application should expose to indicate its health. It’s common practice to have a dedicated /health or /status endpoint.
  • || exit 1: If wget fails (e.g., connection refused, 404, 500), it returns a non-zero exit code, which Docker already treats as a failed check. The || exit 1 normalizes whatever code wget returned to 1, the value Docker documents for an unhealthy result (0 means healthy, and 2 is reserved).
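
If the image ships neither wget nor curl, the same check can be expressed with the runtime that is already there. The fragment below is a minimal sketch, assuming the /health endpoint from the example above and Node's built-in http module. The one-line script is an illustration, not part of the original setup.

```yaml
services:
  my-web-app:
    image: node:18-alpine
    healthcheck:
      # Exec form: runs node directly, so no shell, wget, or curl is required.
      # Exits 0 only on HTTP 200; any other status or a connection error exits 1.
      # 127.0.0.1 is used so the check does not depend on how "localhost" resolves.
      test:
        - "CMD"
        - "node"
        - "-e"
        - "require('http').get('http://127.0.0.1:3000/health', (res) => process.exit(res.statusCode === 200 ? 0 : 1)).on('error', () => process.exit(1))"
      interval: 15s
      timeout: 10s
      retries: 5
      start_period: 30s
```

The trade-off is readability: the wget version is shorter, while this one has no dependency on extra tools in the image.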

Why start_period is Your Best Friend

The start_period is often the most misunderstood and underutilized parameter. It provides a grace period after the container starts during which failed healthchecks do not count toward retries. Checks still run in this window, and the first one that passes marks the container healthy. This is vital for applications that take time to initialize, connect to databases, or perform other startup tasks.

In the my-web-app example, start_period: 30s means Docker won’t mark the container as unhealthy for the first 30 seconds, even if the /health endpoint isn’t ready yet. You need to set this to be longer than your application’s typical startup time.
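
To make the timing concrete, here is the same healthcheck block annotated with a rough estimate of when a container that never becomes ready would be reported unhealthy. The arithmetic is an approximation that assumes each check fails quickly; real timings shift with how long each check takes.

```yaml
healthcheck:
  test: ["CMD-SHELL", "wget --quiet --spider http://localhost:3000/health || exit 1"]
  interval: 15s       # first check about 15s after start, then 15s after each check finishes
  timeout: 10s        # a single check that runs longer than this counts as a failure
  retries: 5          # consecutive counted failures needed to become unhealthy
  start_period: 30s   # failures in the first 30s are not counted toward retries
  # Rough estimate for an app that never comes up:
  #   start_period + retries * interval = 30s + 5 * 15s = 105s
  # so the container is reported unhealthy on the order of 1.5 to 2 minutes after start.
```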

Common Pitfalls and Their Fixes:

  1. Testing the wrong port/service:

    • Diagnosis: Your healthcheck command connects to localhost:80 but your app listens on localhost:8080.
    • Fix: Ensure the test command uses the correct internal port your application is bound to. For my-web-app, it’s http://localhost:3000/health.
  2. Healthcheck endpoint not implemented or returning errors:

    • Diagnosis: Your application’s /health endpoint is returning 500 Internal Server Error, or it’s not implemented at all.
    • Fix: Implement a /health endpoint in your application that returns a 200 OK status code only when the application is fully operational (e.g., database connections are valid, critical services are reachable). If the app has dependencies, the healthcheck should ideally verify those too.
  3. start_period is too short:

    • Diagnosis: Your app takes 45 seconds to start, but start_period is set to 10 seconds, leading to premature "unhealthy" marks.
    • Fix: Increase start_period to be comfortably longer than your application’s maximum expected startup time. For complex apps, this might be 60 seconds or more.
  4. Network issues within the container (less common for localhost):

    • Diagnosis: While ping localhost might pass, an application trying to connect to another service within the same Docker network might fail due to misconfigured DNS or network overlays.
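    • Fix: If your healthcheck involves inter-service communication, use the service name instead of localhost (e.g., wget --quiet --spider http://my-api:8080/health || exit 1, where my-api is a hypothetical service that exposes an HTTP health endpoint). A database port such as PostgreSQL’s 5432 does not speak HTTP, so wget cannot probe it; use the database’s own client tool instead (for PostgreSQL, pg_isready). Ensure DNS resolution is working for service names.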
  5. Overly aggressive timeout or retries:

    • Diagnosis: Your healthcheck command is slow to execute (e.g., a complex database query) and times out, or network latency causes enough consecutive intermittent failures to reach the retries limit.
    • Fix: Increase timeout to give the command sufficient time. Increase retries if you expect transient network issues or slow responses, but be careful not to mask genuine problems. A common pattern is timeout = 2-3x the expected command execution time, and retries = 3-5.
  6. Using CMD instead of CMD-SHELL when a shell is needed:

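    • Diagnosis: Your test is ["CMD", "curl", "http://localhost:3000/health"]. This runs curl directly without a shell, so it only works if curl is in the image, and shell logic like || exit 1 is not available. Without --fail, curl also exits 0 on HTTP error responses such as 500, so the check passes even when the app is broken.
    • Fix: Use test: ["CMD-SHELL", "curl --fail http://localhost:3000/health || exit 1"] when you need shell features. The --fail flag makes curl return a non-zero exit code on HTTP errors (status 400 and above), achieving the same as wget ... || exit 1. If no shell logic is needed, the exec form ["CMD", "curl", "--fail", "http://localhost:3000/health"] works too.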
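
The two test forms from the last pitfall can be compared side by side. This is a sketch under stated assumptions: the service names and the my-app-with-curl image are hypothetical, and the image is assumed to contain curl (the stock node:18-alpine image does not include it by default).

```yaml
services:
  # Exec form: Docker runs curl directly, with no shell in between.
  app-exec-form:                      # hypothetical service name
    image: my-app-with-curl:latest    # hypothetical image that includes curl
    healthcheck:
      test: ["CMD", "curl", "--fail", "--silent", "http://localhost:3000/health"]
      interval: 15s
      timeout: 10s
      retries: 5
      start_period: 30s

  # Shell form: the string is handed to the container's shell,
  # so operators such as || work, but the image must contain a shell.
  app-shell-form:                     # hypothetical service name
    image: my-app-with-curl:latest    # hypothetical image that includes curl
    healthcheck:
      test: ["CMD-SHELL", "curl --fail --silent http://localhost:3000/health || exit 1"]
      interval: 15s
      timeout: 10s
      retries: 5
      start_period: 30s
```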

The Next Step: Orchestrating Healthchecks

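Once your individual service healthchecks are robust, you’ll want to understand how these statuses get used. Docker Compose itself can hold back a dependent service until another one is healthy, via depends_on with condition: service_healthy. Beyond that, you might want to deploy a new version of your application and only switch traffic over once the new instances are healthy. This is where orchestrators like Swarm or Kubernetes come in (Kubernetes uses its own liveness and readiness probes rather than Docker healthchecks), but understanding your docker-compose healthchecks is the foundational step.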
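
As one illustration of Compose consuming a health status, the sketch below starts a second service only after my-web-app reports healthy. The proxy service and its nginx:alpine image are assumptions added for the example, and the long depends_on syntax assumes a current docker compose (v2) release.

```yaml
services:
  my-web-app:
    image: node:18-alpine
    working_dir: /app
    volumes:
      - ./app:/app
    command: ["node", "server.js"]
    healthcheck:
      test: ["CMD-SHELL", "wget --quiet --spider http://localhost:3000/health || exit 1"]
      interval: 15s
      timeout: 10s
      retries: 5
      start_period: 30s

  proxy:                      # hypothetical service, not part of the original example
    image: nginx:alpine
    ports:
      - "8080:80"
    depends_on:
      my-web-app:
        # Compose waits for the healthcheck above to pass before starting proxy.
        condition: service_healthy
```

With this in place, docker compose up -d does not start proxy until my-web-app is healthy, and it reports an error instead if my-web-app becomes unhealthy first.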
