DNS health checks are a surprisingly brittle mechanism for achieving automatic failover, often failing to account for the subtle nuances of network and application state.

Let’s say you’re running a critical web service. You’ve got two identical servers, web1.example.com and web2.example.com, both serving your application. To make this resilient, you want a single DNS name, app.example.com, to point to whichever server is healthy. When web1 goes down, you want app.example.com to automatically start resolving to web2.

Here’s how that might look in practice, using a hypothetical DNS provider that supports health checks (many real-world providers like AWS Route 53, Cloudflare, or Akamai have similar features).

Configuration Snippet (Conceptual):

{
  "app.example.com": {
    "records": [
      {
        "type": "A",
        "value": "192.0.2.1", // IP of web1.example.com
        "health_check": {
          "protocol": "HTTP",
          "port": 80,
          "path": "/healthz",
          "interval_seconds": 30,
          "failure_threshold": 3,
          "success_threshold": 2,
          "request_timeout_seconds": 5
        }
      },
      {
        "type": "A",
        "value": "192.0.2.2", // IP of web2.example.com
        "health_check": {
          "protocol": "HTTP",
          "port": 80,
          "path": "/healthz",
          "interval_seconds": 30,
          "failure_threshold": 3,
          "success_threshold": 2,
          "request_timeout_seconds": 5
        }
      }
    ],
    "policy": "failover", // primary/secondary, as defined in failover_strategy below; a provider may also offer e.g. "weighted_round_robin"
    "failover_strategy": {
      "primary": "192.0.2.1",
      "secondary": "192.0.2.2"
    }
  }
}

In this setup, app.example.com resolves to 192.0.2.1 (web1) by default. The DNS provider’s infrastructure sends an HTTP GET request for /healthz to web1 (192.0.2.1, port 80) every 30 seconds. A check counts as failed if the response is anything other than a 2xx or 3xx status code, or if the request times out after 5 seconds. After 3 consecutive failed checks, which takes roughly 60 to 95 seconds from the moment web1 breaks, the provider marks web1 as unhealthy, and DNS queries for app.example.com start returning 192.0.2.2 (web2). When web1 becomes healthy again (passing 2 consecutive checks), it is reinstated as the primary.
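To make the threshold logic concrete, here is a minimal Python sketch of how a provider might track one endpoint's state. The HealthTracker class and the answer_for function are hypothetical names invented for illustration, not any provider's API. The thresholds mirror the conceptual config above.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks one endpoint's health using consecutive-result thresholds."""
    failure_threshold: int = 3
    success_threshold: int = 2
    healthy: bool = True
    _streak: int = 0  # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Feed one check result; return the (possibly updated) health state."""
        if check_passed == self.healthy:
            # The result agrees with the current state, so the opposing streak resets.
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.failure_threshold if self.healthy else self.success_threshold
        if self._streak >= needed:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy


def answer_for(primary: HealthTracker, primary_ip: str, secondary_ip: str) -> str:
    """Return the IP a failover policy would hand out right now."""
    return primary_ip if primary.healthy else secondary_ip


web1 = HealthTracker()
results = [True, False, False, False, False, True, True]  # one result per 30-second check
answers = []
for passed in results:
    web1.record(passed)
    answers.append(answer_for(web1, "192.0.2.1", "192.0.2.2"))

print(answers)
```

In this trace the answer switches to 192.0.2.2 on the third consecutive failure and returns to 192.0.2.1 on the second consecutive success.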

The /healthz endpoint on your web servers should be designed to return a 200 OK only when the application is fully operational. This means checking database connectivity, essential background workers, and any other critical dependencies. A simple ping or checking if the web server process is running isn’t enough.
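As a rough illustration, a dependency-aware /healthz handler could look like the sketch below, written with only the Python standard library. The check_database and check_workers functions are hypothetical placeholders you would replace with real probes, and port 8080 is an assumption (the config above uses port 80, which usually needs elevated privileges to bind).

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def check_database() -> bool:
    """Hypothetical placeholder: replace with a cheap query such as SELECT 1."""
    return True


def check_workers() -> bool:
    """Hypothetical placeholder: replace with a worker heartbeat check."""
    return True


CHECKS: Dict[str, Callable[[], bool]] = {
    "database": check_database,
    "workers": check_workers,
}


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every check; return 200 only if all pass, otherwise 503."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is treated as a failed dependency.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, results = evaluate_health(CHECKS)
        body = json.dumps(results).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Start the endpoint; call this from your own entry point."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Returning 503 for any failed dependency means the provider's probe sees a non-2xx/3xx response and counts the check as failed. The JSON body is only there to help a human see which dependency broke.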

The real magic happens in the DNS provider’s network. They operate a distributed system of health-checking probes. These probes are geographically diverse, meaning they don’t all come from the same data center. This helps avoid false positives, where a network problem local to a single probe would otherwise get your service marked down for everyone.
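One way to picture how multiple probes guard against a localized problem is a simple majority vote. The sketch below is an assumption about how results could be combined, not a description of any specific provider; real providers use their own aggregation rules and thresholds. The region names and the endpoint_is_healthy function are hypothetical.

```python
from typing import Mapping


def endpoint_is_healthy(
    probe_results: Mapping[str, bool],
    min_healthy_fraction: float = 0.5,
) -> bool:
    """Return True if more than min_healthy_fraction of probes report success."""
    if not probe_results:
        # No data: this sketch fails open and keeps the endpoint in service.
        return True
    passing = sum(1 for ok in probe_results.values() if ok)
    return passing / len(probe_results) > min_healthy_fraction


# One probe cannot reach the server, but the other two can.
one_region_blip = {"us-east": True, "eu-west": False, "ap-south": True}
# Every probe fails: the server itself is very likely down.
real_outage = {"us-east": False, "eu-west": False, "ap-south": False}

print(endpoint_is_healthy(one_region_blip))
print(endpoint_is_healthy(real_outage))
```

With a single probe, the first case would have triggered a failover for everyone. With a vote, only the second case does.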

Now, let’s talk about the subtle dangers. The DNS TTL (Time To Live) is crucial here. If app.example.com has a TTL of 300 seconds (5 minutes), then even after the DNS provider switches the resolution to web2, clients and resolvers that have already cached the IP for web1 will continue to use it for up to 5 minutes. This means your failover isn’t instantaneous for all users. You might need to set a low TTL, like 60 seconds or even 30 seconds, on your failover DNS record to shorten that window. The trade-off is more DNS queries, since caches expire sooner, which means more traffic to your provider and possibly higher cost if queries are billed. Some resolvers and client libraries also hold records longer than the TTL says, so a low TTL shortens the window without guaranteeing it.
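The arithmetic behind that window is worth spelling out. The sketch below estimates the worst case for a client, assuming evenly spaced checks, a failure that begins just after a passing check, failed checks that run to the full timeout, and caches that honor the TTL exactly. The function names are made up for illustration, and the numbers come from the conceptual config above.

```python
def worst_case_detection_seconds(
    interval: int, failure_threshold: int, request_timeout: int
) -> int:
    """Time from the server breaking until the provider marks it unhealthy."""
    # The failure starts right after a passing check, so the provider needs
    # failure_threshold further checks, and the last one must time out.
    return interval * failure_threshold + request_timeout


def worst_case_failover_seconds(
    interval: int, failure_threshold: int, request_timeout: int, ttl: int
) -> int:
    """Detection time plus a cache entry fetched just before the switch."""
    detection = worst_case_detection_seconds(interval, failure_threshold, request_timeout)
    return detection + ttl


for ttl in (300, 60, 30):
    total = worst_case_failover_seconds(
        interval=30, failure_threshold=3, request_timeout=5, ttl=ttl
    )
    print(f"TTL {ttl}s: up to {total}s before every client reaches web2")
```

Under these assumptions the totals are 395, 155, and 125 seconds. Detection alone accounts for 95 seconds, so dropping the TTL from 60 to 30 buys relatively little unless the check interval or failure threshold is tightened too.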

A common pitfall is a health check that’s too simplistic. Imagine your web application can start its web server and respond to HTTP requests, but it can’t connect to its database. A health check that only verifies the web server is up will pass, but your actual application is broken. The /healthz endpoint must be intelligent. It should query the database, check essential queues, and confirm that the entire application stack is functional, and it has to do all of that within the 5-second request timeout, or the check is counted as a failure anyway.

Consider the scenario where your primary server web1 is experiencing intermittent network issues, but its application is otherwise fine. The health check might oscillate between passing and failing. With a failure_threshold of 3 and a success_threshold of 2, a single missed check changes nothing, but a disruption that spans 3 consecutive checks (a minute or so at a 30-second interval) flips DNS to web2. If web1 then passes 2 checks in a row, DNS flips back. When the disruptions keep recurring, this flapping can confuse clients and lead to unpredictable behavior. You need to tune these thresholds carefully based on your tolerance for transient failures.
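A small simulation can help when choosing thresholds. The sketch below counts how often the healthy/unhealthy state would flip for a given sequence of check results, assuming the provider applies plain consecutive-result thresholds. The count_flips function and the sample trace are hypothetical. Replay your own monitoring history for a more realistic picture.

```python
from typing import Iterable


def count_flips(
    results: Iterable[bool], failure_threshold: int, success_threshold: int
) -> int:
    """Count healthy/unhealthy transitions for a sequence of check results."""
    healthy, streak, flips = True, 0, 0
    for passed in results:
        if passed == healthy:
            # The result agrees with the current state; reset the opposing streak.
            streak = 0
            continue
        streak += 1
        needed = failure_threshold if healthy else success_threshold
        if streak >= needed:
            healthy, streak, flips = not healthy, 0, flips + 1
    return flips


# Hypothetical flaky link: 3 failed checks, then 2 passing checks, repeated 4 times.
intermittent = ([False] * 3 + [True] * 2) * 4

print(count_flips(intermittent, failure_threshold=3, success_threshold=2))
print(count_flips(intermittent, failure_threshold=5, success_threshold=2))
print(count_flips(intermittent, failure_threshold=3, success_threshold=10))
```

For this trace the three settings produce 8, 0, and 1 flips. Raising failure_threshold rides out the short bursts entirely, at the cost of slower detection of a real outage. Raising success_threshold fails over once and then stays on web2 until web1 has been stable for longer.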

Another common problem is the health check endpoint itself becoming unavailable. If /healthz is served by the same web server process that is failing, the check fails along with it, even when the application behind it is fine or would recover on its own. Where possible, serve the health check from a minimal handler or a separate lightweight service, isolated from the main application logic. A static file is the simplest version of this, though it only proves the host is reachable and gives up the dependency checks that make the endpoint useful in the first place.

Finally, think about what happens after the primary fails. If web1 is down, and you want to bring it back online, your health check needs to accurately reflect its restored health. If the application restarts but immediately runs into a dependency issue (like a database that’s slow to respond), the health check might fail again. This means the DNS might not switch back to web1 even when it appears to be "up" from an OS perspective.

The next thing you’ll likely encounter is trying to manage more complex routing scenarios, like geo-based routing or weighted traffic shifting, and realizing DNS health checks are only one piece of that puzzle.
