Route 53’s failover routing policy doesn’t actually check whether your service is healthy; it returns the secondary endpoint whenever the health signal it has for the primary says the primary is unhealthy.

Let’s watch this in action. Imagine we have two identical ECS services, one in us-east-1 (primary) and one in us-west-2 (secondary). We’ve configured Route 53 to point our domain, myapp.example.com, to the primary, with the secondary as a failover.

Here’s a simplified view of the change batch that creates the two failover records. The HostedZoneId values inside AliasTarget are the hosted zone IDs of the load balancers, not of your own hosted zone; they vary by region and load balancer type, so look yours up rather than copying these. Alias records also take no TTL of their own, since Route 53 uses the TTL of the alias target:

{
    "Comment": "Failover routing for myapp.example.com",
    "Changes": [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "myapp.example.com",
                "Type": "A",
                "SetIdentifier": "primary-us-east-1",
                "Failover": "PRIMARY",
                "AliasTarget": {
                    "HostedZoneId": "Z1UJRXOUMOOFQ8",
                    "DNSName": "my-elb-primary-1234567890.us-east-1.elb.amazonaws.com",
                    "EvaluateTargetHealth": true
                }
            }
        },
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "myapp.example.com",
                "Type": "A",
                "SetIdentifier": "secondary-us-west-2",
                "Failover": "SECONDARY",
                "AliasTarget": {
                    "HostedZoneId": "Z3BJ6K684X502I",
                    "DNSName": "my-elb-secondary-0987654321.us-west-2.elb.amazonaws.com",
                    "EvaluateTargetHealth": true
                }
            }
        }
    ]
}
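If you generate these records from code, a small builder keeps the two entries symmetrical. The following is a minimal sketch, not a definitive implementation: the function names failover_change and failover_change_batch are hypothetical, and the hosted zone IDs are simply the values from the record set above, which you should verify for your own load balancer type and region.

```python
import json


def failover_change(name, set_id, role, elb_zone_id, elb_dns_name):
    """Build one UPSERT change for an alias failover record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError(f"role must be PRIMARY or SECONDARY, got {role!r}")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": set_id,
            "Failover": role,
            # No TTL key: alias records use the TTL of their target.
            "AliasTarget": {
                "HostedZoneId": elb_zone_id,
                "DNSName": elb_dns_name,
                "EvaluateTargetHealth": True,
            },
        },
    }


def failover_change_batch(name, primary, secondary):
    """primary and secondary are (set_id, elb_zone_id, elb_dns_name) tuples."""
    return {
        "Comment": f"Failover routing for {name}",
        "Changes": [
            failover_change(name, primary[0], "PRIMARY", primary[1], primary[2]),
            failover_change(name, secondary[0], "SECONDARY", secondary[1], secondary[2]),
        ],
    }


# Values copied from the example above; treat the zone IDs as placeholders.
batch = failover_change_batch(
    "myapp.example.com",
    ("primary-us-east-1", "Z1UJRXOUMOOFQ8",
     "my-elb-primary-1234567890.us-east-1.elb.amazonaws.com"),
    ("secondary-us-west-2", "Z3BJ6K684X502I",
     "my-elb-secondary-0987654321.us-west-2.elb.amazonaws.com"),
)
print(json.dumps(batch, indent=4))
```

The resulting dictionary has the shape that the ChangeBatch parameter of boto3’s change_resource_record_sets call expects, alongside the HostedZoneId of your own zone.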

Notice EvaluateTargetHealth: true. This is crucial. It tells Route 53 to treat the primary record as healthy or unhealthy according to the health that the primary ELB reports for its targets. If the ELB is reported unhealthy, Route 53 will stop returning the primary ELB’s IP addresses and start returning the secondary ELB’s.

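But what exactly is that health signal? With an alias record and EvaluateTargetHealth, it is not a separate Route 53 health check resource (you can attach one to a record with HealthCheckId, but the record set above doesn’t). It comes from the ELB, so Route 53 doesn’t inherently know about your ECS service’s internal state, only what the ELB reports. The ELB, in turn, has its own health check configured to request a specific path on your service, like /healthz.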
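To make the /healthz side concrete, here is a minimal sketch of such an endpoint using only the Python standard library. The names healthz_response, CHECKS, and make_server are hypothetical, as is the single always-true check; a real service would plug in its own dependency probes and web framework.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical dependency probes: name -> callable returning True when healthy.
CHECKS = {"self": lambda: True}


def healthz_response(path, checks):
    """Return (status_code, body) for a health check request."""
    if path != "/healthz":
        return 404, b"not found\n"
    failed = [name for name, check in checks.items() if not check()]
    if failed:
        # Any non-200 answer counts as a failed check in the article's model.
        return 503, ("unhealthy: " + ", ".join(failed) + "\n").encode()
    return 200, b"ok\n"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = healthz_response(self.path, CHECKS)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep health check noise out of the logs


def make_server(port=8080):
    """Create (but do not start) the HTTP server; the port is an assumption."""
    return HTTPServer(("0.0.0.0", port), HealthHandler)
```

Calling make_server().serve_forever() would start it. Keeping the decision in healthz_response, separate from the handler, makes the health logic easy to unit test.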

When your ECS service tasks are running, they are registered with the ELB. The ELB requests /healthz on each task. If a task responds with 200 OK, the ELB considers it healthy. If it doesn’t respond, or responds with an error, for enough consecutive checks, the ELB marks it unhealthy. Route 53 tracks the ELB’s overall health status. If the ELB reports that all registered targets are unhealthy, Route 53 considers the primary record unhealthy, triggering the failover.
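The two levels described above (per-task health, then load balancer health) can be modeled in a few lines. This is a simplified sketch, not the ELB’s actual implementation: the Target class, the threshold values, and the rule that a target flips state only after several consecutive contrary results are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """One registered task as the load balancer sees it (simplified model)."""
    name: str
    healthy: bool = True
    streak: int = 0  # consecutive results that disagree with the current state

    def record(self, status_code, healthy_threshold=3, unhealthy_threshold=2):
        """Feed one health check result; status_code is None on a timeout."""
        ok = status_code == 200
        if ok == self.healthy:
            self.streak = 0  # result agrees with current state, nothing changes
            return
        self.streak += 1
        limit = healthy_threshold if ok else unhealthy_threshold
        if self.streak >= limit:
            self.healthy = ok
            self.streak = 0


def load_balancer_healthy(targets):
    """The signal Route 53 sees: unhealthy only when no target is healthy."""
    return any(t.healthy for t in targets)


tasks = [Target("task-a"), Target("task-b")]
for _ in range(2):
    tasks[0].record(500)   # task-a returns errors
    tasks[1].record(None)  # task-b times out
print(load_balancer_healthy(tasks))  # False: every target is now unhealthy
```

Note that a single failing task does not move traffic to the other region in this model; failover only happens when the last healthy target is gone.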

The actual failover mechanism is Route 53’s DNS resolution. When a resolver queries myapp.example.com, Route 53 answers based on the current health of the primary endpoint. If healthy, it returns the IP(s) of the primary ELB. If unhealthy, it returns the IP(s) of the secondary ELB. This is entirely at the DNS layer, before any traffic even hits your application infrastructure. Because resolvers cache each answer for the record’s TTL, some clients keep reaching the primary for a short time after it fails.
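One consequence of failing over at the DNS layer is that a caching resolver keeps serving its last answer until the TTL runs out. The sketch below simulates that delay; CachingResolver, the 60-second TTL, and the route53 stand-in function are illustrative assumptions, not real resolver or AWS code.

```python
class CachingResolver:
    """Toy recursive resolver that caches one answer for ttl_seconds."""

    def __init__(self, authoritative, ttl_seconds=60):
        self.authoritative = authoritative  # callable returning the current answer
        self.ttl = ttl_seconds
        self.cached = None
        self.expires_at = float("-inf")

    def resolve(self, now):
        # Only ask the authoritative side once the cached answer has expired.
        if now >= self.expires_at:
            self.cached = self.authoritative()
            self.expires_at = now + self.ttl
        return self.cached


state = {"primary_healthy": True}


def route53():
    """Stand-in for Route 53's failover answer."""
    return "primary" if state["primary_healthy"] else "secondary"


resolver = CachingResolver(route53, ttl_seconds=60)
print(resolver.resolve(now=0))    # primary
state["primary_healthy"] = False  # primary fails at t=10
print(resolver.resolve(now=30))   # primary (stale cached answer)
print(resolver.resolve(now=60))   # secondary (cache expired, fresh answer)
```

The gap between the failure and the last stale answer is bounded by the TTL, plus however long the health signal takes to turn unhealthy in the first place.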

Here’s the mental model:

  1. User Request: A user types myapp.example.com into their browser.
  2. DNS Query: The user’s resolver queries Route 53 for myapp.example.com.
  3. Route 53 Health Evaluation: Route 53 looks at the health of the primary ELB (my-elb-primary-1234567890.us-east-1.elb.amazonaws.com), which EvaluateTargetHealth ties to the ELB’s own view of its targets.
  4. ELB Health: The ELB, in turn, checks the health of the ECS tasks registered with it.
  5. Decision:
    • If the primary ELB is reported healthy (meaning the ELB thinks it has healthy targets), Route 53 returns the primary ELB’s IPs. Traffic goes to us-east-1.
    • If the primary ELB is reported unhealthy (meaning the ELB thinks all its targets are unhealthy), Route 53 returns the secondary ELB’s IPs. Traffic goes to us-west-2.
  6. Application Traffic: Traffic hits the chosen ELB, which then forwards it to healthy ECS tasks in its region.
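The decision in steps 3 to 5 can be sketched as a small function. This is a model of the behavior described above, not AWS code: elb_is_healthy and route53_answer are hypothetical names, and the last branch (answering with the primary when both sides are unhealthy) reflects my understanding of Route 53’s documented failover behavior, so verify it before relying on it.

```python
PRIMARY = "my-elb-primary-1234567890.us-east-1.elb.amazonaws.com"
SECONDARY = "my-elb-secondary-0987654321.us-west-2.elb.amazonaws.com"


def elb_is_healthy(task_health):
    """Step 4: the ELB is healthy if at least one registered task is healthy.

    task_health maps a task name to True (healthy) or False (unhealthy).
    """
    return any(task_health.values())


def route53_answer(primary_tasks, secondary_tasks):
    """Steps 3 and 5: pick which ELB the DNS answer points at."""
    if elb_is_healthy(primary_tasks):
        return PRIMARY
    if elb_is_healthy(secondary_tasks):
        return SECONDARY
    # Assumption: with nothing healthy anywhere, the primary is still returned.
    return PRIMARY


east = {"task-1": False, "task-2": False}
west = {"task-3": True}
print(route53_answer(east, west))  # the us-west-2 ELB
```

Passing {"task-1": True, "task-2": False} for the primary instead returns the us-east-1 ELB, since one healthy task is enough to keep the primary in service in this model.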

The most surprising thing most people miss is that Route 53 doesn’t directly inspect your ECS tasks. It relies entirely on the ELB’s reporting, which is itself based on the ELB’s configured health check against your tasks. You can have a perfectly fine ECS task running, but if the ELB’s health check path (/healthz) is misconfigured or the ELB can’t reach the task, the ELB marks it unhealthy, and once no healthy targets remain Route 53 will still trigger a failover. The EvaluateTargetHealth flag on the Route 53 record set is the bridge, telling Route 53 to use the ELB’s reported health instead of assuming the endpoint is always up.

The next thing to consider is how to handle state and data consistency during a failover.
