Envoy’s health checking isn’t just about whether a backend is up; it’s about whether it’s ready to serve traffic. It does this with two distinct yet complementary mechanisms: active and passive.

Let’s see this in action. Imagine a simple upstream cluster my_service with two identical instances, 10.0.0.1:8080 and 10.0.0.2:8080.

static_resources:
  clusters:
  - name: my_service
    connect_timeout: 0.25s
    # STATIC fits a fixed list of IP endpoints; LOGICAL_DNS expects a single DNS name
    type: STATIC
    lb_policy: ROUND_ROBIN
    # Active Health Check configuration
    health_checks:
    - timeout: 1s
      interval: 1s
      interval_jitter: 0.5s
      unhealthy_threshold: 3
      healthy_threshold: 2
      http_health_check:
        path: "/healthz"
        method: GET
        expected_statuses:
        - start: 200
          end: 201  # the end of the range is exclusive, so this matches only 200
    # Passive Health Check configuration
    outlier_detection:
      consecutive_5xx: 3
      interval: 10s
      base_ejection_time: 30s
      max_ejection_percent: 50
    load_assignment:
      cluster_name: my_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 10.0.0.1
                port_value: 8080
        - endpoint:
            address:
              socket_address:
                address: 10.0.0.2
                port_value: 8080

Active health checks are Envoy actively probing your upstream services. Think of them as regular, scheduled "pings" that confirm a service is responsive and returning the expected results. In the configuration above, Envoy sends a GET request to /healthz on each upstream host roughly every second (interval: 1s). A probe fails if the host does not respond within 1 second (timeout: 1s) or responds with a status outside the expected range (only 200 here). Timeouts and connection failures are counted, and after 3 consecutive failures (unhealthy_threshold: 3) the host is marked unhealthy and removed from the load balancing pool. An unexpected HTTP status is treated more strictly: per Envoy’s documentation, the threshold is ignored and the host is marked unhealthy immediately, unless that status is listed in retriable_statuses. Conversely, an unhealthy host needs 2 consecutive successful health checks (healthy_threshold: 2) to be marked healthy again. The interval_jitter adds a random delay of up to 0.5 seconds to each interval, so that probes from many Envoy instances do not hit the upstreams in lockstep.
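As a hedged sketch of how this probe behavior can be tuned, the fragment below extends the health_checks entry with two optional fields that are not in the original configuration: unhealthy_interval and retriable_statuses. The values are illustrative assumptions, not recommendations, and the fragment is meant to sit under a cluster entry like my_service.

```yaml
# Sketch only: a variant of the health_checks entry shown earlier.
health_checks:
- timeout: 1s
  interval: 1s
  interval_jitter: 0.5s
  unhealthy_threshold: 3
  healthy_threshold: 2
  # Assumed addition: probe hosts that are already unhealthy less often.
  unhealthy_interval: 5s
  http_health_check:
    path: "/healthz"
    expected_statuses:
    - start: 200
      end: 201          # end is exclusive, so only 200 is accepted
    # Assumed addition: a 503 counts toward unhealthy_threshold instead of
    # marking the host unhealthy on the first occurrence.
    retriable_statuses:
    - start: 503
      end: 504
```

With this variant, a host that briefly answers 503 during a restart would need three such responses in a row before leaving the pool, while any other unexpected status would still remove it right away.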

Passive health checking, on the other hand, relies on observing actual traffic and error responses. Envoy watches for symptoms of distress without sending its own dedicated probes. In our example, the outlier_detection section configures passive health checks. If an upstream host returns 3 consecutive 5xx responses (consecutive_5xx: 3), Envoy considers it an outlier and ejects it from the load balancing pool for 30 seconds on its first ejection (base_ejection_time: 30s); the ejection time grows when the same host is ejected repeatedly. Ejection is triggered by real traffic alone, with no probe involved. The interval: 10s setting is the period of Envoy’s outlier analysis sweep, which is when hosts whose ejection time has elapsed are returned to the pool. max_ejection_percent: 50 caps ejections at half of the cluster’s hosts, so outlier detection cannot take the whole service out of rotation when several hosts are genuinely struggling.
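To make the ejection timing concrete, here is a minimal sketch of the same outlier_detection block with two related fields spelled out. enforcing_consecutive_5xx and max_ejection_time are not in the original configuration; the values shown are what I understand to be Envoy’s defaults, so treat them as assumptions and check them against the version you run.

```yaml
# Sketch only: the outlier_detection block with ejection timing annotated.
outlier_detection:
  consecutive_5xx: 3
  # Assumed addition: percentage chance that a detected outlier is actually
  # ejected. Lower values allow a cautious rollout; 0 detects without ejecting.
  enforcing_consecutive_5xx: 100
  interval: 10s
  base_ejection_time: 30s
  # Assumed addition: upper bound on how long a single ejection can last.
  max_ejection_time: 300s
  max_ejection_percent: 50
  # Ejection time is base_ejection_time multiplied by the number of times the
  # host has been ejected, capped at max_ejection_time:
  #   1st ejection: 30s, 2nd: 60s, 3rd: 90s, and so on up to 300s.
```

Because hosts are only returned to the pool during the analysis sweep, a 30 second ejection with a 10 second interval can last slightly longer than 30 seconds in practice.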

Together, active and passive checks provide a robust system. Active checks catch services that are completely unresponsive or misconfigured, while passive checks catch services that are technically running but failing under load or experiencing transient issues that a simple probe might miss. This dual approach helps keep traffic on healthy, responsive endpoints and reduces user-facing errors.

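The most surprising part of Envoy’s health checking is how the two mechanisms interact. When both are configured, a successful active health check by default returns a host that outlier detection has ejected to the pool and clears its outlier counters. Recent Envoy versions let you turn this off with the outlier detection setting successful_active_health_check_uneject_host, so that an ejected host serves its full ejection time regardless of what the probes report.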
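One concrete interaction is worth sketching. As far as I know, recent Envoy versions expose an outlier detection field named successful_active_health_check_uneject_host, which defaults to true. The fragment below is a minimal sketch that assumes your Envoy version supports this field; it sets it to false so that a passing /healthz probe does not cut an ejection short.

```yaml
# Sketch only: keep passive ejections in force even when active probes pass.
# Assumes an Envoy version that supports
# successful_active_health_check_uneject_host.
health_checks:
- timeout: 1s
  interval: 1s
  unhealthy_threshold: 3
  healthy_threshold: 2
  http_health_check:
    path: "/healthz"
outlier_detection:
  consecutive_5xx: 3
  interval: 10s
  base_ejection_time: 30s
  max_ejection_percent: 50
  # Default is true: a successful active health check unejects the host.
  # false: the host stays ejected until its ejection time has elapsed.
  successful_active_health_check_uneject_host: false
```

This can be a reasonable choice when /healthz is a shallow check that keeps passing while real requests fail, since otherwise the next successful probe would put the struggling host straight back into rotation.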

The next concept to explore is how these health check states are exposed via Envoy’s statistics and administrative interface, allowing you to monitor their behavior in real-time.
