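Envoy’s load balancing is simpler than you might think: for each request (or connection), Envoy itself picks one upstream host from a cluster, using the cluster’s load balancing policy and its current view of which hosts are healthy. The upstream hosts play no part in that decision.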

Let’s see what that looks like. Imagine you have a service running on three different machines: 10.0.0.1:8080, 10.0.0.2:8080, and 10.0.0.3:8080. In your Envoy configuration, you’d define a cluster for this service; let’s call it my_service.

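static_resources:
  clusters:
  - name: my_service
    type: STRICT_DNS
    connect_timeout: 1s
    lb_policy: ROUND_ROBIN
    # load_assignment is the v3 form of the old `hosts` list,
    # which was removed from the v3 API.
    load_assignment:
      cluster_name: my_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 10.0.0.1
                port_value: 8080
        - endpoint:
            address:
              socket_address:
                address: 10.0.0.2
                port_value: 8080
        - endpoint:
            address:
              socket_address:
                address: 10.0.0.3
                port_value: 8080
    health_checks:
    - timeout: 1s
      interval: 5s
      interval_jitter: 1s
      # Both thresholds are required in the v3 API; these values are illustrative.
      unhealthy_threshold: 3
      healthy_threshold: 2
      http_health_check:
        path: "/healthz"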

Here’s the breakdown of what’s happening and why it’s interesting:

  • name: my_service: This is just a label for this group of upstream hosts. When you configure a route to send traffic to this service, you’ll reference my_service.
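  • type: STRICT_DNS: This tells Envoy to resolve each configured address through DNS, asynchronously and continuously, and to treat every IP address returned as a separate upstream host. If the DNS records change, the host list follows on the next resolution. With literal IP addresses, as in this example, STATIC would be the more natural choice. Other types include LOGICAL_DNS (which uses only the first IP address returned whenever a new connection is needed, a better fit for large services behind round-robin DNS), STATIC (where you hardcode IPs and don’t expect them to change), and EDS (where a discovery service supplies the endpoints).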
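  • connect_timeout: 1s: If Envoy can’t establish a TCP connection to an upstream host within 1 second, the connection attempt fails. For HTTP traffic, the request then fails with a 503 unless a retry policy sends it elsewhere. The timeout alone does not mark the host unhealthy, although connect failures can count toward outlier detection if that is configured. This is a crucial tuning parameter to avoid long waits for unresponsive services.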
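  • lb_policy: ROUND_ROBIN: This is where the "load balancing" comes in, but it’s very simple. Envoy cycles through the healthy hosts in turn, taking host weights into account if you set them. If you had LEAST_REQUEST, Envoy would favor hosts with fewer active requests (not connections); by default, when all hosts have the same weight, it picks two healthy hosts at random and chooses the one with fewer requests in flight. RING_HASH and MAGLEV are for consistent hashing, useful when requests with the same key should keep landing on the same host, as with caches and other stateful applications.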
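  • hosts (load_assignment in the v3 API): This is the core of your upstream service definition. Each socket_address entry is a distinct instance of your service. The hosts field is the legacy v2 spelling and was removed in the v3 API, where the same list is written under load_assignment with one lb_endpoints entry per instance.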
  • health_checks: This is where Envoy actively monitors the health of your upstream hosts.
    • timeout: 1s: The health check probe itself must complete within 1 second.
    • interval: 5s: Envoy will send a health check probe every 5 seconds.
    • interval_jitter: 1s: To avoid thundering herd problems where all health checks hit at the exact same millisecond, Envoy adds a random jitter of up to 1 second to the interval.
    • http_health_check: { path: "/healthz" }: This specifies that Envoy should make an HTTP GET request to the /healthz path on the upstream host. By default, only a 200 response is considered healthy; use expected_statuses to accept other status codes. You can also configure TCP or gRPC health checks.
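
As a rough sketch of two of the knobs mentioned above, the fragment below swaps in LEAST_REQUEST and widens the set of status codes that count as healthy. It is a partial cluster definition rather than a complete config, and the threshold values and the 200-299 range are illustrative assumptions, not recommendations.

```yaml
# Fragment of a cluster definition (v3 API); all values are illustrative.
lb_policy: LEAST_REQUEST
least_request_lb_config:
  choice_count: 2            # hosts sampled per pick (2 is the default)
health_checks:
- timeout: 1s
  interval: 5s
  unhealthy_threshold: 3     # consecutive failures before a host is marked unhealthy
  healthy_threshold: 2       # consecutive successes before it is marked healthy again
  http_health_check:
    path: "/healthz"
    expected_statuses:
    - start: 200             # inclusive
      end: 300               # exclusive, so this accepts 200-299
```

Leaving expected_statuses out keeps the stricter default, which is usually what you want for a dedicated health endpoint.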

When a request comes into Envoy destined for my_service, Envoy will:

  1. Look at the list of hosts for my_service.
  2. Filter out any hosts that are currently marked as unhealthy by failed health checks or ejected by outlier detection.
  3. Apply the lb_policy (e.g., ROUND_ROBIN) to select one of the healthy hosts.
  4. Attempt to establish a connection to the selected host (or reuse a pooled one). If the connection isn’t established within the connect_timeout, the attempt fails. Envoy tries another host for this specific request only if the route has a retry policy that covers connect failures; otherwise it returns a 503.
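
Whether step 4 ends in a retry depends on the route, not the cluster. The fragment below is a minimal sketch of a route that retries connect failures; the names local_route and my_service_vhost, the catch-all match, and the retry count are assumptions made for illustration.

```yaml
# Fragment of an HTTP connection manager's route_config (v3 API).
# Names and values here are hypothetical.
route_config:
  name: local_route
  virtual_hosts:
  - name: my_service_vhost
    domains: ["*"]
    routes:
    - match:
        prefix: "/"
      route:
        cluster: my_service
        retry_policy:
          retry_on: connect-failure   # retry when the upstream connection fails or times out
          num_retries: 2              # at most two retries per request
```

Each retry goes back through host selection, so it may or may not land on a different host.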

The most surprising thing about Envoy’s active health checking is how little it asks of the upstream: the service doesn’t need to register with Envoy or even know it is being checked; it just needs to respond to the probes.

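The hosts list can also be populated dynamically via service discovery. Instead of listing socket_address entries, you’d set the cluster’s type to EDS and point eds_cluster_config at an endpoint discovery service (an xDS management server), or at a file that Envoy watches for changes. Platforms such as Kubernetes and Consul are typically integrated through a control plane that speaks this API.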
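
In the v3 API, dynamic endpoint discovery is expressed with type: EDS and eds_cluster_config. Here is a hedged sketch of such a cluster. It assumes a gRPC management server reachable through a separately defined static cluster, here given the hypothetical name xds_cluster, which is not shown.

```yaml
# Sketch of an EDS-backed cluster (v3 API). xds_cluster is a hypothetical
# static cluster pointing at your management server; it is not defined here.
static_resources:
  clusters:
  - name: my_service
    type: EDS
    connect_timeout: 1s
    lb_policy: ROUND_ROBIN
    eds_cluster_config:
      eds_config:
        resource_api_version: V3
        api_config_source:
          api_type: GRPC
          transport_api_version: V3
          grpc_services:
          - envoy_grpc:
              cluster_name: xds_cluster
```

The load balancing policy and health checks work the same way as before; only the source of the endpoint list changes.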

The one thing most people don’t know is that Envoy’s health checking has two layers. There is active health checking (the probes you configure) and passive health checking, known as outlier detection, which temporarily ejects hosts from the load balancing pool based on how real traffic fares, for example consecutive 5xx responses or connection failures. Outlier detection is opt-in through the cluster’s outlier_detection field. Once it is enabled, Envoy can eject a host even while its /healthz endpoint is returning 200 OK, if that host keeps failing to accept TCP connections or keeps returning the error responses that outlier detection counts.
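
A minimal sketch of the outlier detection side follows, as a fragment to place inside the my_service cluster. The numbers are illustrative assumptions rather than tuned values.

```yaml
# Fragment of a cluster definition (v3 API); values are illustrative.
outlier_detection:
  consecutive_5xx: 5          # eject after five consecutive 5xx responses
  interval: 10s               # how often ejection analysis runs
  base_ejection_time: 30s     # ejection duration, multiplied by the host's ejection count
  max_ejection_percent: 50    # never eject more than half of the cluster
```

In the default mode, locally originated failures such as connect timeouts are counted as 5xx, so this one threshold covers both failing responses and unreachable hosts.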

Once you have your clusters configured, the next logical step is to define how incoming requests are routed to these clusters using routes and listeners.
