Envoy’s outlier detection doesn’t just mark hosts as unhealthy; it actively ejects them from the load balancing pool until they recover.
Let’s see this in action. Imagine a simple setup where a frontend service talks to a backend service. The backend has three instances, and Envoy is load balancing requests to them.
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 10000
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: backend_cluster
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: backend_cluster
    connect_timeout: 0.25s
    # STATIC because the endpoints are literal IPs; LOGICAL_DNS accepts only a single endpoint
    type: STATIC
    lb_policy: ROUND_ROBIN
    # Outlier detection configuration
    outlier_detection:
      consecutive_5xx: 1
      interval: 5s
      base_ejection_time: 30s
      max_ejection_percent: 50
      # enforcing_* values are percentages (0-100); 100 means always eject
      enforcing_consecutive_5xx: 100
      enforcing_success_rate: 100
      success_rate_minimum_hosts: 1
      success_rate_request_volume: 10
    load_assignment:
      cluster_name: backend_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 8081
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 8082
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 8083
# Admin interface, needed for the curl commands against port 9901 further down
admin:
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 9901
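The backend_cluster is configured with outlier detection. consecutive_5xx (the field name in Envoy’s v3 API) is set to 1, so a single 5xx response from a backend instance is enough to flag it as an outlier. interval is 5s, so Envoy runs its analysis sweep every 5 seconds; the sweep recalculates success rates and returns hosts whose ejection time has expired, while consecutive-5xx ejections happen as soon as the threshold is hit. base_ejection_time is 30s: a first ejection lasts 30 seconds, and repeated ejections last longer, because the duration is base_ejection_time multiplied by the number of times the host has been ejected in a row, capped by max_ejection_time. max_ejection_percent is 50, so Envoy won’t eject more than about half of the hosts in the cluster, which prevents a complete outage if many hosts look unhealthy at once. enforcing_consecutive_5xx and enforcing_success_rate are percentages from 0 to 100 giving the chance that a detected outlier is actually ejected; 100 means always eject, whereas a value of 1 would eject only about 1% of the time and merely record the rest in stats.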
Now, let’s simulate a failure. If 127.0.0.1:8081 starts returning 503 Service Unavailable errors, Envoy will detect this.
# Send one request through Envoy; assumes the backend returns a 503 when it sees X-Simulate-Error
curl -v -X GET http://127.0.0.1:10000/ -o /dev/null -w "%{http_code}\n" -s --fail -m 5 \
--retry 0 -H "X-Simulate-Error: 503"
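After one such response from 127.0.0.1:8081, Envoy’s outlier detection kicks in. Envoy increments that host’s consecutive 5xx counter, which reaches the consecutive_5xx threshold of 1, so the host is flagged as an outlier. Provided the enforcement percentage (enforcing_consecutive_5xx) lets the ejection through, Envoy then ejects 127.0.0.1:8081 from the load balancing pool for the backend_cluster. Because requests are distributed round-robin, it can take up to three requests before one lands on the failing instance.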
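The counting logic described above can be sketched in a few lines of Python. This is a toy model rather than Envoy’s implementation: HostState and record_response are hypothetical names, and the sketch deliberately ignores the enforcement percentage and max_ejection_percent.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    """Toy per-host state; not Envoy's actual data structure."""
    consecutive_5xx: int = 0
    ejected: bool = False


def record_response(host: HostState, status: int, threshold: int = 1) -> bool:
    """Record one response; return True if it triggers an ejection."""
    if 500 <= status <= 599:
        host.consecutive_5xx += 1
    else:
        host.consecutive_5xx = 0  # any non-5xx response resets the streak
    if not host.ejected and host.consecutive_5xx >= threshold:
        host.ejected = True
        return True
    return False


host = HostState()
print(record_response(host, 200))  # False: no error yet
print(record_response(host, 503))  # True: threshold of 1 reached
```

With a threshold of 5 (Envoy’s default for consecutive_5xx), a single 200 in the middle of a run of errors would reset the streak, which is why the consecutive check alone can miss hosts that fail intermittently.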
You can observe this by querying Envoy’s admin interface for cluster health:
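# List cluster names
curl -s 'http://127.0.0.1:9901/clusters?format=json' | jq '.cluster_statuses[].name'
# Show each backend_cluster host with its health flags
curl -s 'http://127.0.0.1:9901/clusters?format=json' | jq '.cluster_statuses[] | select(.name == "backend_cluster") | .host_statuses[] | {address: .address.socket_address, health: .health_status}'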
Initially, you’ll see all three endpoints listed with no outlier flag. After the ejection, 127.0.0.1:8081 stays in the list, but its health_status shows failed_outlier_check set to true. Envoy will continue to send requests only to 127.0.0.1:8082 and 127.0.0.1:8083.
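If you would rather check for ejections from a script than read jq output, the same admin JSON can be filtered in Python. The sketch below assumes the field names of Envoy’s admin v3 Clusters message (cluster_statuses, host_statuses, health_status.failed_outlier_check); ejected_hosts is a hypothetical helper, and the sample payload is hand-written for illustration, not captured from a running Envoy.

```python
def ejected_hosts(clusters_json: dict, cluster_name: str) -> list[str]:
    """Return "ip:port" for every host flagged with failed_outlier_check."""
    ejected = []
    for cluster in clusters_json.get("cluster_statuses", []):
        if cluster.get("name") != cluster_name:
            continue
        for host in cluster.get("host_statuses", []):
            health = host.get("health_status", {})
            if health.get("failed_outlier_check", False):
                sock = host["address"]["socket_address"]
                ejected.append(f'{sock["address"]}:{sock["port_value"]}')
    return ejected


def _host(port: int, ejected: bool) -> dict:
    """Build one illustrative host entry in the assumed admin JSON shape."""
    status = {"failed_outlier_check": True} if ejected else {}
    return {
        "address": {"socket_address": {"address": "127.0.0.1", "port_value": port}},
        "health_status": status,
    }


# Hand-written sample: 8081 is ejected, the other two hosts are healthy.
sample = {
    "cluster_statuses": [
        {
            "name": "backend_cluster",
            "host_statuses": [_host(8081, True), _host(8082, False), _host(8083, False)],
        }
    ]
}

print(ejected_hosts(sample, "backend_cluster"))  # ['127.0.0.1:8081']
```

Against a live proxy, you would pass in the parsed body of http://127.0.0.1:9901/clusters?format=json instead of the sample.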
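Outlier detection is passive: Envoy does not probe an ejected host. Once the ejection time has elapsed (30 seconds for a first ejection), the next interval sweep returns 127.0.0.1:8081 to the load balancing pool, and real traffic decides what happens next. If the host is still returning 5xx responses, it is ejected again, this time for longer, because the ejection time is base_ejection_time multiplied by the number of times the host has been ejected in a row (capped by max_ejection_time). Probing a host before putting it back in service requires active health checking, which is configured separately.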
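To make the timing concrete: Envoy’s documentation describes the ejection duration as base_ejection_time multiplied by the number of times the host has been ejected in a row, capped by max_ejection_time (300 seconds by default). A minimal sketch of that arithmetic, with ejection_seconds as a hypothetical helper name:

```python
def ejection_seconds(
    base_ejection_time: float,
    times_ejected: int,
    max_ejection_time: float = 300.0,
) -> float:
    """Ejection duration in seconds: base * ejection count, capped at the maximum."""
    return min(base_ejection_time * times_ejected, max_ejection_time)


for n in (1, 2, 3, 20):
    print(n, ejection_seconds(30.0, n))
```

With the 30s base used here, a host that keeps failing is ejected for 30, 60, then 90 seconds, and so on up to the cap. This ignores jitter and the way the multiplier decreases again once a host stays healthy.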
Success-rate outlier detection aggregates each host’s success rate over every interval. A host is included only if it handled at least success_rate_request_volume requests in that interval, and the analysis runs only when at least success_rate_minimum_hosts hosts qualify. Envoy then flags hosts whose success rate falls below the cluster mean minus a multiple of the standard deviation (the multiple is set by success_rate_stdev_factor). This is particularly useful for hosts that fail intermittently: they never produce enough consecutive 5xx responses to trip the consecutive check, yet they fail noticeably more often than their peers.
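The threshold calculation is easier to follow with numbers. The sketch below assumes the formula from Envoy’s documentation, threshold = mean - stdev * (success_rate_stdev_factor / 1000), with the default factor of 1900; using the population standard deviation is an assumption on my part, and success_rate_outliers is a hypothetical name.

```python
from statistics import mean, pstdev


def success_rate_outliers(
    stats: dict[str, tuple[int, int]],  # host -> (successes, total requests)
    minimum_hosts: int = 5,
    request_volume: int = 100,
    stdev_factor: int = 1900,
):
    """Return (threshold, outlier hosts), or (None, []) if the analysis is skipped."""
    # Only hosts with enough traffic in the interval take part.
    rates = {
        host: 100.0 * ok / total
        for host, (ok, total) in stats.items()
        if total >= request_volume
    }
    if len(rates) < minimum_hosts:
        return None, []
    values = list(rates.values())
    threshold = mean(values) - pstdev(values) * (stdev_factor / 1000.0)
    return threshold, sorted(h for h, r in rates.items() if r < threshold)


five = {f"h{i}": (100, 100) for i in range(1, 5)}
five["h5"] = (50, 100)  # one host succeeds only half the time
print(success_rate_outliers(five, minimum_hosts=1, request_volume=10))

three = {"h1": (100, 100), "h2": (100, 100), "h3": (50, 100)}
print(success_rate_outliers(three, minimum_hosts=1, request_volume=10))
```

In this model, the five-host cluster flags h5, while the three-host cluster flags nothing: a single bad host pulls the mean down and the deviation up so much that it stays above the threshold. With only three backends, as in this article’s setup, the consecutive-5xx check is likely to do most of the work.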
The enforcing_success_rate parameter is the percentage chance (0 to 100) that a host flagged by success-rate analysis is actually ejected. At 0, Envoy only records the detection in its stats; at 100, it always ejects. Note that success_rate_iov_fail_open is not a field of Envoy’s v3 OutlierDetection message, and Envoy normally rejects configuration containing unknown fields, so it should be left out. The protection against ejecting too much of the cluster during a widespread problem comes from max_ejection_percent.
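The interval setting controls how often Envoy runs its analysis sweep, which recalculates success rates and returns hosts whose ejection time has expired. A shorter interval means success-rate outliers are caught sooner and recovered hosts return sooner, but each success-rate sample then covers fewer requests, so hosts are less likely to reach success_rate_request_volume. These sweeps are passive bookkeeping, not health checks.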
The max_ejection_percent setting is a safety valve. Once roughly 50% of the cluster’s hosts are ejected, Envoy stops ejecting further hosts, however they behave. This prevents a widespread but temporary issue from emptying the load balancing pool. Envoy’s default is a more conservative 10%, and at least one host can be ejected regardless of the value.
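The cap can be modelled as a check that runs before each ejection. This is a simplified sketch with a hypothetical can_eject helper: it allows an ejection if nothing is ejected yet, or if the ejected share would stay at or below max_ejection_percent afterwards. Envoy’s exact comparison has differed between versions, so treat the boundary cases as illustrative only.

```python
def can_eject(total_hosts: int, currently_ejected: int, max_ejection_percent: float) -> bool:
    """Simplified model of the max_ejection_percent guard."""
    if currently_ejected == 0:
        return True  # at least one host may always be ejected
    share_after = 100.0 * (currently_ejected + 1) / total_hosts
    return share_after <= max_ejection_percent


# Ten hosts, 50% cap: the fifth ejection is allowed, the sixth is not.
print(can_eject(10, 4, 50))  # True
print(can_eject(10, 5, 50))  # False
```

Under this model, a three-host cluster with a 50% cap permits only one ejection at a time, since a second would put two thirds of the cluster out of rotation.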
A common pitfall is setting consecutive_5xx too low (such as 1, as in this demo; the default is 5) or base_ejection_time too short. This can lead to "flapping," where hosts are repeatedly ejected and re-added, causing instability. Conversely, setting base_ejection_time too high keeps a recovered host out of rotation longer than necessary, which reduces capacity, and the effect compounds because repeated ejections multiply the base time.
The next thing you’ll likely encounter is dealing with upstream connection timeouts and understanding how they interact with outlier detection.