Envoy’s circuit breakers aren’t about preventing electrical fires; they’re about ensuring your service mesh doesn’t cascade into a complete outage when one of its dependencies hiccups.
Let’s watch one in action. Imagine service-a making requests to service-b.
# envoy.yaml
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 10000 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route:
                  cluster: service_b
          http_filters:
          - name: envoy.filters.http.router
            # The v3 API needs an explicit type URL here; an empty typed_config is rejected.
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: service_b
    connect_timeout: 0.25s
    type: LOGICAL_DNS
    lb_policy: ROUND_ROBIN
    # This is where the magic happens
    circuit_breakers:
      thresholds:
      - priority: HIGH
        max_requests: 10
      - priority: DEFAULT
        max_requests: 100
    load_assignment:
      cluster_name: service_b
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: 127.0.0.1, port_value: 8080 }
If service-b is running on 127.0.0.1:8080 and service-a is sending requests to Envoy on port 10000, Envoy will forward them to service-b. Now, let’s say service-b starts to get overloaded and is slow to respond. Without circuit breakers, service-a would keep hammering it, potentially exhausting its own resources trying to connect and wait.
Envoy’s circuit breakers interrupt this. They are defined per cluster, with a separate set of thresholds for each routing priority (DEFAULT and HIGH). When the number of outstanding requests to a cluster reaches max_requests for a given priority, Envoy stops forwarding new requests at that priority. Instead of queueing them or letting them time out, it fails them immediately with an HTTP 503 Service Unavailable, sets the x-envoy-overloaded response header, and records the UO (upstream overflow) response flag. Failing fast is the point: the caller gets an answer right away, and the upstream service gets room to recover.
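max_requests is the most commonly tuned setting, but there are others. max_connections limits the number of connections Envoy will open to all hosts in the upstream cluster. max_pending_requests caps the number of requests queued while waiting for a connection to become available. max_retries limits the number of retries that can be in flight to the cluster at any one time; the number of attempts for an individual request is set separately, in the route’s retry policy. Any threshold you leave unset falls back to Envoy’s default: 1024 for connections, pending requests, and requests, and 3 for retries.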
Consider this configuration for service_b:
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_requests: 1000
    max_connections: 500
    max_pending_requests: 100
    max_retries: 5
Here, Envoy allows at most 1000 outstanding requests to service_b and at most 500 open connections. If 100 requests are already queued waiting for a connection to free up, the 101st is rejected immediately. And no more than 5 retries can be in flight to the cluster at once; further retry attempts are dropped until some of those complete.
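The difference between retrying one request and capping retries across the cluster is easy to miss, so here is a rough sketch of the two settings side by side. The retry_on condition and the counts are illustrative assumptions, not values taken from the configuration above.

```yaml
# Fragment 1: a route entry (under virtual_hosts[].routes).
# Controls how often ONE request may be retried.
- match: { prefix: "/" }
  route:
    cluster: service_b
    retry_policy:
      retry_on: "5xx"   # assumed condition; pick what fits your service
      num_retries: 2    # each request is retried at most twice
---
# Fragment 2: the cluster's circuit breaker.
# Controls how many retries may be in flight to service_b at once.
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_retries: 5
```

With both in place, a single request gives up after two retries, and the cluster as a whole never carries more than five retries at a time, which keeps a burst of failures from turning into a retry storm.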
The priority field separates critical from non-critical traffic. Each priority has its own thresholds and its own counters, so HIGH-priority requests are not counted against the DEFAULT limits. A request is treated as HIGH only when the route that matches it sets priority: HIGH; everything else is DEFAULT. This is often used for health-check or critical administrative endpoints that should still be served while the rest of the cluster’s traffic is being shed. You might set max_requests for HIGH much lower than for DEFAULT: because it is a separate budget, a flood of less critical traffic cannot use it up, and essential traffic still gets through.
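As a sketch of how traffic ends up in the HIGH bucket, the route below marks one path prefix as high priority. The /admin prefix is a hypothetical example; the priority field on the route action is what selects which set of thresholds applies.

```yaml
# Fragment of route_config.virtual_hosts[].routes (first match wins)
routes:
- match: { prefix: "/admin" }   # hypothetical critical endpoint
  route:
    cluster: service_b
    priority: HIGH              # counted against the HIGH thresholds
- match: { prefix: "/" }
  route:
    cluster: service_b          # no priority set, so DEFAULT applies
```

The more specific route has to come first, since Envoy uses the first route that matches.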
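One subtle point is how these limits relate to upstream host health. A circuit-breaker rejection is not a random error; it signals that the cluster as a whole is at capacity. Recent Envoy releases deliberately do not count these rejections against any individual host, so a tripped breaker does not by itself cause a host to be marked unhealthy. That job belongs to outlier detection, which is often configured alongside circuit breakers: it watches the responses each host actually returns (consecutive 5xx responses, for example) and temporarily ejects hosts that misbehave. Circuit breakers cap the load on the cluster while outlier detection steers traffic away from struggling instances, and this combination is key to graceful degradation.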
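A minimal sketch of the two mechanisms configured on the same cluster might look like the following. The numbers are illustrative assumptions rather than recommendations, and ejection is most meaningful when the cluster has more than one endpoint.

```yaml
clusters:
- name: service_b
  # ... connect_timeout, type, lb_policy, load_assignment as before ...
  circuit_breakers:
    thresholds:
    - priority: DEFAULT
      max_requests: 100
  outlier_detection:
    consecutive_5xx: 5         # eject a host after 5 consecutive 5xx responses
    interval: 10s              # time between ejection analysis sweeps
    base_ejection_time: 30s    # base ejection duration, scaled by ejection count
    max_ejection_percent: 50   # upper bound on the share of hosts ejected
```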
The rejection happens inside the Envoy proxy, before the request ever reaches the network socket of the overloaded upstream service. Envoy keeps a counter for each of these limits. When a request is routed to a cluster, Envoy checks the relevant counter: if it is already at the threshold, the request is failed immediately with a 503; otherwise the counter is incremented and the request proceeds. There is no timer or half-open state as in a classic circuit breaker. Once the upstream service recovers and starts responding faster, requests complete, the counters drop back below their thresholds, and the circuit breaker "closes" again, allowing traffic to resume.
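These counters are visible as cluster stats. As a small sketch, setting track_remaining on a threshold asks Envoy to also publish how much headroom is left, for example in the cluster.service_b.circuit_breakers.default.remaining_rq gauge; the threshold value here is just the one from the first example.

```yaml
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_requests: 100
    track_remaining: true   # publish remaining_* gauges for this priority
```

Watching the remaining gauges approach zero is a way to see a breaker about to trip before requests start failing.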
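If you’ve configured circuit breakers and are still seeing issues, look for 503 errors in the Envoy access logs. A 503 carrying the UO response flag, with no corresponding entry in your upstream service’s logs showing that it received the request, is a strong indicator that Envoy’s circuit breakers are tripping. The cluster stats confirm it: upstream_rq_pending_overflow, upstream_cx_overflow, and upstream_rq_retry_overflow count the overflow events for requests, connections, and retries, and the circuit_breakers.<priority>.*_open gauges show which breaker is open right now.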
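If your access logs do not already show response flags, a stdout access logger along these lines would surface them. This is a sketch: the format string is an assumption chosen for readability, and the fragment belongs inside the HttpConnectionManager typed_config from the first example.

```yaml
# Fragment of the HttpConnectionManager typed_config
access_log:
- name: envoy.access_loggers.stdout
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
    log_format:
      text_format_source:
        # %RESPONSE_FLAGS% prints UO when a circuit breaker rejected the request
        inline_string: "%START_TIME% %REQ(:METHOD)% %REQ(:PATH)% %RESPONSE_CODE% %RESPONSE_FLAGS% %UPSTREAM_HOST%\n"
```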
The next thing you’ll likely want to tune is the outlier detection configuration for your clusters.