Envoy retries aren’t just about re-sending a failing request; they’re a sophisticated mechanism for masking transient network issues and intermittent service failures by intelligently re-routing traffic.
Let’s see Envoy retries in action. Imagine a frontend service (frontend-app) calling a backend service (user-service). If user-service is momentarily overloaded and returns a 503 (Service Unavailable), frontend-app shouldn’t immediately fail. Instead, Envoy, acting as a sidecar for frontend-app, can intercept that 503 and retry the request, typically against a different instance of user-service.
Here’s a snippet from an Envoy configuration showing a retry policy:
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 10000
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match:
                  prefix: "/users"
                route:
                  cluster: user_service
                  # Retry policy configuration
                  retry_policy:
                    # Condition names use hyphens (gateway-error, not gateway_error)
                    retry_on: 5xx,gateway-error,connect-failure,refused-stream
                    num_retries: 3
                    per_try_timeout: 2s
                    # Backoff bounds are nested under retry_back_off
                    retry_back_off:
                      base_interval: 0.1s
                      max_interval: 1s
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: user_service
    connect_timeout: 0.25s
    # STATIC because the endpoints are literal IPs; LOGICAL_DNS accepts only a single host
    type: STATIC
    lb_policy: ROUND_ROBIN
    outlier_detection:
      # Eject a host after a single 5xx response
      consecutive_5xx: 1
      interval: 10s
      base_ejection_time: 30s
    # In the v3 API, endpoints are listed under load_assignment instead of hosts
    load_assignment:
      cluster_name: user_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 10.0.0.1
                port_value: 8080
        - endpoint:
            address:
              socket_address:
                address: 10.0.0.2
                port_value: 8080
The core problem Envoy retries solve is transient failures. Network hiccups, a brief spike in CPU load on a service instance, a quick garbage collection pause – these are all common, short-lived issues that don’t indicate a fundamental problem with the service. Without retries, a single such blip can cascade into a user-facing error. Envoy’s retry policy allows the system to gracefully absorb these minor disturbances.
Internally, when an attempt against an upstream cluster fails in a way that matches the retry_on criteria (a matching response, or a failure such as a refused connection), Envoy doesn’t immediately propagate the result to the caller. Instead, it consults the retry_policy and re-sends the request. The load balancer picks a host for each attempt, so the retry may land on a different upstream host, although that is not guaranteed by default. The num_retries field dictates how many times Envoy will attempt this. per_try_timeout is crucial: it sets a deadline for each individual attempt, preventing a single slow request from holding up the retry mechanism indefinitely. base_interval and max_interval control the backoff between retries; Envoy uses fully jittered exponential backoff (by default a 25ms base interval and a maximum of 10 times the base), meaning the upper bound on the wait grows with each retry while the actual wait is randomized to avoid thundering herds.
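If a retry should avoid the host that just failed, one option is a retry host predicate. The fragment below is a minimal sketch, assuming the previous_hosts predicate extension is available in your Envoy build; the value of host_selection_retry_max_attempts is illustrative, not a recommendation.

```yaml
retry_policy:
  retry_on: 5xx,connect-failure
  num_retries: 3
  # Reject any host that was already tried for this request
  retry_host_predicate:
  - name: envoy.retry_host_predicates.previous_hosts
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.retry.host.previous_hosts.v3.PreviousHostsPredicate
  # How many times host selection may be re-run when the predicate rejects a host
  # (illustrative value)
  host_selection_retry_max_attempts: 3
```

With only two hosts in the cluster, as in the example above, the predicate can quickly run out of untried hosts, at which point Envoy falls back to whatever host was selected last.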
The retry_on field is a comma-separated list of conditions that trigger a retry. Common values include:
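5xx: Any HTTP 5xx response, or no response at all (disconnect, reset, or read timeout).
gateway-error: Only 502, 503, or 504 responses.
connect-failure: Failure to connect to the upstream host, including connect timeouts.
refused-stream: The upstream reset the stream with the HTTP/2 REFUSED_STREAM error code. A plain disconnect or connection reset is covered by the separate reset condition.
retriable-4xx: A retriable 4xx response. Currently this covers only 409 Conflict, not 429 Too Many Requests.
cancelled: A gRPC condition; the gRPC status code in the response is "cancelled".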
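For a status code that none of the named conditions cover, the retriable-status-codes condition can be combined with an explicit list. The sketch below assumes you want to retry on 429; whether that is sensible depends on how the upstream signals rate limiting, so treat the values as placeholders.

```yaml
retry_policy:
  # retriable-status-codes only takes effect together with the list below
  retry_on: retriable-status-codes,connect-failure
  retriable_status_codes: [429]
  num_retries: 2
```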
The per_try_timeout is a powerful knob. With num_retries: 3 and per_try_timeout: 2s, a single request can spend up to (3 + 1) * 2s = 8s in attempts (the initial try plus 3 retries, each taking up to 2s), plus the backoff waits between them. This is why setting per_try_timeout too high can mask actual service slowness. It is usually paired with the route-level timeout, which defaults to 15s and acts as the overall deadline for the request including all retries.
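One way to cap the total time is to set the route's timeout field alongside the retry policy. This is a sketch with an illustrative 5s budget: once the overall timeout fires, the request fails even if retries remain.

```yaml
route:
  cluster: user_service
  # Overall deadline for the request, covering every attempt and backoff wait
  # (illustrative value; the default is 15s)
  timeout: 5s
  retry_policy:
    retry_on: 5xx,connect-failure
    num_retries: 3
    # Deadline for each individual attempt
    per_try_timeout: 2s
```

Here the worst case is bounded by 5s rather than 8s, so at most two full-length attempts fit before the overall deadline cuts off the third.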
The retry_policy is set on the route (or on the virtual host, as a default for its routes), not on the cluster. Because each route names its target cluster, you can still have different retry strategies for different upstream services. For critical services, you might retry more aggressively. For routes where retries could be harmful (e.g., non-idempotent operations that shouldn’t be repeated), you’d leave retries off.
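A sketch of how this can look in practice: a conservative default on the virtual host and a more aggressive policy on one route. The /reports route and report_service cluster are hypothetical names used only for illustration.

```yaml
virtual_hosts:
- name: local_service
  domains: ["*"]
  # Default for every route in this virtual host
  retry_policy:
    retry_on: connect-failure,refused-stream
    num_retries: 2
  routes:
  - match:
      prefix: "/users"
    route:
      cluster: user_service
      # A route-level policy replaces the virtual host default entirely;
      # fields are not merged or inherited
      retry_policy:
        retry_on: 5xx,connect-failure,refused-stream
        num_retries: 3
        per_try_timeout: 2s
  - match:
      prefix: "/reports"          # hypothetical route
    route:
      cluster: report_service     # hypothetical cluster; uses the virtual host default
```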
One aspect often overlooked is how retries interact with idempotency. Envoy doesn’t inherently know if an operation is idempotent. If you retry a POST request that creates a resource, you might end up with duplicates if the first attempt actually succeeded but the response was lost. This is why retry_on should be chosen carefully: it is safest to retry only on conditions where the upstream is known not to have processed the request, such as connect-failure or refused-stream, and to be cautious with broader conditions like 5xx for non-idempotent requests. Where broader retries are needed, design your upstream services to handle duplicate requests gracefully.
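One possible way to express this in configuration is to split a path by HTTP method, retrying reads broadly and writes only on failures that occur before the request reaches the application. This is a sketch, not a complete policy; it assumes GET handlers on user_service are side-effect free.

```yaml
routes:
# Routes are evaluated in order, so the GET-specific route comes first
- match:
    prefix: "/users"
    headers:
    - name: ":method"
      string_match:
        exact: "GET"
  route:
    cluster: user_service
    retry_policy:
      retry_on: 5xx,connect-failure,refused-stream
      num_retries: 3
# Everything else (POST, PUT, DELETE, ...): retry only when the upstream
# never started processing the request
- match:
    prefix: "/users"
  route:
    cluster: user_service
    retry_policy:
      retry_on: connect-failure,refused-stream
      num_retries: 2
```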
The next thing you’ll likely encounter is configuring outlier_detection to actively eject unhealthy hosts that are consistently failing, complementing retries by preventing traffic from being sent to known bad instances.