Envoy is refusing to connect to your upstream services because it’s exhausted its retry budget for that particular host.

Here’s what’s actually happening: Envoy limits how many times it will retry a request to an upstream host before giving up. When this limit is hit, it returns a max_retries_exceeded error to the client, and the access log entry typically carries the URX response flag (upstream retry limit exceeded). This isn’t just a client-side issue; it means Envoy itself couldn’t establish a connection or receive a valid response within its allowed attempts.

Common Causes and Fixes:

  1. Upstream Service Unhealthy/Crashing:

    • Diagnosis: Check the health status of your upstream service instances. On Kubernetes, run kubectl get pods -l app=your-service -o wide and look for pods in CrashLoopBackOff or Error states. Then check the upstream service’s application logs.
    • Fix: Address the root cause of the upstream service instability. This might involve fixing bugs, increasing resource limits (CPU/memory), or ensuring proper application startup.
    • Why it works: If the upstream service is constantly crashing or not starting, Envoy will repeatedly attempt to connect to a non-existent or unhealthy endpoint, quickly exhausting retries.
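
    Active health checking can keep Envoy from sending requests (and retries) to instances that are down. A minimal sketch, assuming the upstream exposes an HTTP health endpoint at /healthz (a hypothetical path; substitute whatever your service provides) and reusing the your_service_cluster name used elsewhere in this article. The intervals and thresholds are illustrative values, not recommendations:

    ```yaml
    # Fragment to merge into the existing cluster definition.
    clusters:
    - name: your_service_cluster
      health_checks:
      - timeout: 1s              # how long to wait for a health check response
        interval: 5s             # time between checks
        unhealthy_threshold: 3   # consecutive failures before a host is marked unhealthy
        healthy_threshold: 2     # consecutive successes before it is marked healthy again
        http_health_check:
          path: /healthz         # hypothetical endpoint, adjust to your service
    ```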
  2. Upstream Service Overloaded (Slow Responses):

    • Diagnosis: Monitor your upstream service’s CPU, memory, and request latency. Tools like Prometheus/Grafana or cloud provider monitoring can show spikes. If Envoy is configured with timeouts (e.g., the cluster’s connect_timeout, or the route’s timeout and per_try_timeout), these might fire before the upstream can respond, triggering a retry.
    • Fix: Scale out your upstream service (more instances) or optimize its performance. If the upstream is just slow but generally healthy, you might temporarily raise the route’s timeout, e.g., timeout: 30s instead of the default 15s. Be cautious: this masks underlying issues.
    • Why it works: If the upstream is too slow to respond within Envoy’s configured timeout, Envoy treats it as a failure, retries, and repeats until the retry budget is spent.
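
    The two route-level timeouts behave differently: per_try_timeout bounds each attempt and can trigger a retry when the retry policy allows it, while timeout bounds the whole request, retries included. A minimal sketch of a route entry, assuming a catch-all prefix match and illustrative values:

    ```yaml
    # Fragment of a route_config's virtual_hosts[].routes list.
    routes:
    - match:
        prefix: "/"
      route:
        cluster: your_service_cluster
        timeout: 30s             # overall limit for the request, retries included (default 15s)
        retry_policy:
          retry_on: "5xx"        # also covers resets and timed-out attempts
          num_retries: 2         # retries per request (default 1)
          per_try_timeout: 10s   # limit for each individual attempt
    ```

    Keeping per_try_timeout well below timeout leaves room for at least one retry before the overall deadline expires.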
  3. Network Connectivity Issues Between Envoy and Upstream:

    • Diagnosis: From the Envoy pod/instance, try to curl the upstream service’s IP and port directly. On Kubernetes, run kubectl exec <envoy-pod-name> -- curl -v <upstream-service-ip>:<upstream-port>. Check firewall rules, security groups, and network policies that might be blocking traffic.
    • Fix: Correct network misconfigurations. Ensure that the network path between Envoy and the upstream service is open and performant. This could involve updating security group rules or Kubernetes NetworkPolicies.
    • Why it works: If packets are being dropped or delayed due to network issues, Envoy won’t receive a successful response and will keep retrying.
  4. Incorrect Upstream Host/Port Configuration in Envoy:

    • Diagnosis: Verify the load_assignment endpoints (the hosts field in older v2 configurations) in your Envoy cluster configuration. For Kubernetes, these are often dynamically populated via EDS or by resolving service DNS. Ensure the IP addresses and ports Envoy is trying to reach are correct and match where your upstream service is actually listening.
    • Fix: Correct the endpoint addresses in your Envoy cluster definition. If using Kubernetes, ensure your service discovery mechanism (e.g., Istio’s ServiceEntry, Kubernetes Service definition) is pointing to the correct endpoints.
    • Why it works: Envoy is programmed to connect to specific addresses. If those addresses are wrong, it will attempt connections to non-existent or incorrect services, leading to immediate failures and retries.
  5. Envoy’s Retry Budget Too Low:

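    • Diagnosis: Examine your Envoy cluster configuration for max_retries under circuit_breakers.thresholds. It defaults to 3 and caps how many retries can be in flight to the cluster at once, across all requests. The number of retries for a single request is set separately by num_retries in the route’s retry_policy (default 1). If you have transient network glitches or brief upstream hiccups, either limit can be exhausted quickly.
    • Fix: Increase max_retries in the cluster’s circuit breaker thresholds. For example, to allow 5 concurrent retries:
      static_resources:
        clusters:
        - name: your_service_cluster
          connect_timeout: 5s
          type: LOGICAL_DNS
          dns_lookup_family: V4_ONLY
          lb_policy: ROUND_ROBIN
          circuit_breakers:
            thresholds:
            - priority: DEFAULT
              max_retries: 5  # Concurrent retries allowed to this cluster, increased from default 3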
          load_assignment:
            cluster_name: your_service_cluster
            endpoints:
            - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: your-upstream-service
                      port_value: 8080
      
    • Why it works: A higher retry limit gives Envoy more chances to succeed if the upstream service is temporarily unavailable or experiencing brief network instability. It does not fix a persistent failure; it only delays the error.
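
    Instead of a fixed number, the circuit breaker can also size the retry limit relative to current traffic through retry_budget. A minimal sketch with illustrative values; per Envoy’s documentation, a configured retry_budget takes the place of max_retries for that threshold, so verify the behavior against the Envoy version you run:

    ```yaml
    # Fragment to merge into the existing cluster definition.
    clusters:
    - name: your_service_cluster
      circuit_breakers:
        thresholds:
        - priority: DEFAULT
          retry_budget:
            budget_percent:
              value: 25.0              # retries may make up to 25% of active requests
            min_retry_concurrency: 5   # always allow at least this many concurrent retries
    ```

    A percentage-based budget scales with load, which helps avoid both starving retries at low traffic and amplifying an outage at high traffic.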
  6. Upstream Service Connection Limits Reached:

    • Diagnosis: Check the upstream application’s logs and its operational metrics for connection counts. Many services have a maximum number of concurrent connections they will accept. If Envoy is trying to connect (and potentially retrying) while the upstream is already at capacity, new connections will be rejected.
    • Fix: Increase the connection limit on your upstream service or scale up the number of upstream service instances.
    • Why it works: If the upstream service actively rejects new connections because it’s at its limit, Envoy will receive a connection refused error, retry, and eventually hit its own retry limit.
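
    On the Envoy side, the cluster’s circuit breaker can cap how many connections Envoy opens, so requests queue in the proxy instead of pushing the upstream past its limit. A minimal sketch, assuming a hypothetical upstream that accepts at most 100 concurrent connections and a single Envoy instance in front of it (the thresholds apply per Envoy process, so divide the limit when several proxies share the upstream):

    ```yaml
    # Fragment to merge into the existing cluster definition.
    clusters:
    - name: your_service_cluster
      circuit_breakers:
        thresholds:
        - priority: DEFAULT
          max_connections: 80        # stay below the assumed upstream limit of 100 (HTTP/1.1 connections)
          max_pending_requests: 200  # requests allowed to wait for a free connection
    ```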
  7. TLS Handshake Failures:

    • Diagnosis: If you’re using TLS between Envoy and the upstream, check Envoy’s logs for TLS-related errors and the upstream service’s logs for connection attempts that fail during the handshake. Ensure certificates are valid, trusted by both sides, and cipher suites are compatible.
    • Fix: Resolve TLS configuration issues. This might involve updating certificates, ensuring correct CA chains are configured, or aligning TLS versions and cipher suites.
    • Why it works: A failed TLS handshake is a connection-level error. Envoy will attempt to retry the connection, and if the TLS issue persists, it will exhaust its retry budget.
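
    For reference, upstream TLS is configured on the cluster through a transport socket. A minimal sketch, assuming the CA bundle lives at a hypothetical path and the upstream certificate is issued for your-upstream-service; the minimum protocol version is an illustrative choice:

    ```yaml
    # Fragment to merge into the existing cluster definition.
    clusters:
    - name: your_service_cluster
      transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
          sni: your-upstream-service              # hostname presented during the handshake
          common_tls_context:
            tls_params:
              tls_minimum_protocol_version: TLSv1_2
            validation_context:
              trusted_ca:
                filename: /etc/envoy/certs/ca.pem # hypothetical path to the CA bundle
    ```

    If the handshake still fails, compare the sni value and the CA bundle against the certificate the upstream actually serves.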

After fixing these, the next error you’ll likely encounter is a 503 Service Unavailable or a 504 Gateway Timeout if the upstream service itself is responding with errors or taking too long after a successful connection.
