Envoy’s UpstreamConnectionFailure response flag (shown as UF in access logs, alongside a 503 response) means Envoy could not establish a connection to the upstream host at all.

Common Causes and Fixes for UpstreamConnectionFailure

  1. Upstream Service Unreachable (Network Policy/Firewall)

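    • Diagnosis: Run curl -v <upstream_host>:<upstream_port> from the Envoy pod. A timeout, "No route to host", or "Network is unreachable" while the upstream is known to be running usually points to a network policy or firewall dropping traffic. "Connection refused", by contrast, usually means the host was reached but nothing is listening on that port (see causes 2 and 4), although some setups reject blocked traffic instead of dropping it. Run kubectl get networkpolicy -n <namespace> in both the Envoy and the upstream namespace to see whether any policy restricts egress from the Envoy pod or ingress to the upstream pod.
    • Fix: Adjust the network policies to allow traffic from the Envoy pods (selected by namespace and pod labels) to the upstream pods on the required port. Both directions must be permitted: egress from Envoy and ingress to the upstream. Standard Kubernetes NetworkPolicy rules are additive and have no ordering. If you use Calico’s own policy resources (projectcalico.org/v3), precedence is controlled by the order field, and a policy with a lower order value is applied first, so check that a deny rule does not take effect before your allow rule.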
    • Why it works: Network policies act as firewalls within the Kubernetes cluster. If a policy is too restrictive, it prevents packets from reaching their destination, manifesting as a connection failure before any application-level handshake can occur.
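
The difference between a refused, a timed-out, and an unreachable connection can also be checked programmatically. The following is a minimal Python sketch, not part of Envoy or kubectl; the function name probe_tcp and the three-second default timeout are assumptions chosen for illustration. It makes a single TCP connection attempt and reports how the attempt ended.

```python
import errno
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # No answer within the timeout: packets are probably dropped silently.
        return "timeout"
    except ConnectionRefusedError:
        # The peer answered with a reset: reachable, but nothing is listening.
        return "refused"
    except socket.gaierror:
        # The hostname could not be resolved at all.
        return "dns_failure"
    except OSError as exc:
        if exc.errno in (errno.ENETUNREACH, errno.EHOSTUNREACH):
            return "unreachable"
        return f"error: {exc}"


# Example call (the port 8080 is a placeholder):
# probe_tcp("my-upstream-service.my-namespace.svc.cluster.local", 8080)
```

Run it from a pod on the same network path as Envoy, for example a debug container that has Python available. A result of timeout suggests traffic is being dropped along the way, while refused suggests the host was reached but nothing is listening on that port.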
  2. Upstream Service Not Running or Crashed

    • Diagnosis: Run kubectl get pods -n <namespace> -l <upstream_selector> and verify that the upstream pods are in a Running state and fully ready (the READY column shows n/n). Check kubectl logs <upstream_pod_name> -n <namespace> for errors; for a container that keeps restarting, add --previous to see the logs of the crashed instance.
    • Fix: If pods are not running, inspect the output of kubectl describe pod <upstream_pod_name> -n <namespace> for causes such as image pull errors, insufficient CPU or memory, or invalid configuration. Correct the underlying issue, then restart the upstream deployment.
    • Why it works: Envoy can’t connect to an upstream service if there are no healthy instances of that service listening on the configured port.
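
When there are many upstream pods, scanning the kubectl output by eye gets tedious. As a rough sketch, the Python function below filters the JSON produced by kubectl get pods -o json down to the pods that are not ready. The function name unready_pods and the sample data are made up for illustration; the fields it reads (status.phase, containerStatuses[].ready, state.waiting.reason) are part of the standard pod status.

```python
import json
from typing import List, Tuple


def unready_pods(pod_list: dict) -> List[Tuple[str, str]]:
    """Return (pod name, reason) for each pod that is not fully ready."""
    problems = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        if phase != "Running":
            problems.append((name, phase))
            continue
        for cs in status.get("containerStatuses", []):
            if not cs.get("ready", False):
                # A waiting container carries a reason such as CrashLoopBackOff.
                waiting = cs.get("state", {}).get("waiting", {})
                problems.append((name, waiting.get("reason", "NotReady")))
                break
    return problems


# Hypothetical, heavily trimmed output of `kubectl get pods -o json`.
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"name": "upstream-a"},
   "status": {"phase": "Running",
              "containerStatuses": [{"ready": true, "state": {"running": {}}}]}},
  {"metadata": {"name": "upstream-b"},
   "status": {"phase": "Running",
              "containerStatuses": [{"ready": false,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}}
]}
""")

for pod_name, reason in unready_pods(SAMPLE):
    print(f"{pod_name}: {reason}")
```

With real data, you would pipe the kubectl output into the script and pass json.load(sys.stdin) to the function instead of the sample.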
  3. Incorrect Upstream Hostname/IP in Envoy Configuration

    • Diagnosis: Examine the Envoy configuration (e.g., bootstrap.yaml, the ConfigMap that holds it, or the VirtualService and DestinationRule resources if using Istio). In the cluster definition for your upstream service, check the address under load_assignment (socket_address.address), or the hosts field in older v2-style configurations. Then test that exact hostname or IP from the Envoy pod, e.g., kubectl exec -it <envoy_pod_name> -n <namespace> -- sh and then curl -v <upstream_hostname>:<upstream_port>. Prefer curl over ping here: a Service ClusterIP is a virtual address and often does not answer ICMP even when the service is healthy.
    • Fix: Correct the hostname or IP address in the Envoy configuration to accurately point to the upstream service’s actual network address. For Kubernetes services, this is typically the service name (e.g., my-upstream-service.my-namespace.svc.cluster.local).
    • Why it works: Envoy relies on its configuration to know where to send traffic. If the destination address is wrong, it will attempt to connect to a non-existent or incorrect IP, resulting in a connection failure.
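
To see at a glance where each cluster points, you can list the configured addresses instead of reading the whole file. The sketch below assumes a statically configured Envoy whose bootstrap has already been loaded into a Python dict (for example from a JSON bootstrap); clusters delivered dynamically over xDS are not covered. The function name cluster_endpoints and the sample values are illustrative only.

```python
from typing import Dict, List, Tuple


def cluster_endpoints(bootstrap: dict) -> Dict[str, List[Tuple[str, int]]]:
    """Map each static cluster name to its configured (address, port) pairs."""
    result = {}
    clusters = bootstrap.get("static_resources", {}).get("clusters", [])
    for cluster in clusters:
        pairs = []
        assignment = cluster.get("load_assignment", {})
        for locality in assignment.get("endpoints", []):
            for lb_endpoint in locality.get("lb_endpoints", []):
                sock = (
                    lb_endpoint.get("endpoint", {})
                    .get("address", {})
                    .get("socket_address", {})
                )
                if "address" in sock:
                    pairs.append((sock["address"], sock.get("port_value")))
        result[cluster.get("name", "<unnamed>")] = pairs
    return result


# Hypothetical bootstrap fragment; the cluster name and port are placeholders.
EXAMPLE = {
    "static_resources": {
        "clusters": [{
            "name": "my_upstream",
            "load_assignment": {"endpoints": [{"lb_endpoints": [{
                "endpoint": {"address": {"socket_address": {
                    "address": "my-upstream-service.my-namespace.svc.cluster.local",
                    "port_value": 8080,
                }}}
            }]}]},
        }]
    }
}

for cluster_name, targets in cluster_endpoints(EXAMPLE).items():
    print(cluster_name, targets)
```

Each address it prints is a value you can then test from the Envoy pod with curl.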
  4. Upstream Service Port Mismatch

    • Diagnosis: Compare the port in Envoy’s cluster definition (port_value) with the upstream’s Kubernetes Service, which you can view with kubectl get svc <upstream_service_name> -n <namespace> -o yaml. If Envoy addresses the Service (by DNS name or ClusterIP), the port must match the Service’s port. If Envoy addresses pod IPs directly, it must match the targetPort, i.e., the containerPort the application listens on.
    • Fix: Update the port in Envoy’s cluster configuration to the Service port or, when connecting to pods directly, to the port the upstream application is actually listening on.
    • Why it works: Even if Envoy can resolve the upstream host, it needs to attempt the connection on the correct port. If the ports don’t align, nothing is listening where Envoy connects, so the attempt is refused by the upstream’s network stack or goes unanswered.
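
The port comparison can be expressed as a small check. This Python sketch takes the Service as a dict (as produced by kubectl get svc -o json) and the port from Envoy’s cluster definition; the function name check_port and the targets_service parameter are assumptions made for this example. A named targetPort (a string) would have to be resolved against the pod’s containerPort names first, which the sketch does not attempt.

```python
def check_port(service: dict, envoy_port: int, targets_service: bool = True) -> bool:
    """Return True if envoy_port matches one of the Service's ports.

    targets_service=True:  Envoy addresses the Service itself (DNS name or
                           ClusterIP), so compare against spec.ports[].port.
    targets_service=False: Envoy addresses pod IPs directly, so compare
                           against spec.ports[].targetPort.
    """
    for entry in service.get("spec", {}).get("ports", []):
        port = entry.get("port")
        # Kubernetes defaults targetPort to the value of port when it is omitted.
        target = entry.get("targetPort", port)
        expected = port if targets_service else target
        if expected == envoy_port:
            return True
    return False


# Hypothetical Service exposing port 80 and forwarding to containerPort 8080.
SERVICE = {"spec": {"ports": [{"name": "http", "port": 80, "targetPort": 8080}]}}

print(check_port(SERVICE, 80))                          # Envoy -> Service
print(check_port(SERVICE, 8080, targets_service=False)) # Envoy -> pod IP
```

A False result for the mode you actually use is a strong hint that the cluster definition carries the wrong port.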
  5. Upstream Service Overloaded (Connection Refused or Timeout during TCP Handshake)

    • Diagnosis: UpstreamConnectionFailure indicates a failure before the TCP handshake completes, and a heavily loaded upstream can cause exactly that: when it cannot accept new connections fast enough, attempts are dropped or rejected and the handshake does not finish within the connect_timeout configured for the Envoy cluster. Check upstream service metrics for high CPU/memory usage, thread saturation, or excessive connection counts. Use kubectl top pod <upstream_pod_name> -n <namespace> and monitor application-level metrics.
    • Fix: Scale up the upstream service (increase replica count) or optimize its performance to handle the incoming load. Ensure the upstream application is configured to accept a sufficient number of concurrent connections.
    • Why it works: An overloaded upstream service might not have the resources to accept new TCP connections. On Linux, a full listen backlog typically makes the kernel drop the handshake packets, so the connection attempt times out; the kernel answers with a reset ("connection refused") only if configured to do so via net.ipv4.tcp_abort_on_overflow. Adding capacity lets handshakes complete again.
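
Overload tends to show up as intermittent failures or slow handshakes rather than a clean, repeatable error, so a single probe can be misleading. The following sketch repeats the connection attempt and summarizes the results; the function name sample_connects, the attempt count, and the one-second timeout are arbitrary choices for illustration, not recommended values.

```python
import socket
import time


def sample_connects(host: str, port: int, attempts: int = 20,
                    timeout: float = 1.0) -> dict:
    """Open `attempts` TCP connections one after another and summarize them."""
    ok = 0
    failed = 0
    slowest = 0.0
    for _ in range(attempts):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                ok += 1
        except OSError:
            # Covers refused, timed-out, and unreachable attempts alike.
            failed += 1
        slowest = max(slowest, time.monotonic() - start)
    return {"ok": ok, "failed": failed, "slowest_s": slowest}


# Example call (host and port are placeholders):
# sample_connects("my-upstream-service.my-namespace.svc.cluster.local", 8080)
```

A mix of successes and failures, or a slowest_s value close to the timeout, is consistent with an upstream that is struggling to accept connections.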
  6. DNS Resolution Failure for Upstream Host

    • Diagnosis: From the Envoy pod, try resolving the upstream service’s hostname using nslookup <upstream_hostname> or dig <upstream_hostname>. If these commands fail or return incorrect IPs, DNS is the problem.
    • Fix: Ensure the Kubernetes DNS (like CoreDNS) is functioning correctly within the cluster. Verify that the upstream service’s DNS record is correctly registered. Check the /etc/resolv.conf file inside the Envoy pod to ensure it points to the correct cluster DNS service.
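    • Why it works: For clusters that reference a hostname (e.g., clusters of type STRICT_DNS or LOGICAL_DNS), Envoy uses DNS to translate the upstream hostname into IP addresses before connecting. If resolution fails or returns stale or wrong addresses, Envoy has no valid destination for the traffic, and the connection attempt fails.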
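
If nslookup and dig are not installed in the pod, the same check can be done with the resolver the operating system provides. This is a minimal sketch; the function name resolve and its lookup parameter (which exists only so the resolver can be swapped out in tests) are assumptions for this example. Note that it uses the pod’s system resolver, which is not necessarily identical to the resolver Envoy itself is configured with.

```python
import socket
from typing import List


def resolve(hostname: str, lookup=socket.getaddrinfo) -> List[str]:
    """Return the sorted, de-duplicated IPs for hostname, or [] on failure."""
    try:
        infos = lookup(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The name could not be resolved via /etc/resolv.conf and friends.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


# Example call:
# resolve("my-upstream-service.my-namespace.svc.cluster.local")
```

Compare the returned addresses with the ClusterIP shown by kubectl get svc; an empty list or a mismatch points to DNS as the cause.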

Once all of these are fixed, the next response flag you are likely to encounter is NoHealthyUpstream (UH), which indicates that although the upstream network endpoints are reachable, Envoy has no healthy hosts in the cluster to handle the request.
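
To confirm which flag you are actually seeing, you can tally the response flags in Envoy’s access log, where UpstreamConnectionFailure appears as UF and NoHealthyUpstream as UH. The sketch below assumes the default access log format, in which the response code and the response flags directly follow the quoted request line; a custom format would need a different pattern. The function name count_response_flags and the log lines are invented for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Matches: "<request line>" <response code> <response flags>
_FLAGS_RE = re.compile(r'" (\d+) (\S+) ')


def count_response_flags(lines: Iterable[str]) -> Counter:
    """Count each response flag seen in default-format access log lines."""
    counts = Counter()
    for line in lines:
        match = _FLAGS_RE.search(line)
        if not match:
            continue
        # Several flags on one request are comma-separated; "-" means none.
        for flag in match.group(2).split(","):
            if flag != "-":
                counts[flag] += 1
    return counts


# Made-up log lines in the shape of the default format.
LOG = [
    '[2024-01-01T00:00:00.000Z] "GET /a HTTP/1.1" 503 UF 0 91 3 - "-" "curl/8.0" "id-1" "svc" "10.0.0.5:8080"',
    '[2024-01-01T00:00:01.000Z] "GET /a HTTP/1.1" 200 - 0 12 4 3 "-" "curl/8.0" "id-2" "svc" "10.0.0.5:8080"',
    '[2024-01-01T00:00:02.000Z] "GET /a HTTP/1.1" 503 UH 0 19 0 - "-" "curl/8.0" "id-3" "svc" "-"',
]

print(count_response_flags(LOG))
```

If the UF count drops to zero after a fix while UH starts to appear, the connection problem is solved and the remaining issue lies with host health.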
