Envoy’s UpstreamConnectionFailure response flag (shown as UF in access logs) means a connection to an upstream service couldn’t be established at all.
Common Causes and Fixes for UpstreamConnectionFailure
- Upstream Service Unreachable (Network Policy/Firewall)
  - Diagnosis: Run `curl -v <upstream_host>:<upstream_port>` from the Envoy pod. If this fails with "Connection refused" or "Network is unreachable", or hangs until it times out, even though the upstream should be running, it’s likely a network policy or firewall blocking traffic. Check `kubectl get networkpolicy -n <namespace>` to see if any policies are restricting egress from the Envoy pod or ingress to the upstream pod.
  - Fix: Adjust network policies to allow traffic from the Envoy pod’s namespace and pod labels (or service account, where the CNI supports it, as Calico does) to the upstream service’s namespace and selector on the specified port. If you use Calico’s own `NetworkPolicy` resource, you might also need to set its `order` field so that the policy is evaluated in the intended order.
  - Why it works: Network policies act as firewalls within the Kubernetes cluster. If a policy is too restrictive, it prevents packets from reaching their destination, manifesting as a connection failure before any application-level handshake can occur.
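As a rough sketch of the connectivity check above, the result of the `curl` probe can be classified by its exit code. The exit codes used here (6, 7, 28) are curl’s documented ones; the function names, the hint wording, and the 5-second timeout are illustrative assumptions, and the wrapper assumes the Envoy container image actually ships `curl`.

```shell
# Map a curl exit code to the most likely connection-level cause.
# The hint strings are illustrative, not output from any real tool.
classify_curl_exit() {
  case "$1" in
    0)  echo "connected: the upstream accepted the connection" ;;
    6)  echo "dns: the hostname could not be resolved" ;;
    7)  echo "refused-or-unreachable: nothing is listening, or the network rejected the attempt" ;;
    28) echo "timeout: packets are probably being dropped (network policy or firewall)" ;;
    *)  echo "other: curl exit code $1 (see the curl man page; the TCP connection may still have succeeded)" ;;
  esac
}

# Hypothetical wrapper: run the probe inside the Envoy pod and classify it.
# $1 = namespace, $2 = Envoy pod name, $3 = <upstream_host>:<upstream_port>
probe_upstream() {
  kubectl exec -n "$1" "$2" -- \
    curl -sS -o /dev/null --connect-timeout 5 "$3"
  classify_curl_exit "$?"
}
```

A `timeout` result points more strongly at a dropped-packet cause such as a network policy than a `refused-or-unreachable` result does, since policies usually discard traffic silently.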
- Upstream Service Not Running or Crashed
  - Diagnosis: Run `kubectl get pods -n <namespace> -l <upstream_selector>` to verify the upstream pods are in a `Running` state. Check `kubectl logs <upstream_pod_name> -n <namespace>` for crash loops or errors; add `--previous` to see the logs of a container that has already restarted.
  - Fix: If pods are not running, inspect the `kubectl describe pod <upstream_pod_name> -n <namespace>` output for reasons like image pull errors, resource constraints (CPU/memory), or invalid configurations. Correct the underlying issue and restart the upstream deployment.
  - Why it works: Envoy can’t connect to an upstream service if there are no healthy instances of that service listening on the configured port.
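The pod check above can be narrowed to just the problem pods with a small filter. This is a minimal sketch that assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...); the function name `unhealthy_pods` is hypothetical.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print
# "<name> <status> <ready>" for every pod that is not Running or whose
# READY count is below its container count.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1 " " $3 " " $2
  }'
}
```

It would be used as `kubectl get pods -n <namespace> -l <upstream_selector> --no-headers | unhealthy_pods`. An empty result means every selected pod is running and ready, so the cause is more likely elsewhere.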
- Incorrect Upstream Hostname/IP in Envoy Configuration
  - Diagnosis: Examine the Envoy configuration (e.g., `bootstrap.yaml`, or the `VirtualService` and `DestinationRule` resources if using Istio). Specifically, check the `address` field (or `hosts`, in older configurations) within the `cluster` definition for your upstream service. Then, try to `ping` or `curl` that exact hostname/IP from a pod in the same network namespace as Envoy (e.g., `kubectl exec -it <envoy_pod_name> -n <namespace> -- sh` and then `ping <upstream_hostname>`). Prefer `curl` for Service addresses, since ClusterIPs often do not answer ICMP.
  - Fix: Correct the hostname or IP address in the Envoy configuration to accurately point to the upstream service’s actual network address. For Kubernetes services, this is typically the Service’s DNS name (e.g., `my-upstream-service.my-namespace.svc.cluster.local`).
  - Why it works: Envoy relies on its configuration to know where to send traffic. If the destination address is wrong, it will attempt to connect to a non-existent or incorrect IP, resulting in a connection failure.
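To see which addresses Envoy is actually dialing, rather than what the configuration is believed to say, one option is the admin interface’s `/clusters` endpoint, which prints per-host lines of the form `<cluster>::<ip:port>::<stat>::<value>`. The sketch below filters that output for hosts with a non-zero `cx_connect_fail` counter. The function name is hypothetical, and the `::` split assumes IPv4 host addresses.

```shell
# Read Envoy admin `/clusters` text output on stdin and print
# "<cluster> <ip:port> <count>" for each upstream host whose
# cx_connect_fail counter is greater than zero.
# Assumes IPv4 addresses: an IPv6 address containing "::" would break the split.
hosts_with_connect_failures() {
  awk -F'::' '$3 == "cx_connect_fail" && $4 + 0 > 0 { print $1 " " $2 " " $4 }'
}
```

A possible invocation is `kubectl exec -n <namespace> <envoy_pod_name> -- curl -s localhost:<admin_port>/clusters | hosts_with_connect_failures`, where `<admin_port>` is whatever port the admin listener is bound to in your setup. An address in the output that does not match the upstream’s real IP or port points to this cause or the next one.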
- Upstream Service Port Mismatch
  - Diagnosis: Verify the port (`port_value`) in Envoy’s cluster definition against the upstream. If Envoy addresses the Kubernetes Service, it must match the Service’s `port`; if Envoy addresses pods directly, it must match the `targetPort` (the pod’s `containerPort`). You can check the Kubernetes Service with `kubectl get svc <upstream_service_name> -n <namespace> -o yaml`.
  - Fix: Update the port in Envoy’s cluster configuration to match the port the upstream is actually listening on at the address Envoy uses.
  - Why it works: Even if Envoy can resolve the upstream host, it needs to attempt the connection on the correct port. If the ports don’t align, the connection will be refused by the upstream’s network stack.
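The port comparison above can be sketched as two small helpers: one that lists each Service port with its `targetPort`, and one that classifies the port Envoy is configured with. Both function names and the result wording are hypothetical; the `kubectl` call uses standard `jsonpath` output.

```shell
# Print "<port><TAB><targetPort>" for every port of a Service.
# $1 = namespace, $2 = Service name
service_ports() {
  kubectl get svc "$2" -n "$1" \
    -o jsonpath='{range .spec.ports[*]}{.port}{"\t"}{.targetPort}{"\n"}{end}'
}

# Classify the port Envoy uses ($1) against service_ports output on stdin.
check_envoy_port() {
  awk -v want="$1" '
    $1 == want { print "ok: Service port " $1 " forwards to targetPort " $2; found = 1 }
    $1 != want && $2 == want { print "targetPort-only: " want " is only valid when addressing pods directly"; found = 1 }
    END { if (!found) print "mismatch: no Service port or targetPort equals " want }
  '
}
```

For example, `service_ports <namespace> <upstream_service_name> | check_envoy_port 8080` reports whether 8080 is a Service port, only a pod-level port, or neither.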
- Upstream Service Overloaded (Connection Refused or Timeout during TCP Handshake)
  - Diagnosis: While `UpstreamConnectionFailure` typically indicates a failure before a successful TCP handshake, in high-load scenarios the upstream might be so swamped that it rejects new connections or the TCP handshake times out very early. Check upstream service metrics for high CPU/memory usage, thread saturation, or excessive connection counts. Use `kubectl top pod <upstream_pod_name> -n <namespace>` and monitor application-level metrics.
  - Fix: Scale up the upstream service (increase replica count) or optimize its performance to handle the incoming load. Ensure the upstream application is configured to accept a sufficient number of concurrent connections.
  - Why it works: An overloaded upstream service might not have the resources to accept new TCP connections. Once it falls behind, the kernel drops or resets new connection attempts, so the client sees "connection refused" or a timeout before the handshake can complete.
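One way to check whether the upstream is failing to accept connections fast enough is to look at its listening sockets. On Linux, `ss -ltn` reports the current accept-queue length in the Recv-Q column and the configured backlog in the Send-Q column for sockets in the LISTEN state. The filter below is a sketch under that assumption; the function name is hypothetical, and it assumes `ss` is available in the upstream container.

```shell
# Read `ss -ltn` output on stdin and print "<local address> <queued>/<backlog>"
# for every listening socket whose accept queue is at or above its backlog.
# The header line is skipped because its first field is not "LISTEN".
full_accept_queues() {
  awk '$1 == "LISTEN" && $3 + 0 > 0 && $2 + 0 >= $3 + 0 { print $4 " " $2 "/" $3 }'
}
```

It could be run as `kubectl exec -n <namespace> <upstream_pod_name> -- ss -ltn | full_accept_queues`. Any output suggests the application is not draining its accept queue, which fits the overload case described above.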
- DNS Resolution Failure for Upstream Host
  - Diagnosis: From the Envoy pod, try resolving the upstream service’s hostname using `nslookup <upstream_hostname>` or `dig <upstream_hostname>`. If these commands fail or return incorrect IPs, DNS is the problem.
  - Fix: Ensure the Kubernetes DNS (like CoreDNS) is functioning correctly within the cluster. Verify that the upstream service’s DNS record is correctly registered. Check the `/etc/resolv.conf` file inside the Envoy pod to ensure it points to the correct cluster DNS service.
  - Why it works: Envoy uses DNS to translate upstream service hostnames into IP addresses. If DNS resolution fails, Envoy cannot determine where to send the traffic, leading to a connection failure.
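The `/etc/resolv.conf` check above can be sketched as a comparison between the pod’s nameservers and the cluster DNS Service IP. The helper names are hypothetical, and the usage line assumes the DNS Service is named `kube-dns` in `kube-system`, which is common but not universal. Clusters that use a node-local DNS cache will list a different nameserver.

```shell
# Print the nameserver addresses found in resolv.conf content on stdin.
resolv_nameservers() {
  awk '$1 == "nameserver" { print $2 }'
}

# Succeed if the expected DNS Service IP ($1) is one of the nameservers
# in the resolv.conf content on stdin; fail otherwise.
uses_cluster_dns() {
  resolv_nameservers | grep -qxF -- "$1"
}
```

A possible use is to fetch the expected IP with `kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'` and then run `kubectl exec -n <namespace> <envoy_pod_name> -- cat /etc/resolv.conf | uses_cluster_dns <dns_ip>`.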
The next error you’ll likely encounter once all of these are fixed is a different response flag such as NoHealthyUpstream (UH), indicating that while the network path to the upstream works, the cluster has no healthy instances of the service available to handle the request.