Envoy’s downstream connection terminated because the upstream service it was trying to talk to either actively refused the connection or closed it prematurely. This is interesting because Envoy is the downstream client from the perspective of the upstream service, and this error means the upstream service decided it didn’t want to talk to Envoy anymore, or at least not right then.
1. Upstream Service Overloaded/Crashing
This is the most common culprit. The upstream service is so swamped it can’t even accept new connections, or it’s crashing and dropping existing ones.
- Diagnosis: Check the upstream service’s logs for errors like "too many open files," "resource temporarily unavailable," or stack traces indicating crashes. Monitor its CPU, memory, and network socket usage. If using Kubernetes, `kubectl top pod <pod-name> -n <namespace>` and `kubectl logs <pod-name> -n <namespace>` are your friends.
- Fix: Scale up the upstream service. If in Kubernetes, increase the `replicas` count in your Deployment or StatefulSet. If it’s a standalone binary, increase the number of worker processes or threads. This gives the service more capacity to handle incoming connections.
- Why it works: More instances or resources mean less load per instance, allowing it to accept and process connections without timing out or crashing.
2. Incorrect Upstream Service Port/IP Configuration
Envoy is trying to connect to a port or IP address that the upstream service isn’t actually listening on.
- Diagnosis: Verify the `address` and `port_value` fields (under `socket_address`) in your Envoy cluster configuration. Then, on the upstream service itself, check which IP address and port it’s bound to. For a `curl` check from the Envoy node (or a pod in the same network if using Kubernetes), try `curl -v <upstream_ip>:<upstream_port>`. You should see a successful `HTTP/1.1 200 OK` or similar, not a "Connection refused."
- Fix: Correct the `address` and `port_value` in your Envoy cluster configuration to match the actual listening address and port of the upstream service. For example, if your service is listening on `0.0.0.0:8080`, ensure Envoy is configured to point to port 8080 on that host.
- Why it works: Envoy will now send its connection requests to the correct network endpoint where the upstream service is actively listening.
3. Network Connectivity Issues (Firewall, Security Group, Network Policy)
A firewall, security group, or Kubernetes Network Policy is blocking connections between Envoy and the upstream service.
- Diagnosis: From the Envoy pod/instance, try to `telnet <upstream_ip> <upstream_port>` or `nc -vz <upstream_ip> <upstream_port>`. If these fail, the network path is blocked. Check firewall rules on any intervening network devices, cloud provider security groups (e.g., AWS Security Groups, Azure NSGs), or Kubernetes Network Policies.
- Fix: Adjust firewall rules, security groups, or Network Policies to explicitly allow traffic from Envoy’s IP address/range to the upstream service’s IP address/range on the required port. For example, a Kubernetes Network Policy might look like this:

  ```yaml
  apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: allow-envoy-to-upstream
    namespace: default
  spec:
    podSelector:
      matchLabels:
        app: my-upstream-app # Label of your upstream pods
    policyTypes:
      - Ingress
    ingress:
      - from:
          - podSelector:
              matchLabels:
                app: envoy # Label of your Envoy pods
        ports:
          - protocol: TCP
            port: 8080 # The port your upstream service listens on
  ```

- Why it works: This explicitly permits the network packets to traverse from Envoy to the upstream service, satisfying the connectivity requirement.
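The `nc -vz` check can also be scripted when you need to test many endpoints. The sketch below is a minimal Python version, not part of Envoy or any tool mentioned here; the function name `probe` and the 3-second default timeout are assumptions. A "refused" result usually means the host was reachable but nothing accepted the connection on that port (see cause 2), while a "timeout" more often points to packets being dropped somewhere along the path.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connection and classify the outcome."""
    try:
        # create_connection performs the TCP handshake; the socket is
        # closed again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer answered with a reset: reachable, but no listener.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout: often a silent drop.
        return "timeout"
    except OSError as exc:
        # Anything else (no route, DNS failure, ...) is reported as-is.
        return f"error: {exc}"
```

Run it from the Envoy host or pod, for example `probe("<upstream_ip>", 8080)`, substituting your own address and port.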
4. Upstream Service Not Ready/Initialized
The upstream service’s pods are running, but the application inside hasn’t fully started and isn’t yet accepting connections. This is common during deployments or restarts.
- Diagnosis: Check the upstream service’s application logs for initialization messages. If in Kubernetes, run `kubectl get pod <pod-name> -n <namespace> -o yaml` and look at the `readinessProbe` definition and the pod’s `Ready` condition. If the probe is failing, the pod won’t be considered ready to receive traffic.
- Fix: Ensure the upstream service has a robust readiness probe configured that accurately reflects when the application is ready to accept connections. Increase the `initialDelaySeconds` and `periodSeconds` for the readiness probe if the application takes longer to start.
- Why it works: Kubernetes uses readiness probes to decide which pods are listed as endpoints of a Service. When Envoy discovers its upstream hosts through Kubernetes (directly or via a control plane), a proper probe keeps uninitialized pods out of its load-balancing pool, so Envoy does not send requests to them.
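On the application side, a readiness endpoint can be as small as a flag that flips once initialization finishes. The following is a minimal Python sketch using only the standard library; the `/ready` path, the `ready` flag, and the `start_server` helper are illustrative names, not something Kubernetes or Envoy requires.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set this once the application has finished initializing
# (caches warmed, database connections opened, and so on).
ready = threading.Event()


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ready":
            # 503 tells the prober to keep traffic away for now.
            status = 200 if ready.is_set() else 503
        else:
            status = 404
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep probe traffic out of the logs


def start_server(port: int = 8080) -> ThreadingHTTPServer:
    """Serve in a background thread and return the server object."""
    server = ThreadingHTTPServer(("0.0.0.0", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A readiness probe would then issue an HTTP GET against `/ready` on the chosen port, and the application would call `ready.set()` at the end of its startup sequence.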
5. Upstream Service Graceful Shutdown Issues
The upstream service is shutting down but not handling SIGTERM (or equivalent) correctly. Instead of finishing existing requests and closing gracefully, it’s abruptly terminating connections.
- Diagnosis: Observe the upstream service’s logs during a deployment or restart. Look for messages indicating it’s shutting down. If you see errors about active connections being terminated before the process exits, it’s likely a shutdown issue.
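- Fix: Implement proper graceful shutdown handling in the upstream application. This typically involves:
  - Catching `SIGTERM` (or `SIGINT`).
  - Stopping the listener from accepting new connections.
  - Waiting for a configured timeout (e.g., 30 seconds) for existing requests to complete.
  - Closing the listener and exiting.

  For example, in Node.js with Express:

  ```javascript
  const express = require('express');

  const app = express();
  const PORT = process.env.PORT || 8080; // assumed default port
  const server = app.listen(PORT, () => console.log('Server listening'));

  process.on('SIGTERM', () => {
    console.log('SIGTERM signal received: closing HTTP server');
    // Stop accepting new connections; the callback runs once open connections have ended.
    server.close(() => {
      console.log('HTTP server closed');
      process.exit(0);
    });
    // Force an exit if in-flight requests have not finished within 30 seconds.
    setTimeout(() => process.exit(1), 30000).unref();
  });
  ```
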
- Why it works: By properly handling shutdown signals, the upstream service ensures that ongoing requests are completed before closing connections, preventing premature termination errors for Envoy.
6. Upstream Service Resource Exhaustion (File Descriptors, Ephemeral Ports)
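The upstream service has run out of file descriptors, so it can no longer accept Envoy’s incoming connections: every accepted socket consumes one descriptor. Ephemeral ports are only consumed by outgoing connections the service itself opens (for example, to a database or another API), but exhausting them can stall request handling and lead to the same symptom.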
- Diagnosis: Check the upstream service’s OS-level limits. Use `ulimit -n` to see the file descriptor limit and `sysctl net.ipv4.ip_local_port_range` for ephemeral ports. Note that `ulimit -n` reports the limit of the current shell; for the running service, check `/proc/<upstream_pid>/limits`. If these are low and the service is experiencing high connection churn, this could be the cause. Monitor `netstat -anp | grep <upstream_pid>` for a large number of `CLOSE_WAIT` or `TIME_WAIT` states.
- Fix: Increase the `nofile` limit in `/etc/security/limits.conf` or via systemd service unit files (`LimitNOFILE=`) for the upstream service’s user. Also, consider increasing the ephemeral port range if it’s very small. Example for `limits.conf`:

  ```
  <upstream_user> soft nofile 65536
  <upstream_user> hard nofile 65536
  ```

  And for `sysctl`:

  ```
  sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"
  ```

  (Remember to make `sysctl` changes persistent by editing `/etc/sysctl.conf`.)
- Why it works: Increasing these OS-level limits provides the upstream service with more resources to manage its network connections, preventing it from failing to accept new ones due to exhaustion.
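To see how close a running process is to its descriptor limit, you can compare the entries under `/proc/<pid>/fd` with the "Max open files" row of `/proc/<pid>/limits`. Below is a rough Python sketch, assuming Linux and enough permission to read the target process’s `/proc` entries; `fd_usage` is a hypothetical helper name.

```python
import os


def fd_usage(pid="self"):
    """Return (open_fds, soft_limit); soft_limit is None if unlimited or not found."""
    # Each entry in /proc/<pid>/fd is one open descriptor (sockets included).
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))

    soft_limit = None
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            if line.startswith("Max open files"):
                # Columns: name (three words), soft limit, hard limit, units.
                soft = line.split()[3]
                if soft != "unlimited":
                    soft_limit = int(soft)
                break
    return open_fds, soft_limit


if __name__ == "__main__":
    used, limit = fd_usage()
    print(f"open fds: {used}, soft limit: {limit}")
```

Pass the upstream service’s PID instead of the default `"self"` to inspect that process; a count that sits close to the soft limit under load suggests the limit is the bottleneck.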
The next error you’ll likely see after fixing this is UpstreamConnectionTermination if the upstream service is still actively closing connections, or potentially a different Envoy error if the underlying problem shifts.