Envoy reports upstream_reset_before_response_started, usually alongside a 503, when the connection carrying a request was reset or closed before Envoy received any part of the response from the upstream.

Here are the common culprits:

1. Client Abruptly Closing the Connection: This is a frequent suspect. The client application (browser, another service, etc.) sends a request to Envoy and then tears down the TCP connection before Envoy can begin sending back its response. This could be due to a client-side timeout, a user closing a tab, or a bug in the client.
   * Diagnosis: Look at your client logs. If you see connection resets or timeouts on the client’s side after the request was sent but before a response arrived, this is likely your issue. In Envoy’s access logs, a client-initiated disconnect normally carries the DC (downstream connection termination) response flag, whereas a 503 tagged upstream_reset_before_response_started points at the upstream side, so compare the two before blaming the client.
   * Fix: This is often a client-side problem, so start by debugging the client application’s connection handling and timeouts. On Envoy’s side, the relevant setting is the downstream (client-facing) idle timeout, common_http_protocol_options.idle_timeout on the HTTP connection manager. It controls when Envoy itself closes an idle connection; it does not prevent the client from closing the connection before a response.

   ```yaml
   static_resources:
     listeners:
     - name: listener_0
       address:
         socket_address:
           address: 0.0.0.0
           port_value: 8080
       filter_chains:
       - filters:
         - name: envoy.filters.network.http_connection_manager
           typed_config:
             "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
             stat_prefix: ingress_http
             route_config:
               name: local_route
               virtual_hosts:
               - name: local_service
                 domains: ["*"]
                 routes:
                 - match:
                     prefix: "/"
                   route:
                     cluster: some_upstream_cluster
             http_filters:
             - name: envoy.filters.http.router
               typed_config:
                 "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
             # Downstream idle timeout (Envoy uses 1 hour when this is unset)
             common_http_protocol_options:
               idle_timeout: 300s  # Example: 5 minutes
   ```

   * Why it works: The downstream idle timeout determines how long Envoy keeps a connection with no active requests open. Envoy’s default is 1 hour, so setting it explicitly mainly matters when a shorter value was configured and Envoy was closing connections the client still expected to reuse.
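To see which side ended a request, it helps to log the response flags and response code details for every request. The fragment below is a minimal sketch, assuming a stdout access logger added to the HTTP connection manager; the format string is an illustrative choice and not part of the original configuration.

```yaml
# Fragment: place under the HttpConnectionManager typed_config.
access_log:
- name: envoy.access_loggers.stdout
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
    log_format:
      text_format_source:
        # RESPONSE_FLAGS shows short codes such as DC or UR;
        # RESPONSE_CODE_DETAILS carries strings like upstream_reset_before_response_started.
        inline_string: "[%START_TIME%] \"%REQ(:METHOD)% %REQ(:PATH)%\" %RESPONSE_CODE% %RESPONSE_FLAGS% %RESPONSE_CODE_DETAILS% %UPSTREAM_HOST% %DURATION%\n"
```

With this in place, a line containing DC suggests the downstream client went away, while UR or UF together with upstream_reset_before_response_started suggests the upstream connection is the place to look.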

2. Upstream Service Crashing or Resetting: The upstream service that Envoy is proxying to might be crashing, restarting, or explicitly closing the connection after Envoy has established it but before the upstream has had a chance to send any data back. This can happen if the upstream service encounters an unhandled exception during request processing or if it has its own internal timeouts.
   * Diagnosis: Examine the logs of your upstream service. Look for errors, crashes, or sudden connection closures. Envoy’s access logs will again show upstream_reset_before_response_started.
   * Fix: Address the root cause in the upstream service. If it’s a crash, fix the bug. If it’s a timeout, consider increasing the upstream’s processing timeout if feasible, or optimize the upstream service’s performance. There isn’t a direct Envoy configuration fix for an upstream crashing, but ensuring upstream health is key.
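Envoy cannot repair a crashing upstream, but while the root cause is being fixed, a route-level retry policy can hide occasional resets from clients. The fragment below is a sketch of that mitigation, not a fix; the retry count and per-try timeout are illustrative assumptions, and retrying is only safe for requests the upstream can process more than once.

```yaml
# Fragment: one entry of route_config.virtual_hosts[].routes[].
- match:
    prefix: "/"
  route:
    cluster: some_upstream_cluster
    retry_policy:
      # "reset" covers upstream disconnects and resets before a response;
      # "connect-failure" covers failed or timed-out connection attempts.
      retry_on: "reset,connect-failure"
      num_retries: 2        # illustrative value
      per_try_timeout: 2s   # illustrative value
```

If the retried attempts fail as well, the client still receives the 503, so the access logs remain the place to confirm whether the upstream has actually stabilized.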

3. Network Intermediaries Dropping Connections: Firewalls, load balancers, or other network devices between Envoy and the upstream service might be aggressively closing idle TCP connections or connections they deem suspicious. These devices often have their own connection tracking and timeouts that can be shorter than Envoy’s.
   * Diagnosis: Check the logs of any network devices between Envoy and your upstream. Look for connection reset or timeout messages that correlate with Envoy’s errors. You can also try to ping or traceroute to the upstream from the Envoy instance to check basic network connectivity and latency.
   * Fix: Adjust the timeout settings on the intervening network devices to be more permissive, or ensure they are configured to properly handle the expected connection lifetimes. This is often a task for your network operations team.
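When the device timeouts cannot be changed quickly, one complementary option on the Envoy side is TCP keepalive on upstream connections, so that idle connections still produce traffic the intermediary can see. This is a sketch under assumed values: the probe timings below are illustrative and would need to sit below the idle timeout of whatever device is dropping connections.

```yaml
# Fragment: add to the existing cluster definition.
clusters:
- name: some_upstream_cluster
  # ... connect_timeout, type, load_assignment as already configured ...
  upstream_connection_options:
    tcp_keepalive:
      keepalive_time: 60      # seconds of idleness before the first probe (illustrative)
      keepalive_interval: 10  # seconds between probes (illustrative)
      keepalive_probes: 3     # unanswered probes before the connection is considered dead
```

Keepalive probes only help with devices that track idle time at the TCP level; an intermediary that resets connections for other reasons still has to be fixed at the device.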

4. Envoy’s Upstream Connection Timeout: Less commonly, Envoy’s timeout for establishing a connection to the upstream expires before the TCP handshake completes, so the request fails before any response can start. This is more likely when the upstream is slow to accept connections because of high load or network latency.
   * Diagnosis: Check the cluster’s connect_timeout. If it’s set very low (e.g., 1s) and your upstream is experiencing high load or network latency, this could be a factor.
   * Fix: Increase the connect_timeout in your Envoy cluster configuration.

   ```yaml
   static_resources:
     clusters:
     - name: some_upstream_cluster
       connect_timeout: 5s  # Example value; Envoy uses 5s when this is unset
       type: LOGICAL_DNS
       dns_lookup_family: V4_ONLY
       lb_policy: ROUND_ROBIN
       load_assignment:
         cluster_name: some_upstream_cluster
         endpoints:
         - lb_endpoints:
           - endpoint:
               address:
                 socket_address:
                   address: upstream-service
                   port_value: 9000
   ```

   * Why it works: This gives Envoy more time to establish a TCP connection with the upstream before giving up on the attempt.

5. Keep-Alive Issues: If you’re using HTTP/1.1 keep-alive connections and Envoy and the upstream service disagree about how long an idle connection stays open, or a network device interferes, one end may close the connection while the other still expects it to be open. A typical case is Envoy reusing a pooled connection just as the upstream closes it for being idle.
   * Diagnosis: This is harder to diagnose directly. Look for patterns in when the errors occur. If they follow a period of inactivity, typically on the first request after a quiet spell, keep-alive could be involved. Check Envoy’s downstream and upstream idle timeouts, the upstream server’s keep-alive timeout, and any relevant network device timeouts.
   * Fix: Make the idle timeouts consistent across Envoy (downstream and upstream) and your upstream service. Downstream, this is common_http_protocol_options.idle_timeout on the HTTP connection manager. Upstream, it is common_http_protocol_options.idle_timeout in the cluster’s HttpProtocolOptions (under typed_extension_protocol_options), which should be shorter than the upstream server’s own keep-alive timeout so that Envoy retires idle connections first. The cluster’s cleanup_interval does not help here: it governs stale-host removal for ORIGINAL_DST clusters, not connection lifetimes.
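For the upstream side of the keep-alive mismatch, the idle timeout is set per cluster through the HTTP protocol options extension. The fragment below is a minimal sketch, assuming an upstream server that closes idle keep-alive connections after 60 seconds; that figure is an assumption for illustration, so substitute your server’s actual setting.

```yaml
# Fragment: add to the existing cluster definition.
clusters:
- name: some_upstream_cluster
  # ... connect_timeout, type, load_assignment as already configured ...
  typed_extension_protocol_options:
    envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
      "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
      common_http_protocol_options:
        # Assumption: the upstream closes idle connections after 60s,
        # so Envoy drops them from its pool slightly earlier.
        idle_timeout: 50s
      explicit_http_config:
        http_protocol_options: {}  # speak HTTP/1.1 to the upstream
```

Keeping Envoy’s value below the server’s means Envoy closes the idle connection first, instead of sending a request down a connection the server is about to tear down.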

6. Large Request/Response Headers or Body (Less Likely for this specific error): While usually manifesting as different errors, extremely large headers or initial response data that exceeds buffer limits could theoretically lead to connection resets if not handled gracefully by intermediate components or the client. This is less likely for upstream_reset_before_response_started because it implies no response data was sent.
   * Diagnosis: Monitor header sizes and initial response payloads. Check for buffer-related errors in Envoy or upstream logs.
   * Fix: Optimize header sizes, or ensure buffer sizes are adequate. This is usually a symptom of a deeper issue.

After fixing the underlying cause, another 503 you may encounter is no_healthy_upstream, which Envoy returns when the target cluster has no healthy hosts to route the request to.
