The Envoy proxy is terminating gRPC streams before the request headers are fully processed, leading to STREAMS_RESET_BEFORE_HEADERS errors. This typically means a downstream client (your application) or an upstream service is encountering an issue that forces Envoy to abandon the stream early.
Here are the most common reasons and how to fix them:
1. Client-Side HTTP/2 Frame Size Exceeded
Your client is sending an HTTP/2 frame larger than Envoy is willing to accept. HTTP/2’s default maximum frame size (SETTINGS_MAX_FRAME_SIZE) is 16,384 bytes (16 KiB); a receiver can advertise a larger value, up to 16,777,215 bytes (just under 16 MiB).
Diagnosis:
Check your client’s gRPC logs for messages indicating frame size issues or timeouts during request initiation. On the Envoy side, enable debug logging for the http2 and pool loggers and look for messages reporting a frame that exceeds the maximum frame size (16384 bytes by default).
Fix:
Increase Envoy’s max_frame_size in the http_connection_manager configuration.
http_connection_manager:
  stat_prefix: ingress_http
  codec_type: AUTO
  http_filters:
  - name: envoy.filters.http.router
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  route_config:
    virtual_hosts:
    - name: service_routes
      domains: ["*"]
      routes:
      - match:
          prefix: "/"
        route:
          cluster: your_service_cluster
  http2_protocol_options:
    # HTTP/2 default is 16384; the largest legal value is 16777215 (2^24 - 1).
    # Confirm this field against the Http2ProtocolOptions reference for your Envoy version.
    max_frame_size: 16777215
This explicitly tells Envoy to accept larger HTTP/2 frames, allowing your client’s larger requests to pass through.
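As a quick sanity check before changing the setting, the sketch below encodes the frame-size limits from the HTTP/2 specification (RFC 7540, section 6.5.2) and estimates how many DATA frames a gRPC message of a given size needs. The function names are hypothetical helpers for illustration, not part of Envoy or any gRPC library, and the estimate assumes a single message carried with the standard 5-byte gRPC length prefix.

```python
import math

# Bounds for SETTINGS_MAX_FRAME_SIZE from RFC 7540, section 6.5.2.
MIN_MAX_FRAME_SIZE = 2**14      # 16,384 bytes, the initial (default) value
MAX_MAX_FRAME_SIZE = 2**24 - 1  # 16,777,215 bytes, the largest legal value

# gRPC prefixes each message with a 1-byte compressed flag and a 4-byte length.
GRPC_MESSAGE_PREFIX = 5


def is_valid_max_frame_size(value: int) -> bool:
    """Return True if `value` is a legal SETTINGS_MAX_FRAME_SIZE."""
    return MIN_MAX_FRAME_SIZE <= value <= MAX_MAX_FRAME_SIZE


def data_frames_needed(message_bytes: int,
                       max_frame_size: int = MIN_MAX_FRAME_SIZE) -> int:
    """Estimate the number of DATA frames needed for one gRPC message."""
    if not is_valid_max_frame_size(max_frame_size):
        raise ValueError(f"illegal max frame size: {max_frame_size}")
    payload = message_bytes + GRPC_MESSAGE_PREFIX
    return math.ceil(payload / max_frame_size)
```

For example, `data_frames_needed(1_000_000)` returns 62 at the default frame size, and `is_valid_max_frame_size(16777216)` returns `False` because that value is one byte past the protocol limit.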
2. Upstream Connection Timeout on TLS Handshake
If your upstream service uses TLS, a slow or failing TLS handshake between Envoy and the upstream can cause Envoy to time out before the gRPC headers are sent.
Diagnosis:
In Envoy’s access logs, look for upstream connection failures or timeouts (for example, the UF response flag or a non-empty %UPSTREAM_TRANSPORT_FAILURE_REASON% field, if your log format includes them). You can also enable debug logging for the connection and pool loggers in Envoy. On the upstream service side, check its logs for TLS handshake failures or long handshake times.
Fix:
Increase the connect_timeout for your upstream cluster.
clusters:
- name: your_service_cluster
  connect_timeout: 10s # Envoy defaults to 5s when unset; raise it if the TLS handshake is slow
  type: LOGICAL_DNS
  # ... other cluster config
  transport_socket:
    name: envoy.transport_sockets.tls
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
      sni: your.upstream.service.com
A longer connect timeout gives Envoy more time to complete the TLS handshake with the upstream service, preventing premature stream resets.
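To judge whether the handshake really is slow, it can help to time it from the host where Envoy runs. The following is a minimal sketch using only the Python standard library; `time_tls_connect` and `within_budget` are hypothetical helper names, the 50% headroom factor is an arbitrary assumption, and the measurement only approximates what Envoy itself would see.

```python
import socket
import ssl
import time


def time_tls_connect(host: str, port: int = 443, timeout: float = 5.0):
    """Return (tcp_seconds, tls_seconds) for one connection attempt."""
    context = ssl.create_default_context()
    context.set_alpn_protocols(["h2"])  # gRPC runs over HTTP/2
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        tcp_done = time.monotonic()
        # wrap_socket performs the TLS handshake before it returns.
        with context.wrap_socket(raw, server_hostname=host):
            tls_done = time.monotonic()
    return tcp_done - start, tls_done - tcp_done


def within_budget(tcp_s: float, tls_s: float, connect_timeout_s: float,
                  headroom: float = 0.5) -> bool:
    """True if connect plus handshake uses at most `headroom` of the timeout."""
    return tcp_s + tls_s <= connect_timeout_s * headroom
```

A typical use would be `tcp_s, tls_s = time_tls_connect("your.upstream.service.com")` followed by `within_budget(tcp_s, tls_s, 10.0)`; if the check fails repeatedly, the configured connect_timeout is probably too tight for that upstream.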
3. Upstream Service Crashing or Restarting
The upstream gRPC service itself might be crashing or restarting during the initial phase of processing a new request, before it can even acknowledge the headers.
Diagnosis:
Check the logs and health status of your upstream gRPC service. Look for any application-level errors, panics, or restarts that coincide with the STREAMS_RESET_BEFORE_HEADERS errors in Envoy.
Fix:
Address the root cause of the upstream service crashes. This could involve debugging application logic, fixing memory leaks, or increasing the service’s resource allocation (CPU, memory). The right fix depends on the service’s failure mode.
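One way to check whether resets coincide with restarts is to compare timestamps from both sources. The sketch below assumes you have already extracted reset times from Envoy’s logs and restart times from your orchestrator or service logs; the function name and the 30-second window are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta


def resets_near_restarts(reset_times, restart_times,
                         window=timedelta(seconds=30)):
    """Return the reset timestamps that fall within `window` of any restart."""
    return [
        reset for reset in reset_times
        if any(abs(reset - restart) <= window for restart in restart_times)
    ]
```

If most resets land inside the window, the upstream restarts are a likely cause; if few do, one of the other causes in this list is more probable.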
4. Envoy’s Idle Timeout on Upstream Connections
Envoy has idle timeouts for connections. If the upstream connection is established but no data is sent for a prolonged period by either side before the gRPC headers are sent, Envoy might close the connection.
Diagnosis:
Review Envoy’s cluster configuration for idle_timeout. If it’s set too low, it could be the culprit.
Fix:
Increase or disable the idle_timeout for the upstream cluster.
clusters:
- name: your_service_cluster
  # ... other cluster config
  common_http_protocol_options:
    idle_timeout: 300s # Example value; Envoy uses 1h when this is unset. Raise a lower configured value, or set 0s to disable
By increasing the idle timeout, you give the client and server more time to complete their initial communication before Envoy considers the connection stale and closes it.
5. Malformed HTTP/2 Frames from Client
While less common, a client might be sending malformed HTTP/2 frames that Envoy cannot parse, leading to a stream reset.
Diagnosis:
Enable debug logging for Envoy’s http2 logger and look for error messages related to frame parsing or decoding. If the logs are not detailed enough, you may need to inspect the raw bytes on the wire.
Fix:
Fix the client implementation so that it generates HTTP/2 frames that conform to the RFC. This is a client-side development task.
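When inspecting raw bytes, the first thing to decode is the 9-byte frame header defined in RFC 7540, section 4.1: a 24-bit length, an 8-bit type, 8 bits of flags, and a reserved bit followed by a 31-bit stream identifier. The sketch below decodes that header; `parse_frame_header` and `FrameHeader` are hypothetical names for illustration, and the code assumes you have already captured plaintext frame bytes.

```python
from typing import NamedTuple

# Frame type codes from RFC 7540, section 6.
FRAME_TYPES = {
    0x0: "DATA", 0x1: "HEADERS", 0x2: "PRIORITY", 0x3: "RST_STREAM",
    0x4: "SETTINGS", 0x5: "PUSH_PROMISE", 0x6: "PING", 0x7: "GOAWAY",
    0x8: "WINDOW_UPDATE", 0x9: "CONTINUATION",
}


class FrameHeader(NamedTuple):
    length: int      # payload length in bytes (24 bits)
    type_name: str   # symbolic frame type, or "UNKNOWN(0x..)"
    flags: int       # raw flag bits
    stream_id: int   # 31-bit stream identifier


def parse_frame_header(header: bytes) -> FrameHeader:
    """Decode the fixed 9-byte HTTP/2 frame header."""
    if len(header) < 9:
        raise ValueError("an HTTP/2 frame header is 9 bytes long")
    length = int.from_bytes(header[0:3], "big")
    frame_type = header[3]
    flags = header[4]
    # Mask off the reserved high bit of the stream identifier.
    stream_id = int.from_bytes(header[5:9], "big") & 0x7FFFFFFF
    type_name = FRAME_TYPES.get(frame_type, f"UNKNOWN(0x{frame_type:02x})")
    return FrameHeader(length, type_name, flags, stream_id)
```

For instance, `parse_frame_header(bytes.fromhex("00000d010400000001"))` decodes to a 13-byte HEADERS frame on stream 1 with the END_HEADERS flag (0x4) set. A length larger than the negotiated maximum frame size, or a HEADERS frame on stream 0, would point to a client bug.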
6. Envoy Circuit Breaker Limits (Less Likely for Before Headers)
While circuit breakers usually manifest as 503 errors or different reset reasons, in extreme cases, a severely overloaded Envoy might reset streams very early. However, this is less likely to specifically hit STREAMS_RESET_BEFORE_HEADERS as it typically occurs after some processing.
Diagnosis:
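Check Envoy’s statistics for circuit breaker trips, particularly for the upstream cluster. Look for the gauges cluster.<cluster_name>.circuit_breakers.<priority>.cx_open, rq_open, rq_pending_open, and rq_retry_open; a value of 1 means that breaker is currently open.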
Fix:
Adjust Envoy’s circuit breaker settings in the cluster.circuit_breakers configuration.
clusters:
- name: your_service_cluster
  # ... other cluster config
  circuit_breakers:
    thresholds:
    - priority: DEFAULT
      max_connections: 10000 # Example: increase limits
      max_requests: 10000
      max_retries: 1000
Increasing these limits allows Envoy to handle more concurrent connections, requests, or retries before tripping, potentially preventing early resets due to overload.
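Before raising any limits, it is worth confirming that a breaker is actually open. The sketch below scans the plain-text output of Envoy’s admin /stats endpoint, where each line has the form `name: value`, and lists the circuit breaker gauges that are non-zero. The function name is a hypothetical helper, and how you fetch the stats text (for example, from the admin listener configured in your bootstrap) is left as an assumption.

```python
def open_circuit_breakers(stats_text: str) -> list:
    """Return the names of circuit breaker gauges that are currently open."""
    open_gauges = []
    for line in stats_text.splitlines():
        name, sep, value = line.partition(": ")
        if not sep:
            continue  # skip lines that are not "name: value"
        is_breaker_gauge = ".circuit_breakers." in name and name.endswith("_open")
        if is_breaker_gauge and value.strip() != "0":
            open_gauges.append(name)
    return open_gauges
```

An empty result suggests circuit breaking is not the cause, which matches the expectation that this scenario is the least likely of the six.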
The next error you’ll likely encounter if all these are resolved is a more specific gRPC error from the upstream service, or a client-side network issue if connectivity is still problematic.