Envoy’s tracing capabilities, when combined with Zipkin and Jaeger, don’t just passively record requests; they actively participate in the request lifecycle, influencing how distributed systems perceive their own latency.
Let’s see this in action. Imagine a simple service mesh with two services, frontend and backend, and Envoy as their sidecar proxy.
# Envoy config for frontend service
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 10000 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: frontend
          # Envoy generates x-request-id itself when the client sends none (this is the default)
          generate_request_id: true
          route_config:
            name: frontend_routes
            virtual_hosts:
            - name: backend_service
              domains: ["backend"]
              routes:
              - match: { prefix: "/" }
                route:
                  cluster: backend_cluster
                # No manual header injection is needed here: the tracing provider
                # below adds the B3 trace headers to the upstream request.
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          # Tracing configuration (a field of the HTTP connection manager)
          tracing:
            provider:
              name: envoy.tracers.zipkin
              typed_config:
                "@type": type.googleapis.com/envoy.config.trace.v3.ZipkinConfig
                collector_cluster: zipkin
                # Path on the collector cluster, not a full URL
                collector_endpoint: "/api/v2/spans"
                collector_endpoint_version: HTTP_JSON
                shared_span_context: true
  # Other listener configs...
  clusters:
  - name: backend_cluster
    connect_timeout: 0.25s
    type: LOGICAL_DNS
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: backend_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: 10.0.0.2, port_value: 8080 } # Address of backend service
  - name: zipkin
    connect_timeout: 1s
    type: STATIC
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: zipkin
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: 172.19.0.2, port_value: 9411 } # Address of Zipkin collector
When a request hits frontend, Envoy intercepts it. If tracing is enabled, it checks for incoming trace headers (like x-b3-traceid, x-b3-spanid, x-b3-sampled). If they exist, it continues the trace. If not, it generates a new trace ID. Then, it sends the request to the backend_cluster. Crucially, before forwarding, it injects the trace ID and its own span information (as x-b3-spanid and x-b3-parentspanid) into the outgoing request headers. The backend service, if also configured with an Envoy sidecar, repeats this process. One caveat: Envoy cannot correlate an inbound request with the outbound calls an application makes while handling it, so the application itself must copy the trace headers from the request it received onto the requests it sends. Finally, both Envoys independently send their span data to the configured collector, which here is Zipkin (Jaeger can also accept Zipkin-format spans).
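The header handling described above can be sketched in a few lines of Python. This is a simplified model of B3 propagation, not Envoy’s implementation: the function name `next_hop_headers` is hypothetical, IDs are 64-bit random hex strings, and defaulting to "sampled" when no decision arrives is an assumption made for the example.

```python
import secrets


def next_hop_headers(incoming: dict) -> dict:
    """Build the B3 headers for an outgoing request from the incoming ones.

    Hypothetical helper for illustration; a simplified model of one proxy hop.
    """
    # HTTP header names are case-insensitive, so normalise them first.
    headers = {k.lower(): v for k, v in incoming.items()}
    out = {}

    trace_id = headers.get("x-b3-traceid")
    if trace_id is None:
        # No trace context arrived: this hop starts a new trace.
        trace_id = secrets.token_hex(8)  # 16 hex chars = 64-bit ID
    else:
        # Continue the trace: the caller's span becomes our parent.
        parent = headers.get("x-b3-spanid")
        if parent:
            out["x-b3-parentspanid"] = parent

    out["x-b3-traceid"] = trace_id
    out["x-b3-spanid"] = secrets.token_hex(8)  # this hop's own span
    # Carry the sampling decision through unchanged (assumed default: sampled).
    out["x-b3-sampled"] = headers.get("x-b3-sampled", "1")
    return out


if __name__ == "__main__":
    first_hop = next_hop_headers({})            # frontend proxy starts the trace
    second_hop = next_hop_headers(first_hop)    # backend proxy continues it
    print(first_hop)
    print(second_hop)
```

Feeding the first hop’s output into the second shows the chain forming: the trace ID stays the same, and the second hop’s parent span ID is the first hop’s span ID.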
The core problem this solves is visibility in distributed systems. Without tracing, debugging a slow request across multiple microservices is like trying to find a needle in a haystack blindfolded. You know something went wrong, but pinpointing where and why is nearly impossible. Envoy’s tracing acts as a breadcrumb trail, allowing you to reconstruct the entire request path, identify bottlenecks, and understand latency contributions from each hop.
Internally, Envoy delegates span creation to a pluggable tracer driver, such as the Zipkin or OpenTelemetry driver; Jaeger can ingest either format. (The older OpenTracing- and OpenCensus-based drivers have been deprecated in newer Envoy releases.) A span is a fundamental unit of work, representing a single operation (like an HTTP request). Each span has a unique ID, a trace ID (which groups all spans for a single distributed transaction), an optional parent span ID, a name, start and end timestamps, and tags (key-value pairs for metadata). Tracing is built into the HTTP connection manager rather than being a separate filter: for each traced request, Envoy creates a span that covers the time from receiving the request until the response completes, so it includes both the proxy’s own processing (routing, filter processing) and the time spent waiting for upstream services. When it forwards a request, it propagates the trace context (trace ID, span ID, sampling decision) via standardized HTTP headers. This allows downstream services to continue the same trace.
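To make the span model concrete, here is a minimal Python sketch of a span record and of how per-hop latency can be derived from a set of spans. The `Span` class and `self_time_us` helper are hypothetical names for illustration, not a Zipkin or Envoy data type, and the timings are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """Illustrative span record (hypothetical; not the Zipkin wire format)."""
    trace_id: str
    span_id: str
    name: str
    start_us: int                      # start timestamp, microseconds
    end_us: int                        # end timestamp, microseconds
    parent_id: Optional[str] = None    # None marks the root span
    tags: dict = field(default_factory=dict)

    @property
    def duration_us(self) -> int:
        return self.end_us - self.start_us


def self_time_us(span: Span, all_spans: List[Span]) -> int:
    """Time spent in this span excluding its direct children.

    Assumes the children do not overlap each other in time.
    """
    children = [
        s for s in all_spans
        if s.trace_id == span.trace_id and s.parent_id == span.span_id
    ]
    return span.duration_us - sum(c.duration_us for c in children)


# Made-up timings: the frontend hop takes 1200 us, of which 900 us is the backend.
frontend = Span("t1", "a", "frontend", start_us=0, end_us=1200,
                tags={"http.method": "GET"})
backend = Span("t1", "b", "backend", start_us=100, end_us=1000, parent_id="a")
spans = [frontend, backend]

if __name__ == "__main__":
    for s in spans:
        print(s.name, "total:", s.duration_us, "self:", self_time_us(s, spans))
```

A tracing backend does essentially this grouping by trace ID and parent ID when it renders a trace, which is how the latency contribution of each hop becomes visible.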
The most surprising aspect for many is how Envoy’s tracing configuration directly influences which requests get traced. Sampling is not a field of ZipkinConfig; it is set in the HTTP connection manager’s tracing block through client_sampling, random_sampling, and overall_sampling, each a percentage from 0 to 100 that defaults to 100. A common misconception is that tracing should always capture 100% of requests. However, for performance reasons and to avoid overwhelming tracing backends, you typically sample. This means that for every 100 requests, only, say, 5 might be traced, which corresponds to a random_sampling value of 5. This isn’t just about deciding whether a span is created; it’s about propagating that decision downstream. If Envoy decides not to sample a request, it typically marks the outgoing request as unsampled (with B3 headers, x-b3-sampled: 0), which tells downstream services to also skip tracing that specific request, thus saving resources throughout the entire trace.
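The idea of deciding once and propagating the decision can be sketched as follows. This is a simplified model of head-based sampling, not Envoy’s actual logic: `decide_sampled` and `outgoing_flag` are hypothetical helpers, and `percent` stands in for a sampling percentage between 0 and 100.

```python
import random


def decide_sampled(incoming: dict, percent: float, rng: random.Random) -> bool:
    """Return True if this request should be traced.

    Hypothetical helper: an upstream decision always wins; otherwise
    roll the dice against the configured percentage.
    """
    inherited = incoming.get("x-b3-sampled")
    if inherited in ("0", "1"):
        # An earlier hop already decided; honour it so the trace stays whole.
        return inherited == "1"
    # rng.random() is in [0.0, 1.0), so percent=100 always samples
    # and percent=0 never does.
    return rng.random() * 100.0 < percent


def outgoing_flag(sampled: bool) -> dict:
    """Header that carries the decision to the next hop."""
    return {"x-b3-sampled": "1" if sampled else "0"}


if __name__ == "__main__":
    rng = random.Random(0)
    traced = sum(decide_sampled({}, 5.0, rng) for _ in range(10_000))
    print(f"traced {traced} of 10000 requests at 5% sampling")
    # A downstream hop configured for 100% still skips an unsampled request.
    print(decide_sampled(outgoing_flag(False), 100.0, rng))
```

Because the decision is made at the first hop and honoured afterwards, a trace is either recorded at every hop or at none, rather than ending up with gaps.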
The next hurdle you’ll encounter is configuring the tracing provider itself, particularly when dealing with complex sampling strategies or custom metadata propagation.