Envoy’s tracing support, paired with a backend such as Zipkin or Jaeger, does more than passively record requests. Each proxy takes part in the request lifecycle: it makes the sampling decision, creates spans, and propagates trace context between hops, which determines what latency data a distributed system has about itself.

Let’s see this in action. Imagine a simple service mesh with two services, frontend and backend, and Envoy as their sidecar proxy.

# Envoy config for frontend service
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 10000 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: frontend
          route_config:
            name: frontend_routes
            virtual_hosts:
            - name: backend_service
              domains: ["backend"]
              routes:
              - match: { prefix: "/" }
                route:
                  cluster: backend_cluster
                  # No manual header injection is needed here: with tracing enabled, Envoy
                  # generates x-request-id and propagates the B3 trace headers itself.
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
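          # Tracing is a field of the HTTP connection manager, not an HTTP filter
          tracing:
            provider:
              name: envoy.tracers.zipkin
              typed_config:
                "@type": type.googleapis.com/envoy.config.trace.v3.ZipkinConfig
                collector_cluster: zipkin
                collector_endpoint: "/api/v2/spans" # Path only; the host comes from collector_cluster
                collector_endpoint_version: HTTP_JSON
                shared_span_context: true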
    # Other listener configs...

  clusters:
  - name: backend_cluster
    connect_timeout: 0.25s
    type: LOGICAL_DNS
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: backend_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: 10.0.0.2, port_value: 8080 } # Address of backend service
  - name: zipkin
    connect_timeout: 1s
    type: STATIC
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: zipkin
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: 172.19.0.2, port_value: 9411 } # Address of Zipkin collector

When a request hits frontend, Envoy intercepts it. With tracing enabled, it checks for incoming B3 trace headers (x-b3-traceid, x-b3-spanid, x-b3-sampled). If they exist, it continues that trace; if not, it generates a new trace ID and makes a sampling decision. It then routes the request to backend_cluster. Before forwarding, it injects the trace context into the outgoing request headers: the trace ID, its own span ID as x-b3-spanid, the parent span ID (if any) as x-b3-parentspanid, and the sampling decision as x-b3-sampled. The backend service, if it also runs an Envoy sidecar, repeats this process. One caveat: the application itself must copy these headers from each inbound request onto the outbound calls it makes, because Envoy cannot correlate the two on its own. Finally, each Envoy independently reports its spans to the configured collector (Zipkin in the config above).
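The header handling described above can be sketched in a few lines of Python. This is a simplified model of B3 propagation for a single hop, not Envoy's actual implementation; the helper names `new_id` and `propagate_b3` are hypothetical, and the sketch assumes every hop creates a child span rather than sharing the caller's span ID.

```python
import secrets

B3_TRACE_ID = "x-b3-traceid"
B3_SPAN_ID = "x-b3-spanid"
B3_PARENT_SPAN_ID = "x-b3-parentspanid"
B3_SAMPLED = "x-b3-sampled"


def new_id() -> str:
    """Return a random 64-bit ID as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def propagate_b3(incoming: dict, sampled_if_new: bool = True) -> dict:
    """Build the outgoing B3 headers for one hop (hypothetical helper).

    Assumption: each hop creates a child span of the caller's span.
    """
    headers = {key.lower(): value for key, value in incoming.items()}
    default_sampled = "1" if sampled_if_new else "0"
    outgoing = {}

    if B3_TRACE_ID in headers and B3_SPAN_ID in headers:
        # Continue the existing trace: keep the trace ID and record the
        # caller's span as the parent of the span created at this hop.
        outgoing[B3_TRACE_ID] = headers[B3_TRACE_ID]
        outgoing[B3_PARENT_SPAN_ID] = headers[B3_SPAN_ID]
        outgoing[B3_SAMPLED] = headers.get(B3_SAMPLED, default_sampled)
    else:
        # No usable context: start a new trace with no parent span.
        outgoing[B3_TRACE_ID] = new_id()
        outgoing[B3_SAMPLED] = default_sampled

    # Every hop gets its own span ID.
    outgoing[B3_SPAN_ID] = new_id()
    return outgoing


# frontend starts the trace, backend continues it
frontend_headers = propagate_b3({})
backend_headers = propagate_b3(frontend_headers)
```

After the two calls, both header sets share one trace ID, and the backend's parent span ID equals the frontend's span ID, which is what lets the collector stitch the two spans into a single trace.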

The core problem this solves is visibility in distributed systems. Without tracing, debugging a slow request across multiple microservices is like trying to find a needle in a haystack blindfolded. You know something went wrong, but pinpointing where and why is nearly impossible. Envoy’s tracing acts as a breadcrumb trail, allowing you to reconstruct the entire request path, identify bottlenecks, and understand latency contributions from each hop.

Internally, Envoy hands span creation and reporting to a pluggable tracer (the Zipkin tracer in this example; Jaeger collectors can be configured to accept Zipkin-format spans). A span is the fundamental unit of work, representing a single operation such as an HTTP request. Each span has a unique ID, a trace ID (which groups all spans for a single distributed transaction), a name, start and end timestamps, and tags (key-value pairs for metadata). Tracing is driven by the HTTP connection manager rather than by a separate filter: for each traced request, Envoy creates a span that covers the request as the proxy sees it, including the time spent waiting for the upstream service, and tags it with details such as the upstream cluster and the response code. When it forwards a request, it propagates the trace context (trace ID, span ID, sampling decision) via standardized HTTP headers. This allows downstream services to continue the same trace.
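To make the span fields concrete, here is a minimal sketch of a span and its serialization to the Zipkin v2 JSON shape (`traceId`, `id`, `parentId`, `name`, `timestamp`, `duration`, `tags`). The `Span` class is a hypothetical illustration, not Envoy's internal type, and all IDs, timestamps, and tag values are made up.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Span:
    """Illustrative span record (hypothetical, not Envoy's internal type)."""
    trace_id: str
    span_id: str
    name: str
    start_us: int  # start time, epoch microseconds
    end_us: int    # end time, epoch microseconds
    parent_id: Optional[str] = None
    tags: Dict[str, str] = field(default_factory=dict)

    def to_zipkin_v2(self) -> dict:
        """Serialize to the field names used by the Zipkin v2 span format."""
        doc = {
            "traceId": self.trace_id,
            "id": self.span_id,
            "name": self.name,
            "timestamp": self.start_us,
            "duration": self.end_us - self.start_us,  # microseconds
            "tags": dict(self.tags),
        }
        if self.parent_id is not None:
            doc["parentId"] = self.parent_id
        return doc


# Made-up values for a 12.5 ms request handled on the backend hop
span = Span(
    trace_id="463ac35c9f6413ad",
    span_id="a2fb4a1d1a96d312",
    parent_id="0020000000000001",
    name="backend",
    start_us=1_700_000_000_000_000,
    end_us=1_700_000_000_012_500,
    tags={"http.status_code": "200", "upstream_cluster": "backend_cluster"},
)

# A collector's /api/v2/spans endpoint accepts a JSON array of such spans
payload = json.dumps([span.to_zipkin_v2()])
```

The root span of a trace simply omits `parentId`; every other span points at the span that caused it, which is how the backend reconstructs the call tree.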

What surprises many people is that Envoy’s tracing configuration also decides which requests get traced at all. Sampling is not a field of ZipkinConfig; it lives on the HTTP connection manager’s tracing block, in the client_sampling, random_sampling, and overall_sampling fields. Each is a percentage and defaults to 100, so unless you lower them, every request is traced. In production, for performance reasons and to avoid overwhelming the tracing backend, you typically sample: setting random_sampling to a value of 5 (a percentage, not the fraction 0.05) traces roughly 5 of every 100 requests. The decision is not purely local, either. It travels downstream with the trace context in the x-b3-sampled header, so when a request is not sampled, downstream services can skip tracing it too, saving resources along the entire request path.
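The interaction between a local sampling rate and a propagated decision can be sketched as follows. This is a deliberately simplified model: the function name `decide_sampled` is hypothetical, and Envoy's real logic has more inputs than a single percentage.

```python
import random


def decide_sampled(incoming: dict, sample_percent: float, rng: random.Random) -> bool:
    """Decide whether this hop traces the request (hypothetical helper).

    An upstream decision carried in x-b3-sampled always wins; only requests
    without one are subject to the local sampling percentage.
    """
    upstream = incoming.get("x-b3-sampled")
    if upstream is not None:
        return upstream in ("1", "true")
    # rng.random() is uniform in [0, 1), so this is true for roughly
    # sample_percent percent of the requests that reach this branch.
    return rng.random() * 100.0 < sample_percent


rng = random.Random(42)  # seeded only to make the sketch repeatable

# Edge proxy: no incoming context, so the 5% local rate applies
edge_sampled = decide_sampled({}, 5.0, rng)

# Next hop: honors the edge decision regardless of its own rate
forwarded = {"x-b3-sampled": "1" if edge_sampled else "0"}
next_hop_sampled = decide_sampled(forwarded, 100.0, rng)
```

Because the second hop defers to the header, a trace is either recorded at every hop or at none, which avoids partial traces with missing spans.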

The next hurdle you’ll encounter is configuring the tracing provider itself, particularly when dealing with complex sampling strategies or custom metadata propagation.
