The most surprising thing about distributed tracing is that it doesn’t actually trace anything in the way you might imagine; it’s a reconstruction of events based on shared identifiers.

Let’s watch this happen with a simple example. Imagine two services: frontend and backend. When a request comes into frontend, it generates a unique trace ID. This ID is then propagated to backend along with the request. If backend then calls another service, database, it carries that same trace ID. Elastic APM, when configured correctly, captures these trace IDs and associated span IDs (which represent individual operations within a service) and allows you to stitch them together.

In the Elastic APM UI, this shows up as a waterfall diagram: the frontend request first, then the call to backend, and finally any calls backend makes. Each bar represents a span, and they’re visually linked by that common trace ID. Underneath, the agents send records along these lines:

// Example of what APM agents might send (simplified)
{
  "trace_id": "a1b2c3d4e5f67890",
  "transaction_id": "tx-frontend-123",
  "id": "span-frontend-abc",
  "name": "GET /api/items",
  "service": {"name": "frontend"},
  "timestamp": "2023-10-27T10:00:00.123Z",
  "duration": 50, // ms
  "parent_id": null // Root span
}
{
  "trace_id": "a1b2c3d4e5f67890",
  "transaction_id": "tx-backend-456", // Different transaction ID for backend, but same trace ID
  "id": "span-backend-def",
  "name": "db.query",
  "service": {"name": "backend"},
  "timestamp": "2023-10-27T10:00:00.150Z",
  "duration": 30, // ms
  "parent_id": "span-frontend-abc" // Links back to frontend's span
}
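To make the "reconstruction" idea concrete, here is a minimal Python sketch that rebuilds the waterfall from the two records above using nothing but trace_id and parent_id. The function name build_waterfall and the flattened "service" field are illustrative assumptions, not part of Elastic APM; the real UI does considerably more (timing offsets, transactions versus spans, and so on).

```python
from collections import defaultdict

# Simplified span records mirroring the example above (durations in ms).
spans = [
    {"trace_id": "a1b2c3d4e5f67890", "id": "span-frontend-abc",
     "name": "GET /api/items", "service": "frontend",
     "duration": 50, "parent_id": None},
    {"trace_id": "a1b2c3d4e5f67890", "id": "span-backend-def",
     "name": "db.query", "service": "backend",
     "duration": 30, "parent_id": "span-frontend-abc"},
]


def build_waterfall(records, trace_id):
    """Return indented lines describing one trace, root span first."""
    # Index the spans of this trace by the ID of their parent.
    children = defaultdict(list)
    for record in records:
        if record["trace_id"] == trace_id:
            children[record["parent_id"]].append(record)

    lines = []

    def walk(parent_id, depth):
        # Emit each child of parent_id, then recurse into its own children.
        for span in children.get(parent_id, []):
            indent = "  " * depth
            lines.append(
                f"{indent}{span['service']}: {span['name']} ({span['duration']} ms)"
            )
            walk(span["id"], depth + 1)

    walk(None, 0)  # Root spans are the ones with no parent.
    return lines


for line in build_waterfall(spans, "a1b2c3d4e5f67890"):
    print(line)
```

Nothing here observes the request as it happens. The tree exists only because both records carry the same trace_id and one names the other as its parent.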

The problem distributed tracing solves is the "black box" effect of microservices. In a monolithic application, you could step through code line by line to understand a request’s flow. With microservices, a single user-facing request might involve dozens of independent services. Without tracing, pinpointing where latency occurs or where an error originates becomes incredibly difficult. You’d see an error reported by frontend, but was it frontend’s fault, or did backend fail to respond? Tracing answers this by showing the complete journey of that request across all services.

Internally, Elastic APM agents (for languages like Java, Python, Node.js, Go, etc.) are responsible for two key things:

  1. Instrumentation: They automatically (or with minimal code changes) hook into common libraries and frameworks used by your services. For example, they’ll hook into HTTP client libraries to capture outgoing requests and into web frameworks to capture incoming requests.
  2. Context Propagation: This is the critical part for correlation. When a service makes a call to another, the agent injects the current trace_id and the id of the current span (which becomes the parent_id for the next span) into the outgoing request’s headers. Elastic APM agents use the traceparent and tracestate headers defined by the W3C Trace Context standard; older agent versions also used the Elastic-specific elastic-apm-traceparent header. The receiving service’s agent then reads these headers to establish the parent-child relationship.
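As a rough illustration of the propagation step, the sketch below writes and reads a traceparent header by hand. The header layout (version 00, a 32-hex trace ID, a 16-hex parent ID, 2-hex flags) follows the W3C Trace Context format; the inject and extract helpers are hypothetical names for this example, and a real agent does this for you. The sketch also skips details such as case-insensitive header lookup and the spec’s rejection of all-zero IDs.

```python
import re
import secrets

# W3C Trace Context: version "00", 32-hex trace-id, 16-hex parent-id, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers, trace_id, span_id, sampled=True):
    """Write the current trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Read trace context from incoming headers; None if absent or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


# Caller side: the frontend's current span becomes the parent of the next span.
trace_id = secrets.token_hex(16)         # 32 hex characters
frontend_span_id = secrets.token_hex(8)  # 16 hex characters
outgoing = inject({}, trace_id, frontend_span_id)

# Receiver side: the backend continues the same trace under a new span ID.
context = extract(outgoing)
backend_span = {
    "trace_id": context["trace_id"],
    "parent_id": context["parent_id"],
    "id": secrets.token_hex(8),
}
```

The backend never learns anything about the frontend except what is in that one header, which is why a stripped or unforwarded header breaks the chain.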

The exact levers you control are primarily around how the agents are configured and deployed. This includes:

  • Service Name: Crucial for identifying individual services in the APM UI. Set via ELASTIC_APM_SERVICE_NAME environment variable or service_name in the agent’s config.
  • Server URL: Where the APM agent sends its data. Set via ELASTIC_APM_SERVER_URL.
  • Environment: To differentiate between development, staging, and production data. Set via ELASTIC_APM_ENVIRONMENT.
  • Sample Rate: To control what fraction of requests are traced. If you have millions of requests, you might sample only 1% (a value of 0.01) to reduce overhead. Set via ELASTIC_APM_TRANSACTION_SAMPLE_RATE, which takes a number between 0.0 and 1.0.
  • Custom Spans: For operations not automatically instrumented (e.g., specific business logic blocks), you can manually create spans within your code.
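To show how these settings fit together, here is a small Python sketch that reads the four environment variables named above and applies the sample rate once, at the start of a trace. AgentSettings and should_sample are hypothetical names for illustration, not the agent’s API, and the fallback values for service name and environment are placeholders chosen for this example.

```python
import os
import random
from dataclasses import dataclass


@dataclass
class AgentSettings:
    service_name: str
    server_url: str
    environment: str
    transaction_sample_rate: float

    @classmethod
    def from_env(cls, env=os.environ):
        """Build settings from a mapping of environment variables."""
        rate = float(env.get("ELASTIC_APM_TRANSACTION_SAMPLE_RATE", "1.0"))
        if not 0.0 <= rate <= 1.0:
            raise ValueError("sample rate must be between 0.0 and 1.0")
        return cls(
            # Placeholder fallbacks; set these explicitly in real deployments.
            service_name=env.get("ELASTIC_APM_SERVICE_NAME", "unknown-service"),
            server_url=env.get("ELASTIC_APM_SERVER_URL", "http://localhost:8200"),
            environment=env.get("ELASTIC_APM_ENVIRONMENT", "development"),
            transaction_sample_rate=rate,
        )


def should_sample(settings, rng=random.random):
    """Decide once, when the trace starts, whether to record it."""
    return rng() < settings.transaction_sample_rate


settings = AgentSettings.from_env({
    "ELASTIC_APM_SERVICE_NAME": "frontend",
    "ELASTIC_APM_ENVIRONMENT": "production",
    "ELASTIC_APM_TRANSACTION_SAMPLE_RATE": "0.01",
})
```

Passing a plain dict to from_env keeps the example testable; in a running service the default os.environ would be used. Injecting rng makes the sampling decision deterministic in tests.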

A subtle but important aspect of context propagation is that it depends on the transport actually carrying the context. HTTP headers cover the common case, but if you’re using a custom RPC framework or a message queue that the agent doesn’t instrument, nothing forwards the trace context unless you configure or code it to, and the context is lost. The symptom is a trace that starts in one service and then abruptly ends, with no downstream spans linked to it, even though the downstream service is sending APM data; its spans simply show up under a separate trace. The agent can’t magically re-establish the link; it relies on that propagated ID.
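The sketch below imitates that failure mode with an in-memory queue standing in for a real broker. The message envelope (a dict with "headers" and "body") and the publish and consume helpers are assumptions made for this example; real brokers and agents expose their own mechanisms for message metadata.

```python
import queue
import secrets

messages = queue.Queue()  # Stand-in for a real message broker.


def publish(body, trace_id, span_id):
    """Producer: carry the trace context in the message's own metadata."""
    messages.put({
        "headers": {"traceparent": f"00-{trace_id}-{span_id}-01"},
        "body": body,
    })


def consume():
    """Consumer: continue the trace if context arrived, else start a new one."""
    message = messages.get_nowait()
    header = message.get("headers", {}).get("traceparent")
    if header:
        _version, trace_id, parent_id, _flags = header.split("-")
        linked = True
    else:
        # Context was lost in transit: this work becomes the root of a
        # new trace that is not linked to the producer's trace.
        trace_id, parent_id, linked = secrets.token_hex(16), None, False
    return {"trace_id": trace_id, "parent_id": parent_id,
            "linked": linked, "body": message["body"]}


trace_id = secrets.token_hex(16)
span_id = secrets.token_hex(8)

publish("order-created", trace_id, span_id)
linked_span = consume()    # Same trace_id, parent is the producer's span.

messages.put({"body": "order-created"})  # A producer that forwards no context.
orphan_span = consume()    # New trace_id, no parent.
```

Both consumers do the same work and both would report data; only the first can be stitched into the producer’s waterfall.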

The next concept you’ll likely grapple with is understanding how to effectively use this correlated data to diagnose performance bottlenecks, not just errors.
