Datadog APM can trace requests as they hop between services, revealing bottlenecks that would otherwise be invisible.

Let’s see it in action. Imagine a user request hitting a web frontend, which then calls a user service, which in turn queries a database.

Here’s a simplified representation of what that might look like in Datadog:

[Web Frontend] --> [User Service] --> [Database]

In Datadog, this would appear as a "trace," a visual representation of that single request’s journey. You’d see the total duration of the request and the time spent in each service.

Trace: Total Duration 500ms
  - Web Frontend: 50ms
  - User Service: 300ms
    - Database Call: 250ms
  - Web Frontend (response): 50ms

This allows you to pinpoint where the slowdown is. In this example, the User Service accounts for 300ms of the request, and 250ms of that is spent waiting on the database call, so the query is the first place to look.
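
To make that reasoning concrete, here is a minimal Python sketch that computes each span's "self time" (its duration minus the time spent in its direct children) from the numbers in the example. The row layout and the function names are assumptions made for illustration; this is not how Datadog stores or analyzes traces.

```python
from collections import defaultdict

# (name, duration_ms, parent) rows copied from the example trace above.
ROWS = [
    ("Web Frontend", 50, None),
    ("User Service", 300, None),
    ("Database Call", 250, "User Service"),
    ("Web Frontend (response)", 50, None),
]


def self_times(rows):
    """Return {name: duration minus time spent in direct children}."""
    child_total = defaultdict(int)
    for _name, duration, parent in rows:
        if parent is not None:
            child_total[parent] += duration
    return {name: duration - child_total[name] for name, duration, _parent in rows}


def slowest(rows):
    """Return the name of the span with the largest self time."""
    times = self_times(rows)
    return max(times, key=times.get)
```

Called on ROWS, self_times attributes only 50ms to the User Service's own code and 250ms to the database call, which is why slowest(ROWS) points at the query rather than at the service that issued it.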

The Problem Datadog APM Solves

Before distributed tracing, if a user reported slow performance, you’d be staring at logs across multiple services, trying to correlate them by timestamp. It’s a needle-in-a-haystack operation. Datadog APM links these requests together automatically using trace IDs and span IDs.

  • Trace ID: A unique identifier for an entire end-to-end request. All spans (individual operations within a request) belonging to the same request share the same trace ID.
  • Span ID: A unique identifier for a specific operation within a trace, like a single HTTP call to a service or a database query. Spans can have parent-child relationships, forming the tree structure of a trace.
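
The relationship between the two identifiers can be sketched as a small data structure. The Span class, the ID values, and the operation names below are hypothetical and chosen only to mirror the frontend, user service, and database example; real tracers use 64-bit or 128-bit IDs and carry many more fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: int              # shared by every span in the same request
    span_id: int               # unique to this one operation
    parent_id: Optional[int]   # span_id of the caller, None for the root
    name: str


def render(spans):
    """Return indented lines showing the span tree of a single trace."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append("  " * depth + span.name)
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Hypothetical spans for one request; all three share trace_id=1.
SPANS = [
    Span(trace_id=1, span_id=10, parent_id=None, name="web-frontend GET /profile"),
    Span(trace_id=1, span_id=11, parent_id=10, name="user-service GET /users/42"),
    Span(trace_id=1, span_id=12, parent_id=11, name="database SELECT users"),
]
```

The tree is never stored explicitly. It is rebuilt from the flat list by following each span's parent_id, which is why spans from different services can be sent independently and still be assembled into one trace.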

How it Works Internally

Datadog’s APM relies on instrumenting your application code. This means adding small pieces of code (often via libraries or agents) that:

  1. Start a trace: When a request enters your system (e.g., an incoming HTTP request to your web frontend), the tracing library starts a new trace and assigns it a trace ID.
  2. Create spans: As the request moves between services or performs operations (like a database query), new spans are created. Each span records its start time, end time, operation name, and relevant tags (e.g., service name, endpoint, HTTP method, status code).
  3. Propagate context: Crucially, when one service calls another, the trace ID and the parent span ID are "injected" into the outgoing request headers. The receiving service’s tracing library reads these headers, uses the existing trace ID, and creates a new child span linked to the parent.
  4. Send spans to Datadog: The tracing library periodically flushes finished spans to the Datadog Agent, which forwards them to the Datadog backend for aggregation and analysis.
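
Steps 1, 2, and 4 can be sketched inside a single process with a toy tracer. ToyTracer and everything in it are hypothetical and are not the ddtrace API; the sketch only illustrates the bookkeeping a tracing library does, and it leaves out cross-service propagation (step 3).

```python
import random
import time
from contextlib import contextmanager


class ToyTracer:
    """Illustrative only; not the API of any Datadog library."""

    def __init__(self):
        self.buffer = []   # finished spans waiting to be flushed
        self._stack = []   # currently open spans, innermost last

    @contextmanager
    def span(self, name, **tags):
        parent = self._stack[-1] if self._stack else None
        record = {
            # Step 1: a root span starts a new trace; children reuse its ID.
            "trace_id": parent["trace_id"] if parent else random.getrandbits(64),
            "span_id": random.getrandbits(64),
            "parent_id": parent["span_id"] if parent else None,
            "name": name,
            "tags": tags,
            "start": time.monotonic(),
        }
        self._stack.append(record)
        try:
            yield record
        finally:
            # Step 2: record the duration once the operation finishes.
            record["duration"] = time.monotonic() - record["start"]
            self._stack.pop()
            self.buffer.append(record)

    def flush(self, send):
        """Step 4: hand the buffered spans to `send` and clear the buffer."""
        batch, self.buffer = self.buffer, []
        send(batch)
        return len(batch)


tracer = ToyTracer()
with tracer.span("http.request", service="web-frontend") as root:
    with tracer.span("db.query", service="user-service") as child:
        time.sleep(0.01)  # stand-in for real work

sent = []
flushed = tracer.flush(sent.extend)  # a real library would send over the network
```

Spans are buffered and flushed in batches rather than sent one at a time, so tracing adds little latency to the request being measured.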

Configuration Levers

You control Datadog APM mostly through configuration, set in a few places:

  • Agent Configuration: The Datadog Agent needs to be configured to listen for and collect traces. This often involves setting DD_APM_ENABLED=true and specifying the DD_SITE.
  • Application Instrumentation:
    • Environment Variables: Many integrations use environment variables for configuration, like DD_SERVICE (to name your service) and DD_TRACE_ENABLED=true.
    • Code-level Configuration: For more advanced use cases, you can programmatically configure tracing within your application code, setting custom tags, sampling rates, or defining specific spans.
  • Service Mapping: You explicitly tell Datadog which traces belong to which service, typically via the DD_SERVICE environment variable. This is how Datadog groups your traces and builds the service map.
  • Sampling: To manage the volume of trace data, you can configure sampling, so that only a fraction of traces is kept. Common settings include DD_TRACE_SAMPLE_RATE and DD_TRACE_SAMPLING_RULES in the tracing library, or a traces-per-second limit in the Agent.
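
As a rough illustration of the environment-variable approach, the sketch below reads three of the variables mentioned above into a typed object. The TracingConfig class, the load_tracing_config helper, and the fallback values are assumptions made for this example; the real tracing libraries parse these variables themselves, so you would not normally write this code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TracingConfig:
    service: str
    enabled: bool
    sample_rate: float


def load_tracing_config(env=None):
    """Read tracing settings from `env` (defaults to the process environment)."""
    env = os.environ if env is None else env

    # Fallback values here are assumptions for the sketch.
    rate = float(env.get("DD_TRACE_SAMPLE_RATE", "1.0"))
    if not 0.0 <= rate <= 1.0:
        raise ValueError(f"DD_TRACE_SAMPLE_RATE must be between 0 and 1, got {rate}")

    return TracingConfig(
        service=env.get("DD_SERVICE", "unnamed-service"),
        enabled=env.get("DD_TRACE_ENABLED", "true").strip().lower() in ("true", "1"),
        sample_rate=rate,
    )


config = load_tracing_config({"DD_SERVICE": "user-service", "DD_TRACE_SAMPLE_RATE": "0.1"})
```

Keeping these settings in the environment rather than in code means the same build can run with different service names and sampling rates in staging and production.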

Consider the DD_TRACE_SAMPLE_RATE environment variable. Setting DD_TRACE_SAMPLE_RATE=0.1 means that only 10% of traces will be sent to Datadog. This is vital for high-traffic services to avoid overwhelming your Datadog account with data and incurring high costs, while still providing enough visibility to catch intermittent issues.
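
One way to picture a rate-based sampler is as a deterministic function of the trace ID, so that every service seeing the same trace reaches the same keep-or-drop decision. The sketch below uses a simple multiplicative hash; the constant and the keep_trace function are assumptions for illustration and should not be read as Datadog's exact algorithm.

```python
import random

# Any large odd multiplier spreads IDs across the 64-bit range;
# this particular value is an arbitrary choice for the sketch.
MULTIPLIER = 1111111111111111111
MODULUS = 2 ** 64


def keep_trace(trace_id, sample_rate):
    """Return True if the trace should be kept at the given rate (0.0 to 1.0)."""
    hashed = (trace_id * MULTIPLIER) % MODULUS
    return hashed < sample_rate * MODULUS


# Estimate the kept fraction over many random 64-bit trace IDs.
rng = random.Random(0)
trace_ids = [rng.getrandbits(64) for _ in range(100_000)]
kept_fraction = sum(keep_trace(t, 0.1) for t in trace_ids) / len(trace_ids)
```

With a rate of 0.1, kept_fraction comes out close to 0.1. Because the decision depends only on the trace ID, a trace is either kept whole or dropped whole instead of losing individual spans.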

The system correlates requests automatically by injecting trace context into outgoing HTTP headers. In Datadog's own propagation format these are x-datadog-trace-id and x-datadog-parent-id (HTTP header names are case-insensitive), and Datadog tracers also support the W3C Trace Context traceparent header. When a downstream service receives a request with these headers, its tracing library picks them up and creates a new span that is a child of the incoming trace context. This propagation is the magic behind linking disparate service calls into a single, coherent trace.
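
The inject and extract halves of propagation can be sketched with plain dictionaries standing in for HTTP headers. The two header names come from the text above; the inject, extract, and start_span helpers are hypothetical and are not part of any Datadog library, which handles this step for you.

```python
import random

TRACE_HEADER = "x-datadog-trace-id"
PARENT_HEADER = "x-datadog-parent-id"


def start_span(name, context=None):
    """Start a root span, or a child span if an upstream context is given."""
    return {
        "trace_id": context["trace_id"] if context else random.getrandbits(64),
        "span_id": random.getrandbits(64),
        "parent_id": context["parent_id"] if context else None,
        "name": name,
    }


def inject(span, headers):
    """Caller side: copy the current span's identity into outgoing headers."""
    headers[TRACE_HEADER] = str(span["trace_id"])
    headers[PARENT_HEADER] = str(span["span_id"])
    return headers


def extract(headers):
    """Callee side: rebuild the upstream context, or None if it is absent."""
    lowered = {key.lower(): value for key, value in headers.items()}
    if TRACE_HEADER not in lowered or PARENT_HEADER not in lowered:
        return None
    return {
        "trace_id": int(lowered[TRACE_HEADER]),
        "parent_id": int(lowered[PARENT_HEADER]),
    }


# The frontend makes a call; the user service continues the same trace.
frontend_span = start_span("web-frontend")
outgoing = inject(frontend_span, {"Accept": "application/json"})
user_service_span = start_span("user-service", extract(outgoing))
```

If extract returns None, start_span falls back to a fresh trace ID. That is what happens when an uninstrumented proxy or client strips the headers, and it is the usual reason a trace appears broken into two.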

The next step is to explore how to use these traces to automatically detect and alert on anomalies in your service performance.
