Datadog APM trace sampling is not just a knob to turn to save money; it’s a fundamental mechanism for shaping your observability data and, by extension, your understanding of your application’s performance.
Let’s see it in action. Imagine you have a microservice, user-service, that handles millions of requests a day. Without sampling, every single trace generated by this service would be sent to Datadog. If each trace is, say, 100KB, then 10 million requests a day amounts to roughly 1 TB of data daily, which quickly becomes unmanageable and expensive.
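To make the arithmetic concrete, here is a back-of-the-envelope sketch. The 100KB trace size comes from the example above; the figure of 10 million requests a day and the helper name daily_volume_gb are assumptions chosen purely for illustration.

```python
def daily_volume_gb(requests_per_day: int, trace_size_kb: float, sample_rate: float = 1.0) -> float:
    """Estimate daily trace volume in GB (decimal units: 1 GB = 1,000,000 KB)."""
    return requests_per_day * trace_size_kb * sample_rate / 1_000_000


# Compare the unsampled volume with two lower sample rates.
for rate in (1.0, 0.5, 0.1):
    volume = daily_volume_gb(10_000_000, 100, rate)
    print(f"rate={rate:.1f}: {volume:,.0f} GB/day")
```

Under these assumptions the unsampled volume is about 1 TB a day, and it scales linearly with the sample rate.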
Here’s what a typical trace might look like in Datadog:
{
  "trace_id": "a1b2c3d4e5f67890",
  "span_id": "0987654321fedcba",
  "name": "POST /users",
  "service": "user-service",
  "resource": "POST /users",
  "start": 1678886400123456789,
  "duration": 50000000,
  "http": {
    "method": "POST",
    "url": "/users",
    "status_code": 201
  },
  "error": 0,
  "meta": {
    "usr.id": "12345",
    "http.useragent": "curl/7.64.1",
    "dd.trace_sampling_decision": "true"
  },
  "metrics": {
    "system.cpu.user.percent": 15.2
  }
}
The dd.trace_sampling_decision tag in this example is the key field: it records the sampling decision for the trace. If it is true, the trace is sent to Datadog. If it is false, the trace is dropped before it leaves the agent.
The Problem: Observability vs. Cost
The core problem Datadog trace sampling solves is the explosion of data generated by comprehensive APM. You want to see everything to catch those rare, critical errors or performance bottlenecks. But sending everything is prohibitively expensive and can overwhelm your observability platform, making it harder to find signal in the noise. Sampling allows you to get a representative view of your application’s behavior without ingesting every single trace.
How it Works Internally: The Agent’s Role
Datadog’s APM tracing typically involves an agent (or sidecar) running alongside your application. This agent intercepts outgoing spans, assembles them into traces, and then decides whether to send them to the Datadog backend. Sampling rules are configured on this agent.
There are two primary sampling strategies:
- Head-based sampling: This is the default and most common. The decision to sample or drop a trace is made at the beginning of the trace (the root span). If the root span is sampled, all subsequent spans within that trace are also sent. This ensures that you get the full context for any trace that is captured.
- Tail-based sampling: This is more advanced and usually configured in the Datadog backend. Here, traces are sent to the backend first, and then a sampling decision is made based on the entire trace’s characteristics (e.g., if it’s an error, if it has high latency, or based on specific tags). This is more powerful for capturing rare events but incurs higher ingestion costs because all traces are initially sent. For cost control, we’re primarily concerned with head-based sampling.
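The defining property of head-based sampling, one decision per trace that every span inherits, can be sketched as follows. This is a minimal illustration, not Datadog’s actual algorithm: the SHA-256 hashing scheme and the keep_trace helper are assumptions made for the example, as are the child span names.

```python
import hashlib


def keep_trace(trace_id: str, rate: float) -> bool:
    """Head-based decision derived only from the trace ID.

    Because every span in a trace carries the same trace ID, every span
    gets the same answer, so a trace is kept or dropped as a whole.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < int(rate * 2**64)


# Three spans of one trace (the child span names are hypothetical).
spans = [
    {"trace_id": "a1b2c3d4e5f67890", "name": "POST /users"},
    {"trace_id": "a1b2c3d4e5f67890", "name": "db.query"},
    {"trace_id": "a1b2c3d4e5f67890", "name": "cache.set"},
]
decisions = {keep_trace(span["trace_id"], 0.5) for span in spans}
print(len(decisions))  # all spans share a single decision
```

Deriving the decision from the trace ID rather than from a fresh random draw per span is one way to guarantee that a sampled trace is never missing spans.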
Controlling the Levers: Configuration
You configure sampling primarily through the Datadog Agent’s configuration or via environment variables for language agents.
The most fundamental setting is the trace_sampling.rate. This is a floating-point number between 0.0 and 1.0.
- trace_sampling.rate: 1.0: Sample 100% of traces (no sampling). This is what you want for debugging a specific issue, but not for general production monitoring at scale.
- trace_sampling.rate: 0.1: Sample 10% of traces. The agent will randomly select roughly 10% of the root spans it encounters and send their complete traces.
- trace_sampling.rate: 0.0: Sample 0% of traces (effectively disabling tracing, though the agent still collects spans locally).
This rate is applied globally by default. However, you can get much more granular using sampling rules.
Example Configuration (datadog.yaml on the agent):
apm_config:
  enabled: true
  trace_sampling:
    rate: 0.5 # Default rate for everything not matched by a specific rule
    rules:
      # Rule 1: Sample 100% of traces for the 'payment-service'
      - service: "payment-service"
        rate: 1.0
      # Rule 2: Sample only 10% of traces for the 'POST /users' operation in 'user-service'
      - service: "user-service"
        name: "POST /users" # Matches specific operation
        rate: 0.1
      # Rule 3: Sample 50% of traces that have an error tag set
      - tag: "error"
        value: "true"
        rate: 0.5
In this example:
- Traces from payment-service are always kept (rate 1.0). This is useful for critical services where you need full visibility.
- Traces for POST /users in user-service are sampled at 10% (rate 0.1). This drastically reduces data volume for a high-traffic, potentially less critical endpoint.
- Any other trace that has an error tag set to true is sampled at 50%. This is a hybrid approach: you don’t keep every error (to save cost), but you can give errors a better chance of being captured than ordinary traffic. Note that 0.5 equals the default rate in this example, so the rule only has that effect once you lower the default below 0.5.
- Any trace not matching these rules falls back to the global rate: 0.5.
The order of rules matters. The first matching rule is applied.
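First-match evaluation can be sketched in a few lines. This is an illustrative model of the example configuration above, not the agent’s implementation: the Rule class, the rate_for helper, and the cart-service and POST /charge names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rule:
    rate: float
    service: Optional[str] = None
    name: Optional[str] = None
    tags: dict = field(default_factory=dict)

    def matches(self, span: dict) -> bool:
        """A rule matches when every criterion it specifies equals the span's value."""
        if self.service is not None and span.get("service") != self.service:
            return False
        if self.name is not None and span.get("name") != self.name:
            return False
        meta = span.get("meta", {})
        return all(meta.get(key) == value for key, value in self.tags.items())


DEFAULT_RATE = 0.5
RULES = [
    Rule(service="payment-service", rate=1.0),
    Rule(service="user-service", name="POST /users", rate=0.1),
    Rule(tags={"error": "true"}, rate=0.5),
]


def rate_for(root_span: dict, rules=RULES, default=DEFAULT_RATE) -> float:
    """Return the rate of the first matching rule, or the default if none match."""
    for rule in rules:
        if rule.matches(root_span):
            return rule.rate
    return default


failing_payment = {"service": "payment-service", "name": "POST /charge", "meta": {"error": "true"}}
failing_signup = {"service": "user-service", "name": "POST /users", "meta": {"error": "true"}}
failing_cart = {"service": "cart-service", "name": "GET /cart", "meta": {"error": "true"}}

print(rate_for(failing_payment))  # matched by rule 1
print(rate_for(failing_signup))   # matched by rule 2, so the error rule is never reached
print(rate_for(failing_cart))     # matched by rule 3
```

The second case is the one to watch: a failing POST /users request is still sampled at 10%, because the service rule sits above the error rule. If errors should take priority, the error rule would need to come first.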
Perhaps the most surprising thing about Datadog trace sampling is that it is probabilistic, not deterministic, by default. When you set a rate like 0.1, it doesn’t mean "exactly 1 out of every 10 traces." It means each trace has a 10% chance of being sampled. Over a large volume, this averages out, but over short periods you might see more or fewer than your target percentage.
This probabilistic nature is key to its usefulness. If you were to sample deterministically, say every 10th trace that comes in, the selection could line up with periodic patterns in your traffic and consistently over- or under-represent certain requests. Random selection avoids that bias and gives a more representative snapshot of overall application behavior.
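A small simulation shows how much the kept fraction varies with volume. It is a sketch using Python’s standard random module with an arbitrary fixed seed; the sampled_counts helper is a name invented for this example.

```python
import random


def sampled_counts(rate: float, window: int, windows: int, seed: int = 42) -> list:
    """For each window of `window` traces, count how many are kept at `rate`."""
    rng = random.Random(seed)
    return [sum(rng.random() < rate for _ in range(window)) for _ in range(windows)]


# 1,000 short windows of 10 traces each, and one long window of 10,000 traces.
small = sampled_counts(rate=0.1, window=10, windows=1000)
large = sampled_counts(rate=0.1, window=10_000, windows=1)

print("windows of 10 with exactly 1 kept:", small.count(1), "of", len(small))
print("windows of 10 with nothing kept:", small.count(0), "of", len(small))
print("kept out of 10,000:", large[0])
```

For independent 10% draws, the binomial distribution says a window of 10 traces contains exactly one sampled trace only about 39% of the time, and none at all about 35% of the time. Over 10,000 traces the kept count settles close to 1,000.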
The Next Step: Custom Tags for Granular Sampling
Once you’ve got basic sampling by service or endpoint, the next frontier is using custom tags. You can tag traces with business-relevant information (e.g., customer_tier: premium, order_id: 12345) and then create sampling rules based on those tags. This allows you to, for instance, sample 100% of traces for premium customers but only 5% for standard customers, or ensure that traces associated with specific order_ids are always captured. This requires instrumenting your application to add these tags, but the payoff is incredibly fine-grained control over your APM data and costs.
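As a rough sketch of the idea, the function below picks a sample rate from custom tags. It assumes the tags are already set on the root span when the sampling decision is made. The tag names and the 100% and 5% rates come from the paragraph above; the fallback rate, the lookup tables, and the rate_for_tags helper are hypothetical.

```python
# Traces tied to these order IDs are always kept (hypothetical watch list).
ALWAYS_KEEP_ORDER_IDS = {"12345"}

# Per-tier sample rates, as described above.
TIER_RATES = {"premium": 1.0, "standard": 0.05}

# Rate for traces with no recognized tier (an assumption for this sketch).
FALLBACK_RATE = 0.05


def rate_for_tags(tags: dict) -> float:
    """Choose a sample rate from business tags on the root span."""
    if tags.get("order_id") in ALWAYS_KEEP_ORDER_IDS:
        return 1.0
    return TIER_RATES.get(tags.get("customer_tier"), FALLBACK_RATE)


print(rate_for_tags({"customer_tier": "premium"}))
print(rate_for_tags({"customer_tier": "standard"}))
print(rate_for_tags({"customer_tier": "standard", "order_id": "12345"}))
```

Checking the order ID before the tier means a watched order is always captured, even for a standard customer.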