Elastic APM’s sampling is designed to reduce the volume of trace data without losing crucial insights into application performance.
Here’s a simplified example of a transaction document from a sampled trace:
{
  "trace_id": "a1b2c3d4e5f67890",
  "transaction_id": "1a2b3c4d5e6f7890",
  "name": "GET /api/users/:id",
  "type": "request",
  "timestamp": "2023-10-27T10:00:00.123Z",
  "duration": 150.5,
  "result": "HTTP 200",
  "span_count": {
    "total": 15,
    "started": 15
  },
  "sampled": true,
  "system": {
    "hostname": "my-app-server-01"
  },
  "service": {
    "name": "user-service",
    "version": "1.2.0"
  },
  "user": {
    "id": "user123"
  },
  "tags": {
    "http.status_code": 200,
    "url.path": "/api/users/:id"
  }
}
Head-based sampling happens in the agent, before any trace data is sent to the APM Server. When a trace starts, the Elastic APM agent makes a random decision based on a configured sample rate. If the agent decides to sample the trace, it records and sends all spans associated with it. If it decides not to sample, it records no spans or contextual detail for that trace. This is the cheapest way to reduce data volume because the discarded detail is never collected or sent in the first place.
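The decision described above can be pictured as a single random draw per trace. The following Python snippet is only a sketch of that idea, not the agent’s actual implementation; the helper names head_sample and count_sampled are hypothetical.

```python
import random


def head_sample(sample_rate: float, rng: random.Random) -> bool:
    """Decide at trace start whether the whole trace is recorded.

    Hypothetical helper for illustration; real agents implement this internally.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    # rng.random() returns a float in [0.0, 1.0), so a rate of 0.0 never
    # samples and a rate of 1.0 always samples.
    return rng.random() < sample_rate


def count_sampled(n_traces: int, sample_rate: float, seed: int = 0) -> int:
    """Simulate n_traces independent decisions and count the sampled ones."""
    rng = random.Random(seed)
    return sum(head_sample(sample_rate, rng) for _ in range(n_traces))


if __name__ == "__main__":
    # With a rate of 0.1, roughly one trace in ten is kept on average.
    print(count_sampled(10_000, 0.1))
```

Because each decision is independent, the share of sampled traces only approaches the configured rate over many traces; any particular batch of 100 may contain a few more or fewer than the rate suggests.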
Tail-based sampling, on the other hand, is configured on the APM Server. Here, the APM Server receives the trace data that the agents send, waits for each trace to complete, and then applies sampling policies to decide which traces to keep and which to discard. This allows for more sophisticated sampling strategies because the decision is made with information about the finished trace rather than a guess at its start. For example, you can keep every trace that ended in a failure while keeping only a fraction of the successful ones.
To configure head-based sampling, you set the transaction_sample_rate option in your APM agent’s configuration. For example, to sample 10% of all traces in the Java agent, you’d set transaction_sample_rate=0.1 in elasticapm.properties:
service_name=my-java-app
server_url=http://localhost:8200
transaction_sample_rate=0.1
This means that, on average, about 10 of every 100 traces are recorded in full and sent to the APM Server. Transactions in those traces carry sampled: true; for the remaining traces the agent captures no spans or context.
Tail-based sampling is configured in the APM Server’s configuration file, typically apm-server.yml. You define policies that evaluate incoming traces. A common scenario is to ensure that all error traces are kept. Here’s an example configuration snippet:
apm-server:
  sampling:
    tail:
      enabled: true
      policies:
        # Keep every trace whose outcome is a failure
        - sample_rate: 1.0
          trace.outcome: failure
        # Catch-all policy for traces that match nothing above (50%)
        - sample_rate: 0.5
In this policy list, any trace whose outcome is failure matches the first policy and gets a sample_rate of 1.0, meaning it’s guaranteed to be kept. Policies are evaluated in order, and the list ends with a policy that sets only a sample rate. That final policy acts as the default: traces that match nothing earlier are kept at a rate of 0.5, so about 50% of them are stored. Policies match on trace and service attributes such as trace.name, trace.outcome, service.name, and service.environment; check the APM Server reference for the options available in your version.
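The way an ordered policy list resolves to a sample rate can be sketched in a few lines of Python. This is an illustration of first-match evaluation only, not APM Server code; the Policy class, its field names, and the rule that an empty list drops everything are assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: a sample rate plus optional match conditions."""
    sample_rate: float
    trace_outcome: Optional[str] = None  # None means "match any outcome"
    service_name: Optional[str] = None   # None means "match any service"

    def matches(self, outcome: str, service: str) -> bool:
        # Every condition that is set must hold for the policy to match.
        return (self.trace_outcome in (None, outcome)
                and self.service_name in (None, service))


def rate_for(policies: Sequence[Policy], outcome: str, service: str) -> float:
    """Return the rate of the first matching policy."""
    for policy in policies:
        if policy.matches(outcome, service):
            return policy.sample_rate
    return 0.0  # assumption for this sketch: no matching policy means drop


def keep_trace(policies: Sequence[Policy], outcome: str, service: str,
               rng: random.Random) -> bool:
    """Make the keep/drop decision for one completed trace."""
    return rng.random() < rate_for(policies, outcome, service)


POLICIES = [
    Policy(sample_rate=1.0, trace_outcome="failure"),  # keep all failures
    Policy(sample_rate=0.5),                           # catch-all default
]
```

Order matters in this model: if the catch-all policy came first, it would match every trace and the failure policy would never be reached.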
When tail-based sampling is enabled, the APM Server makes the final decision about which traces are stored: traces it does not select are dropped rather than indexed. If tail-based sampling is not enabled, what is stored reflects the head-based decision made by the agent, as recorded in the sampled field.
A powerful, yet often overlooked, aspect of tail-based sampling is that the decision covers the whole distributed trace once it has completed, not just the information available when the first transaction started. For instance, a request that looked ordinary at its entry point but ended in a failure further downstream can still be kept by a policy that matches on the trace outcome. Note that policies match on trace-level and service-level attributes rather than on the contents of individual spans.
The interaction between head-based and tail-based sampling is crucial: head-based sampling acts as a first line of defense, reducing the load on agents and networks. Tail-based sampling then refines this, ensuring critical data isn’t lost and allowing for more intelligent data retention based on the complete trace context.
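One way to reason about this layering is to treat the two stages as a pipeline, assuming the APM Server can only evaluate traces the agent actually sampled and sent. Under that assumption the share of traces that ends up stored is the product of the two rates. The function name effective_rate is hypothetical.

```python
def effective_rate(head_rate: float, tail_rate: float) -> float:
    """Share of all traces stored when both sampling stages are active.

    Assumes the tail-based stage only sees traces kept by the head-based stage.
    """
    for name, rate in (("head_rate", head_rate), ("tail_rate", tail_rate)):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    return head_rate * tail_rate


if __name__ == "__main__":
    # Head-based 50% followed by tail-based 50% stores 25% of all traces.
    print(effective_rate(0.5, 0.5))  # 0.25
```

The same arithmetic shows why a tail-based policy with a rate of 1.0 still stores only 10% of matching traces when the agent samples at 0.1.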
If you configure tail-based sampling on your APM Server and your agents are also configured with a sample rate less than 1.0, the APM Server can only evaluate the traces the agents chose to sample. Traces that head-based sampling left unsampled have no spans or context recorded, so tail-based policies cannot recover them. If you need a guarantee that important traces, such as every failure, are always captured, set the agents’ head-based sample rate to 1.0 and let the tail-based policies do the filtering, at the cost of more load on the agents, the network, and the APM Server.
The next step after configuring sampling is to understand how to correlate sampled traces with specific events or performance characteristics using trace analytics and Kibana dashboards.