DDSketch isn’t just another way to store percentiles; it’s a probabilistic data structure that makes P99 metrics more accurate at scale, not less.
Let’s see it in action. Imagine we’re tracking the latency of a critical API endpoint. Without DDSketch, we’d typically collect all latencies and then calculate the P99. But if we have millions of requests, storing and processing all those individual latencies becomes a massive undertaking, both in terms of memory and CPU.
Here’s a simplified view of how DDSketch works without showing you the complex math:
- Bucketing: DDSketch divides the range of possible values (latencies in our case) into logarithmically spaced buckets. Think of it like a histogram, but the bins get wider as the values get larger.
- Compression: As latencies arrive, they are assigned to the appropriate bucket. Instead of storing each individual latency, DDSketch stores a count of how many values fell into each bucket.
- Estimation: When you ask for a P99, DDSketch uses these bucket counts to estimate the 99th percentile. Because the buckets are logarithmically spaced, it can maintain good precision in the high-value buckets where the P99 is likely to fall, while still being memory-efficient for the vast number of lower-value requests.
This is what a DDSketch might look like internally (conceptually, not actual output):
Bucket 0-1ms: 1,500,000 requests
Bucket 1-5ms: 300,000 requests
Bucket 5-10ms: 50,000 requests
Bucket 10-20ms: 10,000 requests
Bucket 20-50ms: 2,000 requests
Bucket 50-100ms: 500 requests
Bucket 100-250ms: 100 requests
Bucket 250-500ms: 20 requests
Bucket 500-1000ms: 5 requests
Bucket 1000-2500ms: 1 request
... and so on
When you query for P99, DDSketch looks at the total number of requests and determines which bucket contains the 99th percentile. It then provides an estimated value within that bucket. The key is that the relative error of this estimate is bounded, meaning the accuracy improves as the number of data points increases, precisely when traditional methods start to choke.
The problem DDSketch solves is the "curse of dimensionality" for metrics. As your system scales, the sheer volume of data makes exact calculations of high percentiles (like P99 or P99.9) computationally infeasible. Storing every single latency value requires enormous memory, and processing them to find the exact P99 becomes a bottleneck. DDSketch offers a trade-off: a tiny, controllable margin of error in exchange for dramatically reduced resource consumption and the ability to calculate these critical metrics even under extreme load.
The core idea is that to know the P99, you don’t need to know the exact value of every single request. You only need to know, roughly, how many requests fell into certain ranges, especially the high-latency ranges. DDSketch’s logarithmic bucketing is designed to provide just enough information in those critical high-value buckets to give you a very good estimate of the P99 without needing to store every single data point.
In Datadog, this means your P99 graphs remain responsive and accurate even as your application handles millions of requests per minute. Instead of seeing "data unavailable" or wildly fluctuating P99s due to sampling or aggregation limitations, you get a consistent, reliable view of your system’s tail latency.
The most subtle aspect of DDSketch’s accuracy at scale comes from its guaranteed relative error bound. This isn’t just a heuristic; the algorithm is mathematically designed such that the estimated percentile (e.g., P99) will be within a certain percentage of the true percentile, and this bound tightens as more data is added. This means that at high volume, DDSketch often provides a more reliable P99 than trying to compute it exactly from a massive, unmanageable dataset.
The next step is understanding how to tune the accuracy of DDSketch itself, which involves adjusting its compression parameters.