Datadog’s default metrics retention is designed for maximum visibility, but it can become a significant cost center if not managed.

Let’s see how Datadog handles metrics and how we can tune it.

Imagine you’re sending metrics from a fleet of 100 microservices, each emitting 50 custom metrics every minute. That’s 5,000 data points per minute, or 300,000 per hour. Over a day, that’s 7.2 million data points. Datadog stores these at different granularities, and the cost scales with volume and retention period.
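A quick way to sanity-check a volume estimate like this is to compute it. The sketch below is illustrative only: the function name is made up for this example, and it assumes each metric emits exactly one data point per interval and ignores the extra series created by tag combinations.

```python
def data_points_per_day(services: int, metrics_per_service: int,
                        interval_seconds: int = 60) -> int:
    """Data points emitted per day, assuming one point per metric per interval."""
    points_per_interval = services * metrics_per_service
    intervals_per_day = 24 * 60 * 60 // interval_seconds
    return points_per_interval * intervals_per_day


# 100 services x 50 metrics, one point per minute
print(data_points_per_day(100, 50))  # 7200000
```

Shortening the interval to 10 seconds multiplies the daily volume by six, which is why emission frequency matters as much as the number of metrics.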

Here’s a typical flow:

  1. Collection: Your applications or infrastructure agents (e.g., Datadog Agent) send metrics via the Datadog API.
  2. Ingestion: Datadog receives, processes, and indexes these metrics.
  3. Aggregation & Rollup: Metrics are aggregated into different time granularities (e.g., 1-minute, 5-minute, 1-hour averages). This is key to retention.
  4. Storage: Aggregated metrics are stored in time-series databases.
  5. Querying: When you view a graph or run a query, Datadog retrieves the relevant data from these databases.
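To make the collection step concrete: one common path is DogStatsD, where the application sends small plain-text datagrams to the local Agent. The sketch below only formats such a datagram (`name:value|type|#tags`) and does not open a socket. The helper name `format_dogstatsd` and the tag values are hypothetical, chosen for this example.

```python
from typing import Dict, Optional

# Type codes used by the DogStatsD text format:
# c=count, g=gauge, ms=timer, h=histogram, s=set, d=distribution
VALID_TYPES = {"c", "g", "ms", "h", "s", "d"}


def format_dogstatsd(name: str, value: float, metric_type: str,
                     tags: Optional[Dict[str, str]] = None) -> str:
    """Format one metric as a DogStatsD datagram: name:value|type|#k:v,k:v"""
    if metric_type not in VALID_TYPES:
        raise ValueError(f"unknown metric type: {metric_type!r}")
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        # Sort tags so the output is deterministic.
        tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
        datagram += f"|#{tag_str}"
    return datagram


print(format_dogstatsd("myapp.custom.events", 1, "c",
                       {"env": "prod", "service": "checkout"}))
# myapp.custom.events:1|c|#env:prod,service:checkout
```

Every distinct tag combination becomes its own time series, so the tags attached at this step directly affect how much data the later stages have to store.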

The core concept is roll-up. Datadog doesn’t store every single raw data point indefinitely. It aggregates data into coarser granularities over time. This means that for older data, you’re looking at averages over longer periods, not the original second-by-second (or millisecond-by-millisecond) values.
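Here is a minimal sketch of what a roll-up does, using plain averaging into fixed time buckets. It illustrates the idea only and is not Datadog’s actual implementation; the function name `rollup_avg` and the sample data are invented for this example.

```python
from collections import defaultdict
from statistics import mean


def rollup_avg(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - (ts % bucket_seconds)  # align to bucket boundary
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Two hours of one-minute points: flat at 10.0 with one spike to 500.0.
points = [(minute * 60, 10.0) for minute in range(120)]
points[30] = (30 * 60, 500.0)

hourly = rollup_avg(points, 3600)
print(len(points), "->", len(hourly))  # 120 -> 2
print(hourly)
```

The 500.0 spike is still visible in the one-minute data, but after the hourly roll-up the first bucket averages out to roughly 18.2. That is the fidelity trade-off in practice: far fewer points to store, at the cost of losing short-lived peaks.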

What problem does this solve? It allows you to retain a long history of your system’s performance without incurring astronomical storage costs. You can see trends over months or years, but the fidelity of that historical data is lower. For short-term debugging, you need high fidelity (e.g., 1-minute or 15-second data). For long-term trend analysis, coarser data (e.g., hourly averages) is sufficient.

How does it work internally? Datadog uses a tiered storage approach.

  • Hot Storage: For recent data (typically 15 months for standard metrics), data is stored at its highest resolution. This allows for fast querying and detailed analysis.
  • Cold Storage: For older data, metrics are aggregated into coarser granularities (e.g., hourly averages) and stored more cost-effectively. This tier is for historical trend analysis, not for detailed debugging.

What are the exact levers you control? While you don’t directly control the specific roll-up intervals (Datadog manages that internally), you control the overall retention period for different metric types. This is done through the Datadog UI or API by setting metric_volume_per_day and retention_days for custom metrics. The key is understanding that the cost isn’t just about how many days you keep data, but at what granularity you keep it.

The most surprising thing is that Datadog’s default retention policy isn’t a single number but a tiered system. Standard metrics get 15 months of "hot" (high-resolution) data and then an additional 13 months of "cold" (aggregated) data, totaling 28 months. Custom metrics, which are often the most expensive due to their volume, have configurable retention instead.

Let’s say you have a custom metric like myapp.custom.events that you want to keep detailed for 3 months but have a general overview for 2 years.

  1. Identify Custom Metrics: Go to Metrics -> Metric Explorer. Filter by metric_type:custom or look for metrics you know are custom.
  2. Check Current Retention: For a specific custom metric, say myapp.custom.events, go to its detail page. You’ll see its current retention settings.
  3. Configure Retention (UI Example):
    • Navigate to Metrics -> Metric Management.
    • Find your custom metric (e.g., myapp.custom.events).
    • Click on the metric name to open its configuration.
    • Under "Retention," you’ll see options. You can set the "Retention for custom metrics" to 3 months (for high resolution) and an "Extended retention" for 2 years (for aggregated data).
    • The system will ask you to confirm the estimated cost change.
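  4. Configure Retention (API Example - using metrics endpoint): To update a metric named myapp.custom.events to have 3 months of high-resolution data and 2 years of extended retention, send a PUT request like the one below. Check the endpoint path and the retention field names against the current Datadog API reference for your account before running it: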
    curl -X PUT \
      -H "DD-API-KEY: YOUR_API_KEY" \
      -H "DD-APPLICATION-KEY: YOUR_APP_KEY" \
      -H "Content-Type: application/json" \
      https://api.datadoghq.com/api/v1/metrics/custom_metrics/myapp.custom.events \
      -d '{
            "metric_name": "myapp.custom.events",
            "description": "Custom event count",
            "type": "count",
            "unit": "1",
            "short_name": "custom_ev",
            "retention_days": 90,
            "extended_retention_days": 730
          }'
    
    • retention_days: 90 sets the high-resolution data to 90 days (3 months).
    • extended_retention_days: 730 sets the aggregated data to 730 days (2 years).
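If you would rather script this than run curl by hand, the same request can be assembled in Python. This is a sketch under stated assumptions: the endpoint path and the `retention_days` / `extended_retention_days` fields are copied from the curl example above rather than verified against Datadog’s API reference, the function name is made up, and the application key header is included on the assumption that configuration endpoints require one. The code builds the request but deliberately does not send it.

```python
import json
import urllib.request


def build_retention_request(metric_name: str, retention_days: int,
                            extended_retention_days: int,
                            api_key: str, app_key: str,
                            site: str = "datadoghq.com") -> urllib.request.Request:
    """Build (but do not send) a PUT request mirroring the curl example."""
    if extended_retention_days < retention_days:
        raise ValueError("extended retention must be at least the high-resolution retention")
    # ASSUMPTION: path and field names come from the curl example, not from verified docs.
    url = f"https://api.{site}/api/v1/metrics/custom_metrics/{metric_name}"
    body = {
        "metric_name": metric_name,
        "retention_days": retention_days,
        "extended_retention_days": extended_retention_days,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "DD-API-KEY": api_key,
            "DD-APPLICATION-KEY": app_key,
            "Content-Type": "application/json",
        },
    )


req = build_retention_request("myapp.custom.events", 90, 730,
                              "YOUR_API_KEY", "YOUR_APP_KEY")
print(req.get_method(), req.full_url)
```

Once the endpoint and fields are confirmed, passing `req` to `urllib.request.urlopen` would send it. Keeping the build step separate makes it easy to review the payload or loop over a list of metrics first.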

This configuration balances the need for detailed debugging data for recent events with the long-term historical trends required for capacity planning and business intelligence, all while managing your Datadog bill.
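To get a feel for how much the tiered setup saves, here is a rough comparison of stored data points per time series. It assumes one-minute resolution in the high-resolution tier, hourly roll-ups afterwards, and that the 730 days is the total age of the data (including the first 90 days). Those intervals and the function name are assumptions made for this sketch, and it does not model Datadog’s actual billing.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def stored_points_per_series(hot_days: int, total_days: int,
                             hot_interval_s: int = 60,
                             cold_interval_s: int = 3600) -> int:
    """Points kept for one series: high resolution first, rolled up afterwards."""
    if total_days < hot_days:
        raise ValueError("total_days must be >= hot_days")
    hot_points = hot_days * SECONDS_PER_DAY // hot_interval_s
    cold_points = (total_days - hot_days) * SECONDS_PER_DAY // cold_interval_s
    return hot_points + cold_points


tiered = stored_points_per_series(90, 730)   # 90 days detailed, rest hourly
flat = stored_points_per_series(730, 730)    # everything kept at one-minute resolution
print(tiered, flat, round(flat / tiered, 2))  # 144960 1051200 7.25
```

Under these assumptions, keeping two full years at one-minute resolution stores about seven times as many points per series as the 90-day / 2-year split.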

The next step after optimizing retention is often exploring how to reduce the volume of metrics being sent in the first place.
