Datadog’s billing is primarily driven by metrics, logs, and traces ingested and retained. To cut costs, you need to understand what you’re sending and how long you’re keeping it.

Here’s how to optimize:

1. Metrics

The Surprising Truth: For custom metrics, Datadog counts every unique combination of metric name and tag values (host included) as a separate time series, and bills on that count. This means service:api,env:prod,host:web-01,metric:cpu.usage is a distinct series from service:api,env:prod,host:web-02,metric:cpu.usage.

System in Action:

Imagine you have a cluster of 100 web servers, each sending CPU metrics.

# Example of a metric being sent (simplified)
# This would be picked up by the Datadog agent
{
  "metric": "system.cpu.user",
  "points": [[1678886400, 12.34]],
  "host": "web-01.example.com",
  "tags": ["env:prod", "service:web"]
}

If your agent configuration is too broad, you might be sending host-specific metrics for every single instance.

Mental Model:

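  • Ingestion: A point is a single data value at a specific timestamp. Custom metrics are generally billed on the number of distinct series reporting rather than on raw point count, so adding a new series costs more than sending an existing series more often.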
  • Cardinality: This is the key cost driver for metrics. High cardinality means a large number of unique metric-tag combinations. The more unique combinations, the more Datadog has to index and store.
  • Retention: You pay for storing metrics over time. Longer retention means higher storage costs.
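To make the cardinality point concrete, here is a minimal Python sketch of how the number of distinct series grows with tags. The tag keys, the value counts, and the request_id tag are hypothetical, chosen only for illustration; the point format loosely mirrors the simplified example above and is not the Agent's wire format.

```python
from math import prod

# Hypothetical tag keys and how many distinct values each one can take.
tag_value_counts = {"host": 100, "env": 1, "service": 1}

def max_series(tag_value_counts: dict[str, int]) -> int:
    """Upper bound on distinct metric-tag combinations for one metric name."""
    return prod(tag_value_counts.values())

def observed_series(points: list[dict]) -> int:
    """Count the distinct (metric, host, tags) combinations actually seen."""
    return len({(p["metric"], p["host"], tuple(sorted(p["tags"]))) for p in points})

baseline = max_series(tag_value_counts)                                # 100 hosts
with_request_id = max_series({**tag_value_counts, "request_id": 10_000})

points = [
    {"metric": "system.cpu.user", "host": "web-01.example.com", "tags": ["env:prod", "service:web"]},
    {"metric": "system.cpu.user", "host": "web-02.example.com", "tags": ["env:prod", "service:web"]},
    # Same series as the first point: tag order does not create a new combination.
    {"metric": "system.cpu.user", "host": "web-01.example.com", "tags": ["service:web", "env:prod"]},
]
distinct = observed_series(points)
```

With these assumed numbers, one metric name across 100 hosts yields up to 100 series, and adding a single tag with 10,000 possible values raises the upper bound to 1,000,000.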

Levers:

  • Metric Filtering: Exclude metrics you don’t need at the agent level.
  • Tagging Strategy: Be mindful of tags that increase cardinality. Avoid dynamic tags that change frequently or are unique per instance if not strictly necessary.
  • Metric Aggregation: If you only need aggregate CPU usage for your web service in production, don’t send it from every single host. Aggregate it at a higher level before sending.
  • Retention Policies: Where your plan allows it, keep high-resolution data for a shorter period and downsampled data for longer.
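The aggregation lever can be sketched as follows. This is an illustrative Python example, not an Agent feature: the sample structure and the aggregate_by_tags helper are assumptions, and it simply averages per-host values into one value per env/service pair before anything is submitted.

```python
from collections import defaultdict
from statistics import mean

def aggregate_by_tags(samples, keep_tags=("env", "service")):
    """Average per-host values into one value per combination of keep_tags."""
    groups = defaultdict(list)
    for sample in samples:
        # The host is deliberately left out of the key, so it no longer adds cardinality.
        key = tuple((tag, sample["tags"][tag]) for tag in keep_tags)
        groups[key].append(sample["value"])
    return {key: mean(values) for key, values in groups.items()}

# Hypothetical per-host CPU readings for one reporting interval.
samples = [
    {"host": "web-01", "tags": {"env": "prod", "service": "web"}, "value": 10.0},
    {"host": "web-02", "tags": {"env": "prod", "service": "web"}, "value": 20.0},
    {"host": "web-03", "tags": {"env": "prod", "service": "web"}, "value": 30.0},
]
aggregated = aggregate_by_tags(samples)
```

Three per-host series collapse into one service-level series. The trade-off is that you can no longer drill down to a single host, so this fits metrics you only ever read in aggregate.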

Diagnosis:

Go to Metrics -> Summary in Datadog. Look at the "Top Metrics by Volume" and "Top Tags by Cardinality." Identify metrics with an explosion of unique tag values.

Fix:

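In your Datadog Agent configuration (e.g., datadog.yaml or a specific check config), exclude the metrics or tags you don't need. The option names used below (exclude_metric_by_tags, metric_groups) are illustrative; filtering options differ by Agent version and by check, so confirm what yours supports before relying on them.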

  • Example: To exclude all system.cpu metrics that have a specific instance_id tag (which might be unique per server):

    # datadog.yaml
    exclude_metric_by_tags:
      - 'system.cpu.*:instance_id:*'
    

    This tells the agent not to send any metric starting with system.cpu. if it has a tag named instance_id with any value.

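  • Example: To tag system metrics consistently, and to limit them to specific hosts or environments by deploying this check config only where it is needed:

    # conf.d/system.d/conf.yaml
    init_config:
    instances:
      -
        # Collect system metrics
        collect_cpu_percent: true
        collect_load_average: true
        # ... other system metrics ...
        # Tags attached to every metric this instance emits
        tags:
          - 'env:production'
          - 'service:api'
    

    An instance-level tags list attaches env:production and service:api to every metric this check emits; it does not filter by tag. To restrict collection to specific hosts or environments, ship this file only to those hosts. The collect_* option names are illustrative, so check the example config bundled with your Agent for the options the check actually accepts.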

Why it works: By filtering metrics at the agent level, you reduce the number of unique metric-tag combinations sent to Datadog, directly lowering your ingestion volume and cardinality count.

2. Logs

The Surprising Truth: Datadog charges for the volume of logs you ingest and, separately, for the logs you index and retain. This means verbose application logs, debug messages, and unstructured data can quickly inflate your bill.

System in Action:

Consider an application that logs every single request it receives, including full request/response bodies, to a file.

2023-10-27 10:00:00 INFO Request received: GET /users/123
{
  "headers": { "Authorization": "Bearer abcdef12345" },
  "body": { "user_id": 123, "details": "..." }
}
2023-10-27 10:00:01 DEBUG Processing user request for ID 123
2023-10-27 10:00:02 INFO Response sent: 200 OK for GET /users/123

If this log file grows by several megabytes per second, and you have dozens of such services, your log volume will skyrocket.
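A quick back-of-the-envelope calculation shows the scale involved. The figures below (2 MB per second per service, 24 services) are assumptions picked to match "several megabytes per second" and "dozens of services"; substitute your own measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def daily_log_volume_gb(mb_per_second: float, services: int) -> float:
    """Daily log volume in GB, using decimal units (1 GB = 1000 MB)."""
    return mb_per_second * services * SECONDS_PER_DAY / 1000

# Hypothetical inputs: 2 MB/s per service, 24 services.
volume_gb = daily_log_volume_gb(mb_per_second=2.0, services=24)
print(f"{volume_gb:,.1f} GB per day")
```

Under these assumptions the services together produce roughly 4,147 GB of logs per day, which is why trimming request and response bodies matters far more than trimming an occasional message.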

Mental Model:

  • Ingestion: You pay for the raw size of log data sent to Datadog.
  • Parsing: Datadog parses logs to extract attributes for searching and visualization. Complex or poorly formatted logs can increase processing overhead.
  • Indexing: Logs are indexed to make them searchable. High volume means more storage and search costs.
  • Retention: Similar to metrics, you pay for storing logs.

Levers:

  • Log Level Control: Reduce the verbosity of your application logs in production environments.
  • Log Processing Rules: Use Datadog’s Log Processing or Pipelines to drop unnecessary fields, re-map attributes, or drop entire logs.
  • Agent Configuration: Configure the Datadog agent to only tail specific log files or to exclude certain log patterns.
  • Retention Policies: Define how long you need to keep logs at different levels of detail.

Diagnosis:

Navigate to Logs -> Search in Datadog. Look at the "Log Volume" graph over time. You can also inspect individual log lines to see their size and the attributes that are being parsed. Check the Logs -> Processing section to understand your current parsing rules and their impact.

Fix:

  1. Application Log Level: In your application’s configuration (e.g., log4j.properties, appsettings.json, logging.yaml), set the production log level to INFO or WARN instead of DEBUG or TRACE.

    • Example (Java Logback):
      <configuration>
        <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
          <encoder>
            <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
          </encoder>
        </appender>
      
        <root level="INFO">
          <appender-ref ref="STDOUT" />
        </root>
      </configuration>
      
      This sets the root logger to INFO, so DEBUG and TRACE messages are not written.
  2. Datadog Agent Log Collection: Configure log collection in the Agent's check config (for example conf.d/myapp.d/conf.yaml) to exclude certain files or patterns.

    • Example: To collect an application's logs but drop DEBUG lines before they leave the host:
      logs:
        - type: file
          path: /var/log/myapp/*.log
          service: myapp
          source: myapp
          log_processing_rules:
            - type: exclude_at_match
              name: drop_debug_logs
              pattern: "DEBUG"
      
      This configuration tells the agent to collect logs from /var/log/myapp/*.log for the myapp service, then apply a processing rule that drops any log line matching the regular expression DEBUG. The pattern is unanchored, so it matches DEBUG anywhere in the line, not only in the level field.
  3. Datadog Log Pipelines: In Datadog, go to Logs -> Configuration -> Pipelines. Create or edit a pipeline to drop specific attributes or entire logs.

    • Example: To drop the request.body attribute from all logs:
      • Create a new pipeline.
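      • Add a processor that removes the attribute (called drop_attribute here; check the processor list in your account for the exact name).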
      • Set the attribute to request.body.
      • Ensure this pipeline is ordered before any other pipeline that might index request.body.
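The effect of the log level change in step 1 can be reproduced in a few lines. This is a minimal sketch using Python's standard logging module rather than Logback; the build_logger helper and the logger name are hypothetical.

```python
import io
import logging

def build_logger(level_name: str, stream) -> logging.Logger:
    """Create an isolated logger that writes to `stream` at the given level."""
    logger = logging.getLogger(f"myapp.{level_name.lower()}")
    logger.setLevel(getattr(logging, level_name))
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.handlers = [handler]
    return logger

buf = io.StringIO()
log = build_logger("INFO", buf)
log.debug("Processing user request for ID 123")  # below INFO, never written
log.info("Response sent: 200 OK")                # written
output = buf.getvalue()
```

Only the INFO line reaches the stream, so the DEBUG message never exists as bytes to be shipped or billed.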

Why it works: By reducing the amount of data generated by your applications and filtering it before or at ingestion, you directly decrease the volume of logs sent to Datadog, saving on ingestion and storage costs.
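One caveat on the agent-side drop rule in step 2: a bare DEBUG pattern matches anywhere in the line. The sketch below uses Python's re module to compare it with a stricter pattern anchored to the level field. The sample lines and the STRICT pattern are assumptions, and the Agent's own regex engine may differ in details, so treat this as a way to reason about the pattern rather than a test of the Agent itself.

```python
import re

LINES = [
    "2023-10-27 10:00:01 DEBUG Processing user request for ID 123",
    "2023-10-27 10:00:02 INFO Response sent: 200 OK for GET /users/123",
    "2023-10-27 10:00:03 ERROR Failed to parse flag DEBUG_MODE",
]

# Unanchored: matches the word DEBUG anywhere in the line.
BROAD = re.compile(r"DEBUG")
# Anchored: timestamp at the start of the line, then DEBUG as the level token.
STRICT = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} DEBUG\b")

def keep(lines, drop_pattern):
    """Return the lines that would survive a drop rule using drop_pattern."""
    return [line for line in lines if not drop_pattern.search(line)]

kept_broad = keep(LINES, BROAD)
kept_strict = keep(LINES, STRICT)
```

The broad pattern also discards the ERROR line because it mentions DEBUG_MODE, while the anchored pattern drops only the genuine DEBUG entry. Anchoring to your actual log format avoids silently losing errors.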

3. Retention

The Surprising Truth: Datadog’s default retention for metrics and logs can be much longer than you actually need for daily operations or compliance. You pay for every day that data sits in their hot storage.

Mental Model:

  • Hot Storage: Data readily available for querying and visualization. This is the most expensive tier.
  • Cold Storage: Archived data, less accessible, and cheaper. Datadog offers different retention tiers for metrics and logs, with increasing costs for longer hot storage.
  • Compliance Requirements: Understand your regulatory needs for data retention.

Levers:

  • Metrics Retention: Adjust how long raw, high-resolution metrics are stored.
  • Logs Retention: Adjust how long logs are stored in hot storage.
  • Rehydration: If you need older data, you might have to pay to "rehydrate" it from cold storage, which can be slow and costly.

Diagnosis:

  • Metrics: Go to Metrics -> Summary. Under "Metric Retention," you’ll see the current retention periods.
  • Logs: Go to Logs -> Configuration -> Retention. This page shows your current log retention settings.

Fix:

  1. Metrics Retention:

    • Contact Datadog support to request changes to your metric retention. Datadog’s standard hot retention for metrics is often 15 months. You can negotiate shorter periods for specific metric types or for your entire account if 15 months is more than you need.
    • Example Request to Support: "We would like to reduce our standard metric retention from 15 months to 3 months for non-critical metrics, and 12 months for critical compliance metrics."
  2. Logs Retention:

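    • Go to Logs -> Configuration -> Retention. If your account sets retention per index, open the index settings instead.
    • Adjust the slider or input field for "Log Retention." The periods on offer depend on your contract.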
    • Example: If your compliance requirement is 30 days, set the retention to "30 days."

Why it works: By reducing the duration for which data is kept in Datadog’s expensive hot storage, you directly reduce the overall storage costs associated with your account. Ensure you have a strategy for long-term archival or understand the implications of reduced hot retention.
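As a rough illustration of why retention is such a strong lever, the sketch below computes how much data sits in hot storage once the retention window is full. The 100 GB daily volume is hypothetical, and actual pricing follows your contract's retention tiers rather than a simple linear formula.

```python
def retained_volume_gb(daily_gb: float, retention_days: int) -> float:
    """Volume held in hot storage once the retention window has filled up."""
    return daily_gb * retention_days

daily_gb = 100.0  # hypothetical daily volume
by_retention = {days: retained_volume_gb(daily_gb, days) for days in (7, 15, 30)}

# Fraction of retained volume removed by going from 30 days to 7 days.
reduction = 1 - by_retention[7] / by_retention[30]
print(by_retention, f"{reduction:.0%}")
```

With these assumed numbers, moving from 30 days to 7 days cuts the volume held in hot storage by about 77 percent without touching what you ingest.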

The next error you’ll hit is a Rate Limiting error in your agent logs if you’re still sending too much data, or a Data not found error if you’ve reduced retention too aggressively.
