Datadog’s trace agent is dropping trace chunks because its internal queue is full, preventing the agent from sending valuable performance data to the Datadog backend.

This usually happens when the agent is overwhelmed by the volume of traces it’s receiving, or when its connection to the Datadog intake servers is degraded.

Cause 1: Agent Overload - Too Many Traces

The most common culprit is simply too many traces being generated by your application. This can be a sudden spike or a sustained high volume.

Diagnosis: Check the Datadog agent’s own metrics. On the agent host, run:

sudo datadog-agent status

Look for trace.agent.queue_size and trace.agent.traces_dropped. If queue_size stays high and traces_dropped keeps increasing between runs, the agent is receiving traces faster than it can process and forward them.
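To avoid scanning the whole status output by eye, the trace-related portion can be filtered out. The following is a minimal sketch, assuming the status output frames each section title with lines of = characters and that the trace agent section is titled "APM Agent". The helper name apm_section is made up for illustration, so adjust the title if your Agent version prints something different.

```shell
# apm_section: read `datadog-agent status` output on stdin and print only
# the APM Agent section (hypothetical helper; the section layout is an assumption).
apm_section() {
  awk '
    /^APM Agent$/ { in_apm = 1; underline = 1; print; next }  # section title
    in_apm && /^=+$/ {
      if (underline) { underline = 0; next }  # rule directly under the title
      exit                                    # rule that opens the next section
    }
    in_apm { print }
  '
}

# Usage (on the agent host):
#   sudo datadog-agent status | apm_section
```

Running this twice a minute or so apart makes it easier to see whether the drop counter is still climbing.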

Fix:

  1. Increase Agent Resources: If running the agent in a container, allocate more CPU and memory. For example, in Kubernetes, adjust resources.requests and resources.limits on the Datadog agent container in the pod spec.

    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "1"
        memory: "2Gi"
    

    This gives the agent more raw processing power and memory to handle the load.

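  2. Tune Trace Agent Configuration: In datadog.yaml (or the equivalent for your deployment), trace agent settings live under the apm_config section, where you can adjust max_traces_per_second and max_processing_threads. Confirm both key names and their defaults against the datadog.yaml.example shipped with your Agent version before changing them.

    apm_config:
      max_traces_per_second: 10000 # Example value; check the default for your Agent version
      max_processing_threads: 8    # Example value; check the default for your Agent version

    Increasing max_traces_per_second lets the agent accept more traces before its queue fills up. Increasing max_processing_threads lets it process incoming traces faster. Do not set either value too high, or the agent can starve your application or its own other functions of CPU and memory.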

Why it works: More resources or a higher ingestion rate directly combats the "too many traces" problem by allowing the agent to handle more data.

Cause 2: Network Issues to Datadog Intake

The agent might be healthy, but it can’t send data out. This could be due to network connectivity problems, firewall rules, or proxy misconfigurations.

Diagnosis: From the agent host, try to reach the Datadog intake endpoint. For example, for US1:

curl -v https://trace.agent.datadoghq.com

Look for connection timeouts, SSL handshake failures, or HTTP error codes (like 4xx or 5xx).
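For a scriptable version of this check, curl can report only the HTTP status code, which a small function can turn into a rough verdict. This is a sketch under stated assumptions: classify_http_code is a hypothetical helper, and the verdict names are arbitrary. Note that a 4xx on a bare, unauthenticated GET may simply mean the endpoint rejected the request, which still shows the network path is open.

```shell
# classify_http_code: map the status code reported by curl to a rough verdict.
# curl prints 000 for %{http_code} when it received no HTTP response at all.
classify_http_code() {
  case "$1" in
    000)     echo "unreachable" ;;   # DNS, TCP, TLS, or timeout failure
    5??)     echo "server_error" ;;  # endpoint reached, but it reported a failure
    4??)     echo "client_error" ;;  # endpoint reached, request was rejected
    [123]??) echo "reachable" ;;
    *)       echo "unknown" ;;
  esac
}

# Usage (on the agent host; --max-time keeps a dead route from hanging):
#   code="$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' https://trace.agent.datadoghq.com)"
#   classify_http_code "$code"
```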

Also, check the agent logs for network-related errors:

sudo tail -f /var/log/datadog/trace-agent.log

Look for messages indicating connection refused, timeouts, or SSL errors.
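Instead of watching the log scroll by, you can count how many recent lines look network-related. The sketch below uses a hypothetical helper, count_network_errors, and its patterns are guesses based on common error wording rather than an official list of trace agent messages, so extend them to match what your logs actually contain.

```shell
# count_network_errors: count log lines on stdin that look network-related.
# The patterns are assumptions, not an official list of agent messages.
count_network_errors() {
  # grep -c exits non-zero when nothing matches; "|| true" keeps the
  # function usable in scripts that run with "set -e".
  grep -ciE 'connection refused|timeout|timed out|deadline exceeded|tls|x509|certificate' || true
}

# Usage (on the agent host):
#   sudo tail -n 1000 /var/log/datadog/trace-agent.log | count_network_errors
```

A count that grows between runs points toward the network path rather than agent overload.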

Fix:

  1. Verify Network Connectivity: Ensure the agent host can reach *.datadoghq.com (or your specific Datadog domain) on port 443. Check firewall rules, security groups, and routing tables.
  2. Configure Proxy: If your network requires a proxy, ensure it’s correctly configured in datadog.yaml:
    proxy:
      http: http://your_proxy_address:port
      https: https://your_proxy_address:port
    
    Ensure the proxy allows outbound connections to Datadog.

Why it works: Restoring or enabling proper network paths allows the agent to successfully transmit its buffered trace data to Datadog.

Cause 3: Degraded Datadog Backend Performance

Less common, but possible, is that the Datadog backend itself is experiencing issues, leading to slower ingestion and causing the agent’s queues to back up.

Diagnosis: Check the Datadog Status page for any ongoing incidents affecting trace ingestion or the entire platform. Also, examine the Datadog agent logs for repeated errors like:

"POST /v0.1/traces failed with 503 Service Unavailable"

or similar HTTP error codes that indicate backend problems.
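To see whether such errors are isolated or sustained, the log can be tallied by status code. This is a rough sketch that assumes the log lines contain the wording "failed with <status>" as in the example above; tally_5xx is a hypothetical helper, and the pattern needs adjusting if your Agent version words the message differently.

```shell
# tally_5xx: read log lines on stdin and print "<status> <count>" for each
# 5xx code that follows the words "failed with", sorted by status code.
tally_5xx() {
  grep -oE 'failed with 5[0-9]{2}' \
    | awk '{ count[$3]++ } END { for (code in count) print code, count[code] }' \
    | sort
}

# Usage (on the agent host):
#   sudo tail -n 5000 /var/log/datadog/trace-agent.log | tally_5xx
```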

Fix: This is usually outside your direct control.

  1. Monitor Datadog Status: Keep an eye on the official Datadog Status page.
  2. Retry Mechanism: The Datadog agent has a built-in retry mechanism. Make sure it has not been disabled. The default trace.max_concurrent_requests and trace.retry_delay settings are usually sufficient.
  3. Contact Datadog Support: If you suspect a backend issue and the status page shows no incidents, open a ticket with Datadog support.

Why it works: The agent’s retry logic will eventually succeed when the backend recovers, and by contacting support, you help Datadog identify and resolve the issue.

Cause 4: Agent Configuration Errors

Misconfigurations in the datadog.yaml file, especially related to tracing settings, can lead to unexpected behavior.

Diagnosis: Carefully review your datadog.yaml for any custom trace-related settings. Common mistakes include incorrect values for max_traces_per_second or max_processing_threads, or a log_throttling setting that is too aggressive.

Fix:

  1. Reset to Defaults: Temporarily revert custom trace settings in datadog.yaml to their default values.
    # Example: Temporarily comment out or reset these
    # (trace agent settings belong under apm_config in datadog.yaml)
    # apm_config:
    #   max_traces_per_second: 5000
    #   max_processing_threads: 4
    
  2. Restart Agent: After changing datadog.yaml, restart the agent service. On a systemd-based Linux host:
    sudo systemctl restart datadog-agent
    
  3. Gradual Tuning: If resetting to defaults resolves the issue, reintroduce your custom settings one by one, monitoring the agent status after each change.

Why it works: Incorrect configurations can cause the agent to mismanage its resources or processing, leading to queue buildup and dropped traces. Returning to known good settings helps isolate the problem.
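Before resetting anything, it helps to list which of the settings named above are actually active in the file. A minimal sketch follows: list_trace_overrides is a hypothetical helper, and the key list covers only the settings mentioned in this section.

```shell
# list_trace_overrides: print active (uncommented) lines in a config file that
# set one of the trace settings named in this article, prefixed by line number.
list_trace_overrides() {
  # Lines starting with "#" do not match, so commented-out settings are skipped.
  grep -nE '^[[:space:]]*(max_traces_per_second|max_processing_threads|log_throttling)[[:space:]]*:' "$1" || true
}

# Usage (default config path on Linux):
#   list_trace_overrides /etc/datadog-agent/datadog.yaml
```

Keeping a copy of the file before each change makes it straightforward to reintroduce settings one at a time.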

Cause 5: High CPU/Memory Utilization on Agent Host

If the host running the Datadog agent is starved for CPU or memory, the agent processes will be slow or unresponsive, leading to queue buildup.

Diagnosis: Use standard system monitoring tools to check the CPU and memory usage of the host. On Linux:

top -o %CPU
top -o %MEM

Look for processes consuming excessive resources, or overall high utilization (consistently above 80-90%).
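Since top is interactive, a non-interactive check is handier in scripts or over SSH. The sketch below computes memory usage from /proc/meminfo as (MemTotal - MemAvailable) / MemTotal and compares it with a threshold. Both function names are hypothetical, the default of 80 is taken from the lower end of the range above, and MemAvailable is assumed to be present (it is on reasonably recent Linux kernels).

```shell
# mem_used_percent: read /proc/meminfo-style text on stdin and print the
# percentage of memory in use, truncated to a whole number.
mem_used_percent() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0) printf "%d\n", (total - avail) * 100 / total
      else exit 1
    }
  '
}

# is_memory_tight: succeed when usage ($1) is at or above a threshold ($2, default 80).
is_memory_tight() {
  [ "$1" -ge "${2:-80}" ]
}

# Usage:
#   pct="$(mem_used_percent < /proc/meminfo)"
#   is_memory_tight "$pct" && echo "memory usage is ${pct}%"
#   ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head -n 6   # top CPU consumers
```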

Fix:

  1. Increase Host Resources: If the host is undersized, upgrade its CPU or RAM.
  2. Reduce Host Load: Identify and optimize or reduce the resource consumption of other applications running on the same host.
  3. Dedicated Agent Host: If possible, run the Datadog agent on a dedicated host or node to prevent contention with application workloads.

Why it works: Providing sufficient system resources allows the Datadog agent’s processes to run efficiently and process traces without being throttled by the operating system.

Cause 6: Outdated Agent Version

Older versions of the Datadog agent might have bugs or less efficient queuing mechanisms that are more prone to this issue.

Diagnosis: Check the currently installed Datadog agent version:

sudo datadog-agent version

Compare this to the latest stable version available from Datadog’s documentation.
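The comparison can be scripted. This sketch assumes the version command prints a dotted X.Y.Z version somewhere in its first line and relies on sort -V from GNU coreutils. The helper names are hypothetical, and the "latest" value in the usage example is a placeholder you would replace with the version listed in Datadog's documentation.

```shell
# extract_version: print the first X.Y.Z token found on stdin.
extract_version() {
  grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# version_lt: succeed when version $1 is strictly older than version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_lt() {
  [ "$1" != "$2" ] && [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Usage (LATEST is a placeholder, not a real release number):
#   installed="$(sudo datadog-agent version | extract_version)"
#   version_lt "$installed" "$LATEST" && echo "upgrade available: $installed -> $LATEST"
```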

Fix: Upgrade the Datadog agent to the latest stable version following Datadog’s upgrade guide for your specific operating system or orchestration platform.

Why it works: Newer versions often contain performance improvements and bug fixes that directly address issues like queue management and resource utilization.

After resolving the full-queue issue, you might still see "error: failed to send metrics: context deadline exceeded" if network latency remains a problem or if the Datadog backend is temporarily slow to respond.
