Your Datadog Agent is hogging CPU and memory, making your monitoring tool itself a performance problem. This usually means a specific check or integration is misbehaving, overwhelming the Agent’s processing or memory allocation.

Common Causes & Fixes

  1. Too Many Checks Enabled/Low Collection Interval:

    • Diagnosis: Check the Agent log (/var/log/datadog/agent.log on Linux) for errors or checks that run long, and run datadog-agent status to see how long each check takes to execute. Also, examine the Agent’s configuration directory (/etc/datadog-agent/conf.d/) for an excessive number of enabled check configurations or very low min_collection_interval values in the individual check configuration files.
    • Fix:
      • Disable Unnecessary Checks: For checks you don’t need, either remove their .yaml configuration file from /etc/datadog-agent/conf.d/ or rename it so it no longer ends in .yaml (for example, conf.yaml.disabled), then restart the Agent.
      • Increase Collection Interval: For essential checks, increase their min_collection_interval, which is set per instance in the check’s own configuration file. For example, to collect a check every 60 seconds instead of the default 15:
        # In the check's conf.d/<check>.d/conf.yaml file
        instances:
          - min_collection_interval: 60
        
    • Why it works: Each check consumes CPU and memory. Reducing the number of active checks or the frequency at which they run directly lowers the Agent’s resource footprint.
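
To get a quick count of how many check configurations are active, a small script can walk the configuration directory. This is a minimal sketch that assumes the default Linux path and treats any file ending in .yaml or .yml as active; the function name is hypothetical, and the count will not include the conf.yaml.default files that ship with the Agent's built-in checks.

```python
from pathlib import Path


def enabled_check_configs(conf_dir):
    """Return sorted paths of files under conf_dir that end in .yaml or .yml.

    Assumption: files with other suffixes (conf.yaml.example,
    conf.yaml.disabled, ...) are treated as inactive.
    """
    root = Path(conf_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in {".yaml", ".yml"}
    )


if __name__ == "__main__":
    # Default Linux location; adjust for your platform.
    configs = enabled_check_configs("/etc/datadog-agent/conf.d")
    for path in configs:
        print(path)
    print(f"{len(configs)} active check configuration file(s)")
```

Treat the result as a rough indicator only; datadog-agent status remains the authoritative view of what is actually running.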
  2. Large Log Files Being Parsed/Aggregated:

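    • Diagnosis: Look in the Agent log (/var/log/datadog/agent.log on Linux) for log-collection errors, and run datadog-agent status to review the Logs Agent section, which lists the files being tailed. Check whether logs_enabled: true is set in datadog.yaml and review the logs entries and any log_processing_rules in your conf.d/ configuration files.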
    • Fix:
      • Limit Log Files: Reduce the number of log files being tailed by using specific paths. In a conf.d/<name>.d/conf.yaml file:
        logs:
          - type: file
            path: /var/log/myapp/*.log # Be specific!
            service: myapp
            source: myapp
        
      • Add Exclusion Rules: Exclude noisy or irrelevant log files with exclude_paths on the file entry.
        logs:
          - type: file
            path: /var/log/myapp/*.log
            service: myapp
            source: myapp
            exclude_paths:
              - /var/log/myapp/debug.log # Exclude noisy debug logs
        
      • Cap the Number of Tailed Files: If wildcard paths match many files, limit how many the Agent tails at once. In datadog.yaml:
        logs_enabled: true
        logs_config:
          open_files_limit: 100 # maximum number of files tailed simultaneously
        
    • Why it works: The Agent’s log collection and processing is CPU and memory intensive. By limiting the volume or complexity of logs it processes, you reduce this load.
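
Before narrowing paths, it helps to know which files under a wildcard are actually large. The following is a minimal sketch, not part of the Agent: the function name is hypothetical and the /var/log/myapp/*.log pattern is simply the example path used above.

```python
import glob
import os


def largest_files(pattern, top=5):
    """Return up to `top` (size_bytes, path) tuples for files matching a glob,
    largest first."""
    sizes = [
        (os.path.getsize(path), path)
        for path in glob.glob(pattern)
        if os.path.isfile(path)
    ]
    return sorted(sizes, reverse=True)[:top]


if __name__ == "__main__":
    # Example pattern from this section; replace with your own log path.
    for size, path in largest_files("/var/log/myapp/*.log"):
        print(f"{size / 1_000_000:10.1f} MB  {path}")
```

Files that dominate this list are good candidates for a narrower path or an exclusion.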
  3. High Cardinality Tags:

    • Diagnosis: Examine the "Tags" section of metrics in Datadog. If you see a massive number of unique tag values for a single metric (e.g., user_id:12345, user_id:12346, etc.), this is high cardinality. On the Agent side, check the Aggregator and DogStatsD sections of datadog-agent status and look in /var/log/datadog/agent.log for related warnings.
    • Fix: Implement tag reduction strategies.
      • Use ddtrace or APM: For application-specific identifiers, leverage distributed tracing and APM, which are designed for this and often have less overhead than raw metric tags.
      • Tagging Policies: Define strict tagging policies. Avoid dynamic or high-cardinality tags like request_id or session_id on metrics.
      • Agent-Level Tagging: Use checks_tag_cardinality and dogstatsd_tag_cardinality in datadog.yaml to control how many container-level tags the Agent attaches. Both accept low, orchestrator, or high.
        # Example: keep container tag cardinality low
        checks_tag_cardinality: low
        dogstatsd_tag_cardinality: low
        
    • Why it works: Each unique tag combination creates a distinct time-series. High cardinality leads to an explosion of time-series, consuming significant Agent memory and CPU for tracking and aggregation.
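
The effect of a per-user tag is easy to see by counting distinct tag combinations per metric. This is an illustrative sketch only: the metric names and tag values are made up, and the function is not an Agent API.

```python
from collections import defaultdict


def series_per_metric(points):
    """Count distinct tag combinations (time series) per metric name.

    `points` is an iterable of (metric_name, tags) pairs, where tags is a
    list of "key:value" strings. Tag order does not matter.
    """
    combos = defaultdict(set)
    for name, tags in points:
        combos[name].add(frozenset(tags))
    return {name: len(tagsets) for name, tagsets in combos.items()}


# Hypothetical sample data for illustration.
points = [
    ("checkout.latency", ["env:prod", "user_id:12345"]),
    ("checkout.latency", ["env:prod", "user_id:12346"]),
    ("checkout.latency", ["user_id:12345", "env:prod"]),  # same series as the first
    ("checkout.errors", ["env:prod"]),
]
print(series_per_metric(points))
# {'checkout.latency': 2, 'checkout.errors': 1}
```

With a user_id tag, the series count for a metric grows with the number of users; dropping that tag collapses it back to one series per remaining tag combination.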
  4. Container Autodiscovery Issues:

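    • Diagnosis: If running in a containerized environment (Docker, Kubernetes), check /var/log/datadog/agent.log for errors related to discovering containers or applying configurations, especially repeated failures or retries against the Docker or Kubernetes API. Run datadog-agent configcheck to see which Autodiscovery configurations were loaded and resolved.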
    • Fix:
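      • Simplify Autodiscovery Config: Reduce the complexity of your AD templates. Avoid overly broad ad_identifiers that match more containers than you intend to monitor.
      • Limit Discovery Scope: If possible, limit the scope of autodiscovery (e.g., by namespace in Kubernetes, or by image or container name in Docker). The container_include and container_exclude settings in datadog.yaml accept name:, image:, and kube_namespace: filters for this.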
      • Check Docker/Kubernetes API Rate Limits: Ensure the Agent isn’t hitting API rate limits on your container orchestrator, which can cause retries and increased CPU.
    • Why it works: The Agent constantly queries the container orchestrator for metadata to apply configurations. Inefficient discovery or excessive polling can become a major resource drain.
  5. Large Number of Enabled Integrations/Checks:

    • Diagnosis: Run datadog-agent status and review the Collector section, or run datadog-agent configcheck, to see all configured checks. Count them. If the list is hundreds long, this is a likely culprit.
    • Fix:
      • Disable Unused Integrations: For integrations you’ve installed but aren’t actively using, disable them. This is done by removing or disabling their .yaml files in /etc/datadog-agent/conf.d/ as described in cause #1.
      • Consolidate Checks: Where possible, use checks that can gather multiple related metrics instead of many single-metric checks.
    • Why it works: Each enabled integration, even if not actively collecting data, consumes some baseline memory and CPU for initialization and periodic checks.
  6. Agent Configuration Loading Issues:

    • Diagnosis: Examine /var/log/datadog/agent.log for repeated startup messages or errors during configuration parsing.
    • Fix: Ensure your datadog.yaml and any custom check .yaml files are syntactically correct. Malformed YAML can cause the Agent to repeatedly try and fail to load its configuration.
      # Example: Validate YAML syntax
      yamllint /etc/datadog-agent/datadog.yaml
      
    • Why it works: A constantly failing configuration load loop will consume significant CPU as the Agent repeatedly attempts the same failed operation.

After resolving these, the next problem you’re likely to encounter is DogStatsD being overwhelmed if you’re sending a very high volume of DogStatsD metrics, or timeouts on specific checks if the system they query is slow to respond.
