Your Datadog Agent is hogging CPU and memory, making your monitoring tool itself a performance problem. This usually means a specific check or integration is misbehaving, overwhelming the Agent’s processing or memory allocation.
Common Causes & Fixes
- Too Many Checks Enabled/Low Collection Interval:
  - Diagnosis: Check the Agent log (`/var/log/datadog/agent.log` on Linux) for errors or repeated check executions. Also examine the Agent’s configuration directory (`/etc/datadog-agent/conf.d/`) for an excessive number of enabled check configurations or very low `min_collection_interval` values in the checks’ `.yaml` files.
  - Fix:
    - Disable Unnecessary Checks: For checks you don’t need, remove their `.yaml` configuration file from `/etc/datadog-agent/conf.d/`, or rename it so it no longer ends in `.yaml` (for example, `conf.yaml.disabled`), then restart the Agent.
    - Increase Collection Interval: For essential checks, increase their `min_collection_interval`. For example, to collect a check every 60 seconds instead of the default 15:

          # In the check's own .yaml file, set per instance
          instances:
            - min_collection_interval: 60

  - Why it works: Each check consumes CPU and memory. Reducing the number of active checks or the frequency at which they run directly lowers the Agent’s resource footprint.
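To find which check files set aggressive intervals, a small script can scan the configuration directory. The following is a minimal sketch and not part of the Agent: the function names `short_intervals` and `scan_conf_dir` are hypothetical, the 15-second threshold simply mirrors the default interval, and the line-based regex is an assumption that will miss intervals written in unusual YAML layouts.

```python
import re
from pathlib import Path

# Matches lines such as "  - min_collection_interval: 5".
# Commented-out lines start with "#" and therefore do not match.
INTERVAL_RE = re.compile(r"^\s*(?:-\s*)?min_collection_interval:\s*(\d+(?:\.\d+)?)")


def short_intervals(config_text, threshold=15.0):
    """Return every min_collection_interval value in config_text below threshold."""
    values = []
    for line in config_text.splitlines():
        match = INTERVAL_RE.match(line)
        if match and float(match.group(1)) < threshold:
            values.append(float(match.group(1)))
    return values


def scan_conf_dir(conf_dir="/etc/datadog-agent/conf.d", threshold=15.0):
    """Map each .yaml/.yml file under conf_dir to its below-threshold intervals."""
    root = Path(conf_dir)
    if not root.is_dir():
        return {}
    report = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in (".yaml", ".yml"):
            found = short_intervals(path.read_text(errors="replace"), threshold)
            if found:
                report[str(path)] = found
    return report


if __name__ == "__main__":
    for path, values in scan_conf_dir().items():
        print(f"{path}: {values}")
```

Files listed by the script are candidates for a longer interval; whether a given check really needs sub-15-second collection is still a judgment call.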
- Large Log Files Being Parsed/Aggregated:
  - Diagnosis: Look for log file entries in the Agent log (`agent.log`) that show excessive processing or errors. Confirm that `logs_enabled: true` is set in `datadog.yaml`, then review which files each `logs:` entry under `conf.d/` tails and which `log_processing_rules` apply to them.
  - Fix:
    - Limit Log Files: Reduce the number of log files being tailed. Log sources are declared in a configuration file under `conf.d/` rather than in `datadog.yaml`:

          # e.g. /etc/datadog-agent/conf.d/myapp.d/conf.yaml
          logs:
            - type: file
              path: /var/log/myapp/*.log  # Be specific!
              service: myapp
              source: myapp

    - Add Exclusion Rules: Exclude noisy or irrelevant log files with `exclude_paths`. To drop individual lines instead of whole files, use a `log_processing_rules` entry of type `exclude_at_match`.

          logs:
            - type: file
              path: /var/log/myapp/*.log
              service: myapp
              source: myapp
              exclude_paths:
                - /var/log/myapp/debug.log  # Exclude noisy debug logs

    - Increase the Log Processing Timeout: If parsing is slow, increase the timeout for processing a single log file. Verify that your Agent version documents the `log_processing_timeout` key before relying on it.

          logs:
            enabled: true
            # ... other log settings
            advanced:
              log_processing_timeout: 60  # seconds, default is 10

  - Why it works: The Agent’s log collection and processing is CPU and memory intensive. By limiting the volume or complexity of logs it processes, you reduce this load.
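Before tightening a path pattern, it helps to know which matching files are actually large. The sketch below is one way to check, assuming plain file globbing approximates what the Agent would tail; the helper names `rank_by_size` and `sizes_for_pattern` are hypothetical, and the pattern is the example path from this section.

```python
import glob
import os


def rank_by_size(sizes, top=5):
    """Return the `top` (path, bytes) pairs with the largest byte counts."""
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:top]


def sizes_for_pattern(pattern):
    """Expand a glob pattern and return {path: size_in_bytes} for regular files."""
    return {p: os.path.getsize(p) for p in glob.glob(pattern) if os.path.isfile(p)}


if __name__ == "__main__":
    # Example pattern only; substitute the path from your own logs configuration.
    for path, size in rank_by_size(sizes_for_pattern("/var/log/myapp/*.log")):
        print(f"{size:>12} {path}")
```

Files that dominate the ranking are the first candidates for an exclusion rule or a narrower pattern.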
- High Cardinality Tags:
  - Diagnosis: Examine the "Tags" section of metrics in Datadog. If you see a massive number of unique tag values for a single metric (e.g., `user_id:12345`, `user_id:12346`, etc.), this is high cardinality. Check the Agent log (`agent.log`) for messages related to tag processing.
  - Fix: Implement tag reduction strategies.
    - Use `ddtrace` or APM: For application-specific identifiers, leverage distributed tracing and APM, which are designed for this and often have less overhead than raw metric tags.
    - Tagging Policies: Define strict tagging policies. Avoid dynamic or high-cardinality tags like `request_id` or `session_id` on metrics.
    - Agent-Level Tagging: Use `checks_tag_cardinality` and `dogstatsd_tag_cardinality` in `datadog.yaml` to control how many container-level tags the Agent attaches to metrics.

          # Example: keep container tags at the lowest cardinality level
          checks_tag_cardinality: low
          dogstatsd_tag_cardinality: low

  - Why it works: Each unique tag combination creates a distinct time series. High cardinality leads to an explosion of time series, consuming significant Agent memory and CPU for tracking and aggregation.
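The link between unique tag combinations and time series can be made concrete with a few lines of code. This is an illustrative sketch only: the function names and the `(metric, tags)` input shape are assumptions chosen for the example, not a Datadog API.

```python
from collections import defaultdict


def series_per_metric(points):
    """Count distinct tag sets per metric name; each distinct set is one time series."""
    seen = defaultdict(set)
    for metric, tags in points:
        seen[metric].add(frozenset(tags))
    return {metric: len(tag_sets) for metric, tag_sets in seen.items()}


def tag_key_cardinality(points, metric):
    """Count distinct values per tag key for one metric, to spot the offending key."""
    values = defaultdict(set)
    for name, tags in points:
        if name == metric:
            for tag in tags:
                key, _, value = tag.partition(":")
                values[key].add(value)
    return {key: len(vals) for key, vals in values.items()}


if __name__ == "__main__":
    # 1000 submissions tagged with a per-user id versus 50 tagged only by environment.
    points = [("checkout.latency", ["env:prod", f"user_id:{i}"]) for i in range(1000)]
    points += [("checkout.errors", ["env:prod"])] * 50
    print(series_per_metric(points))
    print(tag_key_cardinality(points, "checkout.latency"))
```

In this example the `user_id` tag alone turns one metric into a thousand series, while the metric tagged only by `env` stays at one no matter how often it is submitted.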
- Container Autodiscovery Issues:
  - Diagnosis: If running in a containerized environment (Docker, Kubernetes), check the Agent log (`agent.log`) for errors related to discovering containers or applying configurations. Look for repeated container inspection or Kubernetes API requests.
  - Fix:
    - Simplify Autodiscovery Config: Reduce the complexity of your AD templates. Avoid overly broad `ad_identifiers` that match far more containers than you intend to monitor.
    - Limit Discovery Scope: If possible, limit the scope of autodiscovery (e.g., by namespace in Kubernetes, or by specific labels in Docker). The `container_include` and `container_exclude` settings in `datadog.yaml` filter containers by image, name, or Kubernetes namespace.
    - Check Docker/Kubernetes API Rate Limits: Ensure the Agent isn’t hitting API rate limits on your container orchestrator, which can cause retries and increased CPU.
  - Why it works: The Agent constantly queries the container orchestrator for metadata to apply configurations. Inefficient discovery or excessive polling can become a major resource drain.
- Large Number of Enabled Integrations/Checks:
  - Diagnosis: Run `datadog-agent configcheck` to print every loaded check configuration, or `datadog-agent status` to see the checks the collector is running. Count them. If the list is hundreds long, this is a likely culprit.
  - Fix:
    - Disable Unused Integrations: For integrations you’ve installed but aren’t actively using, disable them by removing or disabling their `.yaml` files in `/etc/datadog-agent/conf.d/`, as described under the first cause.
    - Consolidate Checks: Where possible, use checks that can gather multiple related metrics instead of many single-metric checks.
  - Why it works: Each enabled integration, even if not actively collecting data, consumes some baseline memory and CPU for initialization and periodic checks.
- Agent Configuration Loading Issues:
  - Diagnosis: Examine the Agent log (`agent.log`) for frequent "reloading configuration" messages or errors during configuration parsing.
  - Fix: Ensure your `datadog.yaml` and any custom check `.yaml` files are syntactically correct. Malformed YAML can cause the Agent to repeatedly try and fail to load its configuration.

        # Example: Validate YAML syntax
        yamllint /etc/datadog-agent/datadog.yaml

  - Why it works: A constantly failing configuration load loop will consume significant CPU as the Agent repeatedly attempts the same failed operation.
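One common cause of malformed YAML is a tab character used for indentation, which YAML does not allow. Where a linter is not installed, a narrow pre-check like the sketch below can catch that one class of mistake. It is a heuristic rather than a YAML validator, and the function name `tab_indented_lines` is hypothetical.

```python
def tab_indented_lines(text):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Slice off everything after the leading run of spaces and tabs.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            bad.append(number)
    return bad


if __name__ == "__main__":
    sample = "logs_enabled: true\nlogs_config:\n\topen_files_limit: 100\n"
    print(tab_indented_lines(sample))
```

A clean result only rules out tab indentation; a full parse with a YAML linter is still needed to confirm the file is valid.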
After resolving these, the next problem you are likely to hit is DogStatsD being overwhelmed if you send a very high volume of DogStatsD metrics, or timeouts on specific checks if the underlying system is slow to respond.