Datadog’s Cloud Workload Security (CWS) agent is failing to send security events to the Datadog backend, leaving gaps in your security visibility. The agent runs as a privileged daemon on your hosts, and the most frequent culprits are resource exhaustion or network connectivity problems that stop it from transmitting its findings. Disk space, credentials, agent version, and kernel-level collection can also be responsible.
Common Causes and Fixes:
1. Agent Resource Limits (CPU/Memory)
- Diagnosis: Check the resource utilization of the Datadog agent processes on affected hosts, and examine the agent logs for OOM (out-of-memory) killer messages or resource constraint warnings.

      ps aux | grep datadog-agent
      # Look for high %CPU or %MEM for the datadog-agent process
      # Alternatively, use 'top' or 'htop'
      sudo journalctl -u datadog-agent -f
      # or check /var/log/datadog/agent.log

- Fix: Increase the resource allocation for the Datadog agent.
  - Kubernetes: Adjust `resources.limits` for the `datadog-agent` DaemonSet in your Helm values or Kubernetes manifest. For example, to increase CPU to 500m and memory to 1Gi (recent versions of the official Helm chart set per-container resources under `agents.containers`; verify the key path against your chart version's `values.yaml`):

        agents:
          containers:
            agent:
              resources:
                requests:
                  cpu: "200m"
                  memory: "512Mi"
                limits:
                  cpu: "500m"
                  memory: "1Gi"

    With CWS enabled, the `systemProbe` and `securityAgent` containers do much of the work, so they may need the same treatment.
  - VMs/Bare Metal: The agent’s main configuration file (`/etc/datadog-agent/datadog.yaml` on Agent 6 and 7) has no direct CPU or memory limit setting comparable to Kubernetes. You can reduce the agent’s load by lowering the log level or disabling integrations that cause excessive load. More commonly, you address this at the host level by ensuring the host has sufficient resources or by tuning the OS’s process scheduler.
- Why it works: The Datadog agent, especially with CWS enabled, can be resource-intensive. If it’s being throttled by the container orchestrator or the host OS due to insufficient CPU or memory, it might not be able to process and send security events reliably. Providing more resources allows the agent to operate within its requirements.
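To make the OOM check above repeatable, here is a minimal sketch that counts kernel OOM-killer lines naming a Datadog process. The process names in the pattern (`agent`, `system-probe`, `security-agent`, and so on) are assumptions about how the binaries appear in kernel messages; adjust them to what `ps` shows on your host.

```shell
# Count kernel OOM-killer lines that name a Datadog agent process.
# Reads log text on stdin, e.g. from `dmesg` or `journalctl -k`.
# The match is case-insensitive so it also catches cgroup-level kills
# ("Memory cgroup out of memory: Killed process ...").
# Assumption: the process names listed here match your installation.
count_agent_oom_kills() {
  grep -Eic 'out of memory: kill(ed)? process [0-9]+ \((agent|system-probe|security-agent|process-agent|trace-agent)\)' || true
}
```

For example, `sudo dmesg | count_agent_oom_kills` prints the number of matching lines. A non-zero count suggests the agent is being killed for exceeding its memory allowance.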
2. Network Connectivity Issues to Datadog Endpoints
- Diagnosis: The agent needs to reach Datadog’s intake servers. Use `curl` or `wget` from the affected host to test connectivity to the relevant Datadog endpoints. The exact endpoint depends on your Datadog site and region (e.g., `agent-http-intake.logs.datadoghq.com` for US1). Also check firewall rules, security groups, and network ACLs on your cloud provider (AWS, Azure, GCP) and any on-premises firewalls. Ensure outbound traffic on port 443 (HTTPS) is allowed to Datadog’s IP ranges.

      # Example for US1 site, logs intake
      curl -v https://agent-http-intake.logs.datadoghq.com/v1/input/
      # Example for US1 site, API endpoint (pass the key in a header, never in the URL)
      curl -v -H "DD-API-KEY: <YOUR_API_KEY>" https://api.datadoghq.com/api/v1/validate

- Fix: Open necessary outbound ports and ensure no network policies are blocking communication.
  - Cloud Providers: In AWS Security Groups, Azure Network Security Groups, or GCP Firewall Rules, add an outbound rule allowing TCP traffic on port 443 to the CIDR blocks used by Datadog. You can find Datadog’s current IP ranges in their documentation.
  - On-Premises: Configure your network firewall to allow outbound traffic on port 443 to Datadog’s IP ranges.
- Why it works: The Datadog agent transmits security events over HTTPS to Datadog’s ingestion endpoints. If these connections are blocked by network security controls, the events cannot be sent and will be dropped.
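When a `curl` probe fails, its exit code usually points at the layer that is blocking traffic. The sketch below maps a few standard curl exit codes to likely causes. The function names are hypothetical helpers, and the wording of each diagnosis is only a starting point for investigation.

```shell
# Translate a curl exit code into a likely network-layer cause.
explain_curl_exit() {
  case "$1" in
    0)     echo "ok: reached the server over HTTPS" ;;
    6)     echo "dns: could not resolve host" ;;
    7)     echo "connect: connection refused or blocked" ;;
    28)    echo "timeout: traffic is likely dropped by a firewall" ;;
    35|60) echo "tls: handshake or certificate problem" ;;
    *)     echo "other: curl exit code $1" ;;
  esac
}

# Probe one URL and print the interpretation.
# HTTP error statuses (4xx/5xx) still count as "ok" here, because the
# goal is to test reachability, not authentication.
probe_endpoint() (
  rc=0
  curl --silent --output /dev/null --max-time 10 "$1" || rc=$?
  printf '%s -> %s\n' "$1" "$(explain_curl_exit "$rc")"
)
```

Calling `probe_endpoint https://agent-http-intake.logs.datadoghq.com/v1/input/` on an affected host and on a healthy one makes it easy to see whether the failure is in DNS, the TCP path, or TLS (for example, an intercepting proxy).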
3. Insufficient Disk Space for Agent Log/State Files
- Diagnosis: The Datadog agent writes logs and state information to disk. If the partition where these files reside becomes full, the agent can fail to operate correctly. Examine the agent logs for file-write errors such as "no space left on device."

      df -h
      # Check usage for partitions like /var/log, /opt/datadog-agent, or the agent's data directory.

- Fix: Free up disk space on the affected partition or increase its size.
  - Linux: Remove old log files (e.g., rotated agent logs), temporary files, or unneeded data.
  - Cloud VMs: Resize the attached volume or add a new volume and configure the agent to use it.
- Why it works: The agent needs to write temporary data, cache events before sending, and log its operations. A full disk prevents these essential file operations, halting the agent’s ability to function and transmit data.
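The disk check can be scripted so that nearly full partitions stand out. This is a minimal sketch that reads `df -P` output (the POSIX single-line format) and prints mount points at or above a threshold. The function name and the 90% default are illustrative choices, not Datadog recommendations.

```shell
# Print "mountpoint usage%" for filesystems at or above a usage threshold.
# Reads `df -P` output on stdin; the threshold defaults to 90 (percent).
# Limitation: mount points containing spaces are not handled.
full_mounts() {
  awk -v t="${1:-90}" 'NR > 1 {
    use = $5
    sub(/%/, "", use)            # "95%" -> "95"
    if (use + 0 >= t) print $6 " " $5
  }'
}
```

Running `df -P | full_mounts 85` lists every filesystem that is at least 85% full, which you can compare against the paths the agent writes to.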
4. Incorrect Datadog API Key or Agent Configuration
- Diagnosis: The agent needs a valid API key to authenticate with Datadog. An incorrect or revoked key causes authentication failures. Examine the agent logs for authentication errors or messages indicating an invalid API key.

      # Check the agent configuration file for the api_key setting
      sudo grep "api_key" /etc/datadog-agent/datadog.yaml
      # or check your Kubernetes manifest/Helm values

- Fix: Ensure the `api_key` in the Datadog agent’s configuration is correct and has not been revoked.
  - Kubernetes: Update `datadog.apiKey` in your Helm values or the Kubernetes secret used by the agent.
  - VMs/Bare Metal: Edit `/etc/datadog-agent/datadog.yaml` and replace the `api_key` value with the correct one.
  - Restart the Datadog agent after making changes: `sudo systemctl restart datadog-agent`.
- Why it works: The API key is the agent’s credential for communicating with Datadog. Without a valid key, the agent cannot authenticate with the Datadog API, and all its data submissions will be rejected.
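Many key problems are copy-paste errors such as a truncated value, stray whitespace, or surrounding quotes. The sketch below performs a cheap local format check before you contact Datadog. It assumes API keys are 32 hexadecimal characters, and the function name is a hypothetical helper.

```shell
# Return success when the argument looks like a Datadog API key.
# Assumption: API keys are exactly 32 hexadecimal characters.
# This only checks the shape of the key, not whether Datadog accepts it.
api_key_looks_valid() {
  printf '%s' "$1" | grep -Eq '^[0-9a-fA-F]{32}$'
}
```

A key that passes this check can still be revoked or belong to another organization. To confirm it with Datadog, send it in the `DD-API-KEY` header to the `/api/v1/validate` endpoint of your site, for example `curl -s -H "DD-API-KEY: $DD_API_KEY" https://api.datadoghq.com/api/v1/validate` for US1.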
5. Agent Version Compatibility or Bugs
- Diagnosis: Older versions of the Datadog agent might have known bugs related to CWS event handling or compatibility issues with newer Datadog features or host operating systems. Check the Datadog agent release notes for known issues, and review the agent logs for specific error messages that might indicate a bug.

      # Check current agent version
      sudo datadog-agent version

- Fix: Upgrade the Datadog agent to the latest stable version.
  - Kubernetes: Update your Helm chart version or the agent image tag in your deployment.
  - VMs/Bare Metal: Follow the Datadog upgrade instructions for your specific OS (e.g., `apt update && apt install datadog-agent` for Debian/Ubuntu, `yum update datadog-agent` for RHEL/CentOS).
- Why it works: Datadog regularly releases updates that fix bugs, improve performance, and ensure compatibility with evolving environments. Upgrading ensures you have the most robust and supported version of the agent.
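To check a fleet against a minimum version taken from the release notes, a comparison like the following can help. It is a sketch under two assumptions: the `datadog-agent version` output starts with `Agent <version>`, and `sort -V` (GNU coreutils) is available. Both function names are hypothetical.

```shell
# Extract the version number from `datadog-agent version` output.
# Assumption: the output begins with "Agent <x.y.z>".
agent_version_from_output() {
  sed -n 's/^Agent \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Succeed when version $1 is greater than or equal to version $2.
# Relies on `sort -V` to order version strings.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}
```

For example, `version_at_least "$(sudo datadog-agent version | agent_version_from_output)" 7.43.0 || echo "upgrade needed"`, where `7.43.0` stands in for whatever minimum the release notes indicate.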
6. CWS Kernel-Level Collection Issues (Linux)
- Diagnosis: Cloud Workload Security relies on kernel-level instrumentation for deep inspection. In current Agent 7 releases this is implemented with eBPF programs loaded by the `system-probe` process rather than a standalone loadable kernel module, so `lsmod` will not list a Datadog module. If `system-probe` fails to start or cannot load its probes (for example, on an unsupported kernel), CWS events will not be generated or processed correctly. Examine the `system-probe` and `security-agent` logs for errors.

      # Check the running kernel against Datadog's supported list
      uname -r
      # Check that the CWS-related services are running
      sudo systemctl status datadog-agent-sysprobe datadog-agent-security
      # Check their logs and the kernel log for probe-loading errors
      sudo tail -n 100 /var/log/datadog/system-probe.log /var/log/datadog/security-agent.log
      sudo dmesg | grep -i bpf

- Fix: Confirm that CWS is enabled and restart the components that load the probes.
  - Configuration: Verify that `runtime_security_config.enabled` is set to `true` in both `/etc/datadog-agent/system-probe.yaml` and `/etc/datadog-agent/security-agent.yaml`.
  - Restart: A simple agent restart might suffice: `sudo systemctl restart datadog-agent`. If `system-probe` or `security-agent` does not come back, restart `datadog-agent-sysprobe` and `datadog-agent-security` explicitly.
  - Unsupported kernels: If the logs show that probes cannot be loaded on your kernel, upgrade the agent or the host kernel. Consult Datadog’s documentation for your specific OS and agent version.
- Why it works: The kernel-level probes are the eyes and ears for security events. If they are not loaded or functioning correctly, the agent has no visibility into the system’s security-relevant activities, and thus no events can be generated or sent.
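Because kernel-level collection depends on the running kernel, it is worth ruling out an unsupported host early. The sketch below compares a kernel release string with a minimum major and minor version. The function name is hypothetical, and the `4 15` used in the usage note is only a placeholder, so take the real minimum for your distribution from Datadog’s compatibility documentation.

```shell
# Succeed when kernel release $1 (e.g. "5.15.0-91-generic") is at least
# version $2.$3 (major and minor, passed as separate arguments).
kernel_at_least() (
  release=${1%%-*}       # drop the distro suffix: "5.15.0-91-generic" -> "5.15.0"
  major=${release%%.*}   # "5"
  rest=${release#*.}     # "15.0"
  minor=${rest%%.*}      # "15"
  [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
)
```

For example, `kernel_at_least "$(uname -r)" 4 15 || echo "kernel may be too old for CWS"` flags hosts whose kernel falls below the placeholder minimum.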
After applying these fixes, you should monitor your Datadog Security -> Events page to confirm that new events are appearing. The next common issue you might encounter is related to event processing delays or missing specific event types if the agent is still under heavy load or if specific CWS detection rules are misconfigured.