The Datadog Agent's disk check usually fails because the Agent process cannot read the filesystem metrics it needs, most often because of permissions or filesystem corruption.
Here’s a breakdown of the common culprits and how to fix them:
1. Permissions Issues on /proc or /sys
The Datadog Agent, especially when running in containers or restricted environments, might lack the necessary read permissions for critical system directories like /proc and /sys. These directories are where the kernel exposes a wealth of information, including disk I/O statistics.
- Diagnosis: Run `sudo -u dd-agent ls -l /proc/mounts` and `sudo -u dd-agent ls -l /sys/block`. If you see "Permission denied" errors, this is your problem. Replace `dd-agent` with the actual user the Datadog Agent runs as.
- Fix: Ensure the Datadog Agent user has read access to these directories. If the Agent runs as a non-root user, add that user to a group that has read access, or adjust the directory permissions (use with caution, as it affects the entire system). For containerized agents, this often means starting the container with appropriate volume mounts or capabilities. For example, in Docker:

      docker run -v /proc:/host/proc:ro -v /sys:/host/sys:ro ... datadog/agent:latest

  Then make sure the Agent reads from these mounted paths (e.g., `procfs_path: /host/proc` in `datadog.yaml`; check the option name against your Agent version).
- Why it works: By granting read permissions, you allow the `disk` check's underlying OS calls (like `statfs` or reading from `/proc/diskstats`) to succeed, retrieving the necessary filesystem metrics.
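The diagnosis step above can be wrapped in a small helper. This is a minimal sketch, not part of the Agent: `check_readable` is a hypothetical function name, and `dd-agent` is assumed to be the Agent user.

```shell
#!/usr/bin/env bash
# check_readable PATH...: print "OK <path>" or "DENIED <path>" for each path,
# and return non-zero if any path is missing or unreadable by the current user.
# (Hypothetical helper for illustration; not shipped with the Datadog Agent.)
check_readable() {
    local rc=0 p
    for p in "$@"; do
        if [ -r "$p" ]; then
            echo "OK $p"
        else
            echo "DENIED $p"
            rc=1
        fi
    done
    return "$rc"
}

# Example (bash): run the check as the Agent user. "dd-agent" is an assumption.
# sudo -u dd-agent bash -c "$(declare -f check_readable); check_readable /proc/mounts /proc/diskstats /sys/block"
```

Running it as the Agent user rather than as root matters, because root can read almost everything and would hide the problem.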
2. Filesystem Corruption or Inode Issues
A corrupted filesystem or issues with inodes can prevent the kernel from accurately reporting disk usage or I/O statistics. This is a more serious underlying system problem.
- Diagnosis: Check system logs for `fsck` errors or other filesystem-related warnings. Run `df -h` and `du -sh /` to see if reported disk usage seems wildly inaccurate. The Datadog Agent might report `Stale file handle` or similar errors if it's trying to access a corrupted inode.
- Fix: Schedule downtime and run `fsck` on the affected filesystem. For example, to check and repair `/dev/sda1` (mounted here at `/mnt/data`):

      sudo umount /mnt/data   # Unmount the filesystem first
      sudo fsck /dev/sda1

  After the repair, reboot the system and check Datadog again.
- Why it works: `fsck` (filesystem check) scans the filesystem for inconsistencies, errors, and corrupted data structures, repairing them and restoring the filesystem to a healthy state. This allows the OS to correctly report metrics.
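To make the log check above less manual, a filter along these lines can pull likely filesystem errors out of kernel output. It is a rough sketch: `fs_error_lines` is a hypothetical name, and the patterns are assumptions based on typical ext4, XFS, and block-layer messages, so they will not catch every filesystem or kernel version.

```shell
#!/usr/bin/env bash
# fs_error_lines: read log text on stdin and print only lines that look like
# filesystem or block-device errors. The patterns are illustrative assumptions,
# not an exhaustive list. Always exits 0, even when nothing matches.
fs_error_lines() {
    grep -Ei 'EXT4-fs error|XFS.*corruption|Buffer I/O error|I/O error, dev' || true
}

# Example usage (reading the kernel log may require root):
# sudo dmesg | fs_error_lines
# sudo journalctl -k --no-pager | fs_error_lines
```

An empty result does not prove the filesystem is healthy; it only means none of these particular messages were logged.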
3. Mount Point Issues (e.g., Network Mounts)
If the disk check is configured to monitor a network mount (like NFS or CIFS) that is temporarily unavailable or has a stale mount, the check can fail.
- Diagnosis: Run `mount` and look for the specific mount point. Check its status. Try `ls -l /path/to/mountpoint`. If it hangs or returns errors, the mount is likely the issue. The Datadog Agent logs might show errors related to I/O timeouts or `Stale file handle`.
- Fix: Unmount and remount the problematic network share.

      sudo umount /path/to/mountpoint
      sudo mount -a   # Or use the specific mount command for your share

  If the issue is persistent, investigate the network share's availability and the NFS/CIFS server health.
- Why it works: Re-establishing a clean connection to the network share ensures that the underlying filesystem operations can complete successfully, allowing Datadog to collect metrics.
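Because `ls` on a stale mount can hang your terminal, it may help to probe mount points with a timeout. The sketch below assumes GNU coreutils (`timeout`, `stat`) and a Linux `/proc/mounts`; `network_mounts` and `probe_mount` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# network_mounts [FILE]: print the mount points of NFS/CIFS filesystems listed
# in FILE (defaults to /proc/mounts). The type list is an assumption; extend it
# for other network filesystems.
network_mounts() {
    awk '$3 ~ /^(nfs|nfs4|cifs|smb3)$/ { print $2 }' "${1:-/proc/mounts}"
}

# probe_mount PATH [SECONDS]: stat PATH with a timeout (default 5 seconds).
# Prints "responsive <path>", or "unresponsive <path>" and returns 1 if stat
# fails or does not finish in time.
probe_mount() {
    local mp=$1 secs=${2:-5}
    if timeout "$secs" stat -- "$mp" >/dev/null 2>&1; then
        echo "responsive $mp"
    else
        echo "unresponsive $mp"
        return 1
    fi
}

# Example usage: probe every network mount on this host.
# network_mounts | while read -r mp; do probe_mount "$mp"; done
```

On some hard-mounted NFS shares a stuck process cannot be interrupted, so the timeout itself may not return promptly; treat this as a convenience, not a guarantee.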
4. Incorrect disk Check Configuration (Exclusion/Inclusion)
The disk check has exclusion options for filesystems and devices (`file_system_exclude` and `device_exclude` in current Agent versions; older releases used names such as `excluded_filesystems` and `excluded_disks`). If these are misconfigured, they might exclude the very filesystem the Agent is supposed to monitor, or include temporary/virtual filesystems that cause issues.
- Diagnosis: Review the Agent's `conf.d/disk.d/conf.yaml` file for any exclusion patterns that might be too broad or accidentally match your active disks.
- Fix: Make the exclusion lists more specific. For example, to monitor common local filesystems and exclude only tmpfs-style filesystems:

      init_config:

      instances:
        # Monitor common local filesystems, excluding tmpfs
        - use_mount: false
          file_system_include:
            - ext4
            - xfs
            - btrfs
          file_system_exclude:
            - tmpfs
            - devtmpfs

  Restart the Datadog Agent to apply the change: `sudo systemctl restart datadog-agent`.
- Why it works: Correctly defining what to include and exclude ensures the Agent focuses its monitoring efforts on relevant and accessible filesystems, avoiding errors caused by attempting to monitor non-existent or problematic ones.
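Before editing the exclusion patterns, it can help to see which filesystem types are actually mounted, so you know what a pattern would match. This sketch assumes a Linux `/proc/mounts`; `mounted_fs_types` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# mounted_fs_types [FILE]: print the distinct filesystem types listed in FILE
# (defaults to /proc/mounts), one per line, sorted.
mounted_fs_types() {
    awk '{ print $3 }' "${1:-/proc/mounts}" | sort -u
}

# Example usage: list the types on this host, then run the check once by hand
# to see what the Agent collects with the current configuration.
# mounted_fs_types
# sudo datadog-agent check disk
```

Comparing this list against your include and exclude patterns shows quickly whether a pattern is broader than intended.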
5. Insufficient System Resources (High Load/OOM)
In extremely high-load scenarios, the system might be struggling to serve filesystem information, or the Datadog Agent itself might be starved of resources, leading to failed checks.
- Diagnosis: Check overall system CPU and memory usage. Find the Datadog Agent process (for example with `pgrep -f datadog-agent`) and look at its resource consumption. Check `dmesg` for Out-Of-Memory (OOM) killer activity.
- Fix: Optimize system performance, reduce load, or increase resources (CPU/RAM). If the Datadog Agent is consuming too much, consider raising its `min_collection_interval` so checks run less often, or disabling less critical checks. To raise memory limits for the Agent in some deployments:

      # Example for systemd service file modification
      sudo systemctl edit datadog-agent
      # Add or modify limits such as LimitAS, or MemoryMax if a cgroup memory cap is set
      # Example:
      [Service]
      LimitAS=infinity
      MemoryMax=infinity

  Then run `sudo systemctl daemon-reload && sudo systemctl restart datadog-agent`.
- Why it works: Ensuring the Datadog Agent has sufficient resources to run its checks and that the underlying OS can respond to metric requests prevents transient failures due to resource contention.
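The resource checks above can be scripted roughly as follows. This is a sketch under assumptions: it relies on Linux `/proc/<pid>/status`, the OOM patterns reflect typical kernel messages rather than a complete list, and `rss_kib` and `oom_lines` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# rss_kib PID: print the resident set size of PID in KiB, as reported by the
# VmRSS field of /proc/PID/status (Linux only).
rss_kib() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# oom_lines: read log text on stdin and print lines that look like OOM-killer
# activity. Patterns are illustrative assumptions. Always exits 0.
oom_lines() {
    grep -Ei 'out of memory|oom-killer' || true
}

# Example usage ("datadog-agent" as a process match is an assumption):
# for pid in $(pgrep -f datadog-agent); do echo "$pid $(rss_kib "$pid") KiB"; done
# sudo dmesg | oom_lines
```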
6. Corrupted Datadog Agent Configuration Files
While less common, the Datadog Agent’s own configuration files for the disk check might have become corrupted or contain invalid syntax.
- Diagnosis: Inspect the contents of `/etc/datadog-agent/conf.d/disk.d/conf.yaml` and any other `conf.yaml` files within that directory. Look for malformed YAML, incorrect indentation, or invalid parameter values.
- Fix: Restore the `conf.yaml` file from a backup or a known good state. If unsure, you can temporarily rename the file to disable the check and see if the error disappears, then re-create it with valid syntax. Example of a minimal valid `conf.yaml`:

      init_config:

      instances:
        - use_mount: false

  After fixing, restart the Agent: `sudo systemctl restart datadog-agent`.
- Why it works: Correctly formatted configuration files allow the Datadog Agent to parse its settings and execute the `disk` check logic without encountering parsing errors.
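One common cause of malformed YAML is a tab used for indentation, which YAML does not allow. The helper below is only a cheap heuristic, not a YAML parser, and `tab_indented_lines` is a hypothetical name; the Agent's own `configcheck` subcommand remains the more reliable test.

```shell
#!/usr/bin/env bash
# tab_indented_lines FILE: print "N:line" for every line in FILE whose leading
# whitespace contains a tab character. Always exits 0, even with no matches.
tab_indented_lines() {
    grep -n "^[[:space:]]*$(printf '\t')" "$1" || true
}

# Example usage:
# tab_indented_lines /etc/datadog-agent/conf.d/disk.d/conf.yaml
# sudo datadog-agent configcheck   # let the Agent report configs it cannot load
```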
Once the disk check recovers, look at whether other node-level checks are failing too: the next error you might encounter is from the systemd check, or from the kubelet check if you're in a Kubernetes environment and other node-level metrics are affected.