The Datadog Agent’s disk check usually fails because the Agent process cannot read the filesystem metrics it needs, most often because of permissions, an unhealthy filesystem or mount, or a misconfigured check.

Here’s a breakdown of the common culprits and how to fix them:

1. Permissions Issues on /proc or /sys

The Datadog Agent, especially when running in containers or restricted environments, might lack the necessary read permissions for critical system directories like /proc and /sys. These directories are where the kernel exposes a wealth of information, including disk I/O statistics.

  • Diagnosis: Run sudo -u dd-agent ls -l /proc/mounts and sudo -u dd-agent ls -l /sys/block. If you see "Permission denied" errors, this is your problem. Replace dd-agent with the actual user the Datadog Agent runs as.

  • Fix: Ensure the Datadog Agent user has read access to these paths. If the Agent runs as a non-root user, add that user to a group with read access; changing permissions on the directories themselves affects the entire system and should be a last resort. For containerized Agents, this usually means starting the container with the appropriate volume mounts or capabilities. For example, in Docker:

    docker run -v /proc:/host/proc:ro -v /sys:/host/sys:ro ... datadog/agent:latest
    

    Then, point the Agent at these mounted paths (e.g., procfs_path: /host/proc in datadog.yaml).

  • Why it works: By granting read permissions, you allow the disk check’s underlying OS calls (like statfs or reading from /proc/diskstats) to succeed, retrieving the necessary filesystem metrics.
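
The permission diagnosis above can also be scripted. The following is a minimal sketch, not part of the Agent: the function name unreadable_paths is hypothetical, and it only reports what the current user can read, so it should be run as the Agent user (for example via sudo -u dd-agent).

```python
import os


def unreadable_paths(paths):
    """Return the subset of paths the current user cannot read.

    Paths that do not exist are also reported, because os.access
    returns False for them.
    """
    return [p for p in paths if not os.access(p, os.R_OK)]


# Paths mentioned in this section; adjust them if the host's /proc and /sys
# are mounted elsewhere (for example under /host in a container).
KERNEL_PATHS = ["/proc/mounts", "/proc/diskstats", "/sys/block"]
```

Calling unreadable_paths(KERNEL_PATHS) as the Agent user returns an empty list when all three paths are readable; any path in the result is a candidate cause of the failing check.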

2. Filesystem Corruption or Inode Issues

A corrupted filesystem or issues with inodes can prevent the kernel from accurately reporting disk usage or I/O statistics. This is a more serious underlying system problem.

  • Diagnosis: Check system logs for fsck errors or other filesystem-related warnings. Run df -h and du -sh / to see if reported disk usage seems wildly inaccurate. The Datadog Agent might report Stale file handle or similar errors if it’s trying to access a corrupted inode.

  • Fix: Schedule downtime and run fsck on the affected filesystem. For example, to check and repair /dev/sda1:

    sudo umount /mnt/data  # Unmount the filesystem first
    sudo fsck /dev/sda1
    

    After the repair, remount the filesystem (or reboot if the root filesystem was affected) and check Datadog again.

  • Why it works: fsck (filesystem check) scans the filesystem for inconsistencies, errors, and corrupted data structures, repairing them and restoring the filesystem to a healthy state. This allows the OS to correctly report metrics.

3. Mount Point Issues (e.g., Network Mounts)

If the disk check is configured to monitor a network mount (like NFS or CIFS) that is temporarily unavailable or has a stale mount, the check can fail.

  • Diagnosis: Run mount and look for the specific mount point. Check its status. Try ls -l /path/to/mountpoint. If it hangs or returns errors, the mount is likely the issue. The Datadog Agent logs might show errors related to I/O timeouts or Stale file handle.

  • Fix: Unmount and remount the problematic network share.

    sudo umount /path/to/mountpoint  # If this hangs on a stale NFS mount, try umount -f or umount -l
    sudo mount -a                    # Or use the specific mount command for your share
    

    If the issue is persistent, investigate the network share’s availability and the NFS/CIFS server health.

  • Why it works: Re-establishing a clean connection to the network share ensures that the underlying filesystem operations can complete successfully, allowing Datadog to collect metrics.
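
Because a stale network mount can make a filesystem call hang rather than fail, it helps to probe the mount point with a timeout. The sketch below is one way to do that under stated assumptions: mount_responds is a hypothetical helper, it runs a single statvfs call in a child Python process, and it treats a timeout or a non-zero exit as "not healthy". On some hard mounts the child may not be killable, so treat this as a diagnostic aid only.

```python
import subprocess
import sys

# Child program: one statvfs() call on the path passed as argv[1].
_PROBE = "import os, sys; os.statvfs(sys.argv[1])"


def mount_responds(path, timeout_s=5.0):
    """Return True if statvfs(path) succeeds within timeout_s seconds."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", _PROBE, path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The call did not return in time: typical of a hung network mount.
        return False
    # A non-zero exit means statvfs raised (e.g. stale handle, missing path).
    return result.returncode == 0
```

Running the call in a separate process keeps the calling script responsive even when the mount itself is stuck.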

4. Incorrect disk Check Configuration (Exclusion/Inclusion)

The disk check supports include and exclude options such as file_system_exclude, device_exclude, and mount_point_exclude (older Agent versions use excluded_filesystems, excluded_disks, and excluded_mountpoint_re). If these are misconfigured, they might exclude the very filesystem you want to monitor, or include temporary/virtual filesystems that cause issues.

  • Diagnosis: Review /etc/datadog-agent/conf.d/disk.d/conf.yaml for exclusion patterns that are too broad or accidentally match your active disks. The values are regular expressions, so a short pattern can match more than intended.

  • Fix: Make the include and exclude lists more specific. For example, to monitor common local filesystems and exclude tmpfs and devtmpfs:

    init_config:

    instances:
      # Monitor ext4, xfs, and btrfs filesystems; exclude tmpfs and devtmpfs
      - file_system_include:
          - ext4$
          - xfs$
          - btrfs$
        file_system_exclude:
          - tmpfs$
          - devtmpfs$
    

    Restart the Datadog Agent to apply the change: sudo systemctl restart datadog-agent.

  • Why it works: Correctly defining what to include and exclude ensures the Agent focuses its monitoring efforts on relevant and accessible filesystems, avoiding errors caused by attempting to monitor non-existent or problematic ones.
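
To see how include and exclude patterns interact before changing the Agent's configuration, a small model can help. This is a simplified sketch and not the Agent's implementation: monitored_mounts and its parameters are hypothetical, patterns are assumed to be regular expressions anchored at the start of the filesystem type, and exclusions are assumed to win over inclusions.

```python
import re


def monitored_mounts(mounts_text, fs_exclude=(), fs_include=()):
    """Filter /proc/mounts-style lines by filesystem type.

    Assumptions of this model: exclude patterns take precedence, and an
    empty include list means every filesystem type is included.
    """
    exclude = [re.compile(p) for p in fs_exclude]
    include = [re.compile(p) for p in fs_include]
    kept = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # Skip blank or malformed lines.
        device, mount_point, fs_type = fields[:3]
        if any(p.match(fs_type) for p in exclude):
            continue
        if include and not any(p.match(fs_type) for p in include):
            continue
        kept.append((device, mount_point, fs_type))
    return kept
```

Feeding it the contents of /proc/mounts together with a candidate pattern list shows which mounts would survive the filter. A broad pattern such as .*fs removes xfs and tmpfs alike, which is the kind of accidental match described above.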

5. Insufficient System Resources (High Load/OOM)

In extremely high-load scenarios, the system might be struggling to serve filesystem information, or the Datadog Agent itself might be starved of resources, leading to failed checks.

  • Diagnosis: Check overall system CPU and memory usage. Find the Datadog Agent process (for example with pgrep -f datadog-agent or systemctl status datadog-agent) and check its resource consumption. Check dmesg for Out-Of-Memory (OOM) killer activity.

  • Fix: Optimize system performance, reduce load, or increase resources (CPU/RAM). If the Datadog Agent is consuming too much, consider adjusting its min_collection_interval or disabling less critical checks. To increase memory for the agent in some deployments:

    # Open a drop-in override for the systemd service unit
    sudo systemctl edit datadog-agent
    # In the editor, add or adjust resource limits as needed, for example:
    [Service]
    LimitAS=infinity
    LimitRSS=infinity
    

    Then sudo systemctl daemon-reload && sudo systemctl restart datadog-agent.

  • Why it works: Ensuring the Datadog Agent has sufficient resources to run its checks and that the underlying OS can respond to metric requests prevents transient failures due to resource contention.
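
Scanning the kernel log for OOM killer activity is easy to automate. The sketch below assumes the log text has already been captured (for example from dmesg); the function name oom_events and the exact phrases matched are illustrative choices, and kernel message wording can vary between versions.

```python
import re

# Phrases the Linux kernel commonly logs around OOM kills (assumed, not exhaustive).
OOM_PATTERN = re.compile(r"out of memory|invoked oom-killer|oom-kill", re.IGNORECASE)


def oom_events(kernel_log_text):
    """Return the kernel log lines that look like OOM killer activity."""
    return [line for line in kernel_log_text.splitlines() if OOM_PATTERN.search(line)]
```

Passing it the standard output of dmesg, captured with subprocess.run(["dmesg"], capture_output=True, text=True), lists the matching lines. A line naming the Agent process suggests the Agent itself was killed rather than merely slowed down.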

6. Corrupted Datadog Agent Configuration Files

While less common, the Datadog Agent’s own configuration files for the disk check might have become corrupted or contain invalid syntax.

  • Diagnosis: Inspect the contents of /etc/datadog-agent/conf.d/disk.d/conf.yaml and any other conf.yaml files within that directory. Look for malformed YAML, incorrect indentation, or invalid parameter values.

  • Fix: Restore the conf.yaml file from a backup or a known good state. If unsure, you can temporarily rename the file to disable the check and see if the error disappears, then re-create it with valid syntax. Example of a minimal valid conf.yaml:

    init_config:
    
    instances:
      - {}
    

    After fixing, restart the Agent: sudo systemctl restart datadog-agent.

  • Why it works: Correctly formatted configuration files allow the Datadog Agent to parse its settings and execute the disk check logic without encountering parsing errors.
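
One frequent source of malformed YAML is tab characters used for indentation, which YAML does not allow. The following is a narrow sketch that flags only that problem; tab_indented_lines is a hypothetical helper and does not replace a full YAML parser or the Agent's own configuration validation.

```python
def tab_indented_lines(yaml_text):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        # Leading whitespace is everything removed by lstrip().
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad.append(number)
    return bad
```

Reading /etc/datadog-agent/conf.d/disk.d/conf.yaml and passing its text to this function points directly at the lines to re-indent with spaces.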

After resolving these issues, the next error you might encounter is related to the systemd check if it’s also failing, or potentially a kubelet check if you’re in a Kubernetes environment and other node-level metrics are affected.
