containerd is dropping connections to the underlying container runtime, often runc, because the runtime’s management socket is unavailable or corrupted. This prevents containerd from starting, stopping, or even listing containers and images.

Here’s how to fix it, starting with the most common culprits:

  1. Resource Exhaustion (OOM Killer): The most frequent offender is the Linux Out-Of-Memory (OOM) killer. containerd or its associated processes (like runc) are being killed by the kernel because the system is running out of RAM.

    • Diagnosis: Check dmesg for "Out of memory" messages and "Killed process" entries related to containerd, runc, or containerd-shim.
      sudo dmesg -T | grep -i 'oom\|killed'
      
    • Fix: Increase system RAM, reduce memory usage of other processes, or tune containerd’s memory limits if it’s running within a cgroup. If running on Kubernetes, check pod resource requests/limits.
    • Why it works: The OOM killer terminates processes to free up memory. By providing more memory or reducing demand, the kernel has less reason to kill containerd’s components.
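
As a rough sketch of narrowing the kernel log to the processes named above, the helper below filters OOM-kill lines for containerd, containerd-shim, and runc. The function name `oom_victims` is hypothetical, and the pattern assumes the common kernel message form "Killed process <pid> (<name>)"; adjust it if your kernel logs differently.

```shell
#!/bin/sh
# Filter kernel log text (read from stdin) down to OOM-kill lines that
# name a containerd-related process.
# Assumption: the kernel reports kills as "Killed process <pid> (<name>)".
oom_victims() {
  # "|| true" keeps the exit status at 0 when nothing matches.
  grep -Ei 'killed process [0-9]+ \((containerd|containerd-shim[^)]*|runc)\)' || true
}
```

It could be used as `sudo dmesg -T | oom_victims`; empty output suggests the OOM killer did not target these processes in the retained log window.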
  2. Corrupted containerd State/Storage: The directory where containerd stores its state and image data (/var/lib/containerd by default) can become corrupted due to disk errors, unexpected shutdowns, or filesystem issues.

    • Diagnosis: Look for I/O errors in dmesg or check filesystem integrity. Try to manually inspect the containerd state directory for unusual files or permissions.
      sudo find /var/lib/containerd -type f -size 0 -print # Look for zero-byte files
      sudo journalctl -u containerd -f # Watch for immediate errors on restart
      
    • Fix: The safest fix is to stop containerd, back up any data you need (the content store in /var/lib/containerd/io.containerd.content.v1.content and the snapshot data in /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs, assuming the default root directory and the overlayfs snapshotter), and then remove and reinitialize the containerd state.
      sudo systemctl stop containerd
      # Backup critical data if needed; confirm the directory names first with: ls /var/lib/containerd
      sudo rm -rf /var/lib/containerd/io.containerd.content.v1.content # Re-downloads images
      sudo rm -rf /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs # Re-creates container filesystems
      sudo rm -rf /var/lib/containerd/io.containerd.metadata.v1.bolt # Re-initializes containerd metadata
      sudo systemctl start containerd
      
    • Why it works: Corrupted data structures prevent containerd from correctly accessing its internal state or image layers. Re-initializing clears these corruptions, forcing containerd to rebuild its necessary files and re-download images/layers.
  3. Stale containerd Socket or PID File: containerd relies on a Unix domain socket for communication. If the process crashes uncleanly, its socket file or PID file might remain, preventing a new instance from starting or binding to the socket.

    • Diagnosis: Check for the existence of the containerd PID file and socket file.
      ls -l /run/containerd/containerd.sock
      ls -l /run/containerd/containerd.pid # Only present if your service setup writes a PID file
      
    • Fix: Stop containerd if it’s running (even if it’s hanging), and then manually remove the stale socket and PID files before starting it again.
      sudo systemctl stop containerd
      sudo rm -f /run/containerd/containerd.sock
      sudo rm -f /run/containerd/containerd.pid
      sudo systemctl start containerd
      
    • Why it works: A new containerd process cannot create its socket or write its PID if these files already exist. Removing them allows the new process to correctly establish its communication channel and record its presence.
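
Before deleting anything, it can help to confirm that the socket file really is left over. The sketch below classifies a path and flags a socket as possibly stale when no process named exactly `containerd` is found. Both function names are hypothetical, and it assumes `pgrep` is available on the host.

```shell
#!/bin/sh
# Print "socket", "other" (exists but is not a socket), or "missing".
classify_path() {
  if [ -S "$1" ]; then
    echo "socket"
  elif [ -e "$1" ]; then
    echo "other"
  else
    echo "missing"
  fi
}

# Succeed (exit 0) when the socket file exists but no containerd process
# is running. Assumption: pgrep is installed; the default path is the one
# used in the steps above.
socket_looks_stale() {
  sock="${1:-/run/containerd/containerd.sock}"
  [ "$(classify_path "$sock")" = "socket" ] && ! pgrep -x containerd >/dev/null 2>&1
}
```

For example, `socket_looks_stale && echo "socket exists but containerd is not running"` reports only the case where removing the file is likely to be harmless.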
  4. Underlying Container Runtime (runc) Issues: Problems with the low-level container runtime, typically runc, can manifest as containerd errors. This could be due to runc being stuck, crashing, or having permission issues.

    • Diagnosis: Check journalctl for containerd errors mentioning runc or shim. Look for runc processes that are stuck or consuming excessive resources.
      sudo journalctl -u containerd -f | grep -i runc
      ps aux | grep '[r]unc' # The bracket keeps grep from matching its own process
      
    • Fix: If runc processes are stuck, you may need to kill them. Note that this aborts whatever container operation they were performing. Ensure runc is installed correctly and has the necessary permissions. Sometimes, a full system reboot can clear stuck runc states.
      # Find and kill stuck runc processes (-x matches the exact name "runc" only)
      sudo pkill -9 -x runc
      sudo systemctl restart containerd
      
    • Why it works: containerd orchestrates containers via runc. If runc is unresponsive or crashing, containerd cannot manage the container lifecycle, leading to errors. Killing stuck runc processes allows containerd to re-establish control.
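
One way to judge whether a runc process is actually stuck, rather than just busy, is to look at its process state. The sketch below treats "stuck" as uninterruptible sleep (state D), which is an assumption made here and not the only way a process can hang. The function name `runc_in_d_state` is hypothetical.

```shell
#!/bin/sh
# Read `ps -eo pid,stat,comm` output on stdin and print the PID of every
# process named exactly "runc" whose state starts with D
# (uninterruptible sleep, usually waiting on I/O).
runc_in_d_state() {
  awk 'NR > 1 && $3 == "runc" && $2 ~ /^D/ { print $1 }'
}
```

It could be run as `ps -eo pid,stat,comm | runc_in_d_state`. A process in state D typically does not act on SIGKILL until the blocking I/O returns, so PIDs listed here point toward a storage or kernel problem more than something `pkill` can resolve.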
  5. containerd Configuration Errors: Incorrect settings in containerd’s configuration file (/etc/containerd/config.toml) can cause it to fail on startup or behave erratically. This includes misconfigured network plugins, snapshotters, or registry mirrors.

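    • Diagnosis: Review /etc/containerd/config.toml for syntax errors, incorrect paths, or invalid values. Pay close attention to the [plugins] section, especially io.containerd.grpc.v1.cri and the runtime plugin (io.containerd.runtime.v1.linux in older releases; this legacy v1 runtime was removed in containerd 2.0 in favor of io.containerd.runtime.v2.task).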
    • Fix: Correct any syntax errors or invalid configurations. If unsure, revert to a known good configuration or the default.
      # Example: Ensure the Cgroup driver matches your system (e.g., systemd)
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
        SystemdCgroup = true
      
      After editing, restart containerd.
    • Why it works: containerd parses this file to configure its internal components and how it interacts with the system. Errors here prevent proper initialization or operation.
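
To check the cgroup driver setting from the example above without reading the whole file, a small line-based filter can help. This is a minimal sketch, not a TOML parser: it only matches uncommented `SystemdCgroup = ...` lines, and the function name `systemd_cgroup_values` is hypothetical.

```shell
#!/bin/sh
# Read containerd config text on stdin and print the value of every
# uncommented SystemdCgroup line, or "unset" if there is none.
# Assumption: one key per line, as in the generated default config.
systemd_cgroup_values() {
  awk -F= '
    /^[ \t]*SystemdCgroup[ \t]*=/ {
      value = $2
      sub(/#.*/, "", value)      # drop a trailing comment
      gsub(/[ \t]/, "", value)   # drop surrounding whitespace
      print value
      found = 1
    }
    END { if (!found) print "unset" }
  '
}
```

It can read the file directly (`systemd_cgroup_values < /etc/containerd/config.toml`) or the merged configuration (`containerd config dump | systemd_cgroup_values`). The second form also exits with an error if containerd cannot load the file, which doubles as a basic syntax check.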
  6. Disk Space Full: If the partition hosting /var/lib/containerd (or /var/lib/docker, if Docker runs on the same host and stores its data there) is full, containerd cannot write new image layers, container state, or logs.

    • Diagnosis: Check disk usage for the relevant partitions.
      df -h /var/lib/containerd
      df -h /var/lib/docker # If applicable
      
    • Fix: Free up disk space by removing unused images and exited containers, or by expanding the partition.
      sudo crictl rmi --prune # Remove all unused images
      sudo crictl ps -a -q --state exited | xargs -r sudo crictl rm # Remove exited containers
      
    • Why it works: containerd needs free space to store downloaded image layers, create container filesystems (using snapshotters), and write internal state. A full disk prevents these essential operations.
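
The disk check above can be turned into a simple threshold test, for example in a cron job or a health script. The sketch below parses POSIX-format `df -P` output. The function names and the 90 percent threshold in the usage note are illustrative assumptions, not containerd defaults.

```shell
#!/bin/sh
# Read `df -P <path>` output on stdin and print the used-capacity
# percentage of the first filesystem as a bare number (e.g. 95).
df_capacity() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage ($1) is at or above the limit ($2), both in percent.
over_threshold() {
  [ "$1" -ge "$2" ]
}
```

A possible use: `pct="$(df -P /var/lib/containerd | df_capacity)"; over_threshold "$pct" 90 && echo "containerd disk at ${pct}%"`. The `-P` flag keeps each filesystem on one line, which is what makes the fixed column position safe to rely on.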

If you have worked through all of the above and still see issues, the next error you are likely to encounter is "Failed to retrieve container status", which appears while the system reconciles its state after the underlying containerd service has been restored.
