containerd is dropping connections to the underlying container runtime, often runc, because the runtime’s management socket is unavailable or corrupted. This prevents containerd from starting, stopping, or even listing containers and images.
Here’s how to fix it, starting with the most common culprits:
- Resource Exhaustion (OOM Killer): The most frequent offender is the Linux Out-Of-Memory (OOM) killer. The kernel kills `containerd` or its associated processes (like `runc`) because the system is running out of RAM.
  - Diagnosis: Check `dmesg` for "Out of memory" messages and "Killed process" entries related to `containerd`, `runc`, or `containerd-shim`.
    ```shell
    sudo dmesg -T | grep -i 'oom\|killed'
    ```
  - Fix: Increase system RAM, reduce memory usage of other processes, or tune `containerd`'s memory limits if it is running within a cgroup. If running on Kubernetes, check pod resource requests/limits.
  - Why it works: The OOM killer terminates processes to free up memory. By providing more memory or reducing demand, the kernel has less reason to kill `containerd`'s components.
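The `grep` in the OOM diagnosis also matches unrelated kills. As a minimal sketch, the filter below keeps only "Killed process" lines that name a containerd component. The function name `oom_container_kills` is hypothetical, and the pattern assumes the kernel's usual `Killed process <pid> (<name>)` wording, which can differ between kernel versions.

```shell
#!/bin/sh
# Sketch: keep only kernel OOM-kill lines that name a containerd component.
# Reads kernel log text on stdin. The kernel truncates process names to
# 15 characters, so shim binaries typically show up as "containerd-shim".
# oom_container_kills is a hypothetical helper name, not a containerd tool.
oom_container_kills() {
  grep -E 'Killed process [0-9]+ \((containerd|runc)[^)]*\)'
}
```

Typical use would be `sudo dmesg -T | oom_container_kills`. No output means no matching kill is recorded in the current kernel ring buffer.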
- Corrupted `containerd` State/Storage: The directory where `containerd` stores its persistent data and images (`/var/lib/containerd` by default) can become corrupted due to disk errors, unexpected shutdowns, or filesystem issues.
  - Diagnosis: Look for I/O errors in `dmesg` or check filesystem integrity. Manually inspect the `containerd` data directory for unusual files or permissions.
    ```shell
    sudo find /var/lib/containerd -type f -size 0 -print  # Look for zero-byte files
    sudo journalctl -u containerd -f                      # Watch for immediate errors on restart
    ```
  - Fix: The safest fix is to stop `containerd`, back up the important data (e.g., `/var/lib/containerd/io.containerd.content.v1.content` and `/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs`), and then remove and reinitialize the `containerd` data. Containers can keep running under their shims while `containerd` is stopped, so stop workloads first. The snapshotter directory name depends on the snapshotter in use; `overlayfs` is assumed here.
    ```shell
    sudo systemctl stop containerd
    # Back up critical data first if needed
    sudo rm -rf /var/lib/containerd/io.containerd.content.v1.content        # Images are re-downloaded
    sudo rm -rf /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs  # Container filesystems are re-created
    sudo rm -rf /var/lib/containerd/io.containerd.metadata.v1.bolt          # Metadata database is re-initialized
    sudo systemctl start containerd
    ```
  - Why it works: Corrupted data structures prevent `containerd` from correctly accessing its internal state or image layers. Re-initializing clears these corruptions, forcing `containerd` to rebuild its necessary files and re-download images/layers.
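The fix mentions a backup but does not show one. One low-risk option, sketched here as an assumption rather than an official procedure, is to rename the directory aside instead of deleting it, so the old data stays available for inspection. `backup_dir_aside` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: move a directory aside under a timestamped name instead of
# deleting it. Prints the new path on success.
# backup_dir_aside is a hypothetical helper, not part of containerd.
backup_dir_aside() {
  src=${1%/}
  [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
  dest="$src.bak.$(date +%Y%m%d%H%M%S)"
  mv -- "$src" "$dest" && printf '%s\n' "$dest"
}
```

Run as root with `containerd` stopped, `backup_dir_aside /var/lib/containerd` leaves an empty slot for `containerd` to recreate on start. The renamed copy stays on the same filesystem, so it does not free any disk space until it is removed.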
- Stale `containerd` Socket or PID File: `containerd` relies on a Unix domain socket for communication. If the process crashes uncleanly, its socket file or PID file might remain and interfere with a new instance starting or binding to the socket.
  - Diagnosis: Check for the existence of the `containerd` socket file and PID file (the PID file is present only if your service setup writes one).
    ```shell
    ls -l /run/containerd/containerd.sock
    ls -l /run/containerd/containerd.pid
    ```
  - Fix: Stop `containerd` if it is running (even if it is hanging), and then manually remove the stale socket and PID files before starting it again.
    ```shell
    sudo systemctl stop containerd
    sudo rm -f /run/containerd/containerd.sock
    sudo rm -f /run/containerd/containerd.pid
    sudo systemctl start containerd
    ```
  - Why it works: A new `containerd` process may fail to create its socket or record its PID if stale files from a previous instance are in the way. Removing them allows the new process to correctly establish its communication channel and record its presence.
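Before deleting a PID file, it helps to confirm that it really is stale. The sketch below classifies a PID file by checking whether the recorded process still exists. `pidfile_state` is a hypothetical helper. It relies on `kill -0`, which also fails for live processes owned by another user, so it should be run as root.

```shell
#!/bin/sh
# Sketch: classify a PID file as missing, running, or stale.
# pidfile_state is a hypothetical helper, not part of containerd.
pidfile_state() {
  file=$1
  [ -f "$file" ] || { echo missing; return 0; }
  # Keep digits only, so stray whitespace or newlines are ignored.
  pid=$(tr -cd '0-9' < "$file")
  if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
    echo running
  else
    echo stale
  fi
}
```

For example, `pidfile_state /run/containerd/containerd.pid` printing `stale` suggests the file is safe to remove, while `running` means a process with that PID still exists.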
- Underlying Container Runtime (`runc`) Issues: Problems with the low-level container runtime, typically `runc`, can manifest as `containerd` errors. This could be due to `runc` being stuck, crashing, or having permission issues.
  - Diagnosis: Check `journalctl` for `containerd` errors mentioning `runc` or `shim`. Look for `runc` processes that are stuck or consuming excessive resources.
    ```shell
    sudo journalctl -u containerd -f | grep -i runc
    ps aux | grep runc
    ```
  - Fix: If `runc` processes are stuck, you may need to kill them. Ensure `runc` is installed correctly and has the necessary permissions. Sometimes, a full system reboot can clear stuck `runc` states.
    ```shell
    # Find and kill stuck runc processes.
    # Note: this kills every process whose name matches "runc", not only stuck ones.
    sudo pkill -9 runc
    sudo systemctl restart containerd
    ```
  - Why it works: `containerd` orchestrates containers via `runc`. If `runc` is unresponsive or crashing, `containerd` cannot manage the container lifecycle, leading to errors. Killing stuck `runc` processes allows `containerd` to re-establish control.
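Killing every `runc` process is a blunt instrument. As a rough sketch, the filter below picks out only `runc` processes in uninterruptible sleep (`D`) or zombie (`Z`) state from `ps` output. `stuck_runc` is a hypothetical helper, and treating `D`/`Z` as "stuck" is an assumption; a process can be briefly in `D` state during normal I/O.

```shell
#!/bin/sh
# Sketch: print PIDs of runc processes in D (uninterruptible sleep) or
# Z (zombie) state. Expects `ps -eo pid,stat,etime,comm` output on stdin.
# stuck_runc is a hypothetical helper, not part of containerd or runc.
stuck_runc() {
  awk 'NR > 1 && $4 ~ /runc/ && $2 ~ /^[DZ]/ { print $1 }'
}
```

Usage would look like `ps -eo pid,stat,etime,comm | stuck_runc`. Note that `kill -9` cannot terminate a process in `D` state or reap a zombie, which is why a reboot is sometimes the only remedy.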
- `containerd` Configuration Errors: Incorrect settings in `containerd`'s configuration file (`/etc/containerd/config.toml`) can cause it to fail on startup or behave erratically. This includes misconfigured network plugins, snapshotters, or registry mirrors.
  - Diagnosis: Review `/etc/containerd/config.toml` for syntax errors, incorrect paths, or invalid values. Pay close attention to the `[plugins]` section, especially `io.containerd.grpc.v1.cri` and `io.containerd.runtime.v1.linux`.
  - Fix: Correct any syntax errors or invalid configurations. If unsure, revert to a known good configuration or the default.
    ```toml
    # Example: Ensure the Cgroup driver matches your system (e.g., systemd)
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
      SystemdCgroup = true
    ```
    After editing, restart `containerd`.
  - Why it works: `containerd` parses this file to configure its internal components and how it interacts with the system. Errors here prevent proper initialization or operation.
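A quick way to confirm the cgroup driver setting from the example is a text check on the config file. This is only a sketch: `systemd_cgroup_enabled` is a hypothetical helper, and it matches the line textually without checking which TOML table the line belongs to.

```shell
#!/bin/sh
# Sketch: succeed if the file has an uncommented "SystemdCgroup = true" line.
# This is a plain text match, not a TOML parse; it does not check which
# table the key sits under.
# systemd_cgroup_enabled is a hypothetical helper, not part of containerd.
systemd_cgroup_enabled() {
  grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true([[:space:]]|#|$)' "$1"
}
```

For instance, `systemd_cgroup_enabled /etc/containerd/config.toml && echo "systemd cgroup driver enabled"`. For a full syntax check, `containerd config dump` prints the configuration as `containerd` parses it and reports an error if the file cannot be loaded.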
- Disk Space Full: If the disk partition hosting `/var/lib/containerd` or `/var/lib/docker` (if using Docker's overlayfs with containerd) is full, `containerd` cannot write new image layers, container states, or logs.
  - Diagnosis: Check disk usage for the relevant partitions.
    ```shell
    df -h /var/lib/containerd
    df -h /var/lib/docker  # If applicable
    ```
  - Fix: Free up disk space by removing old images, unused containers, or by expanding the partition.
    ```shell
    sudo crictl rmi --prune                                # Remove images not used by any container
    sudo crictl rm $(sudo crictl ps -a --state exited -q)  # Remove exited containers
    ```
  - Why it works: `containerd` needs free space to store downloaded image layers, create container filesystems (using snapshotters), and write internal state. A full disk prevents these essential operations.
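The disk check can be scripted so it is easy to run from a monitoring job. The sketch below reads the usage percentage from POSIX-format `df -P` output and compares it with a threshold. Both function names are hypothetical, and the 90% figure in the usage note is an arbitrary example, not a containerd recommendation.

```shell
#!/bin/sh
# Sketch: report how full the filesystem holding a path is.
# disk_usage_pct and disk_over_threshold are hypothetical helper names.

# Print the usage percentage (digits only) for the filesystem containing $1.
disk_usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# Succeed if usage for path $1 is at or above $2 percent.
# Returns 2 if the usage could not be determined.
disk_over_threshold() {
  pct=$(disk_usage_pct "$1") || return 2
  [ -n "$pct" ] || return 2
  [ "$pct" -ge "$2" ]
}
```

For example, `disk_over_threshold /var/lib/containerd 90 && echo "containerd disk nearly full"`. Note that `df` reports block usage only; a filesystem can also run out of inodes, which `df -i` shows.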
Once you have worked through all of the above, you may next see `Failed to retrieve container status` errors while the system reconciles its state after the underlying `containerd` service has been restored.