The containerd daemon, which runs and manages the containers on your Kubernetes nodes, has stopped responding, preventing new pods from starting and existing ones from being managed.

Here are the most common reasons this happens and how to fix them:

1. containerd Service Crashed Due to Resource Exhaustion

Diagnosis: Check the system journal for containerd errors.

sudo journalctl -u containerd -n 500 --no-pager

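Look for signs that the service was killed or restarted, such as "Main process exited, code=killed, status=9/KILL". Out-of-memory (OOM) killer events are recorded in the kernel log rather than in the containerd unit's log, so check there too with sudo journalctl -k --no-pager | grep -i "out of memory". Also check overall system resource usage: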

top -bn1 | grep "Cpu(s)\|Mem"

If repeated samples show CPU pinned near 100% or memory nearly exhausted, resource pressure is likely the culprit; a single snapshot can be misleading.
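To pick OOM-killer events for one specific process out of the kernel log, a small filter helps. This is a minimal sketch: oom_kills_for is a hypothetical helper name, and it assumes the kernel's usual "Killed process <PID> (<name>)" wording, where <name> is the process name truncated to 15 characters (so containerd-shim-runc-v2 appears as containerd-shim).

```shell
#!/usr/bin/env bash
# oom_kills_for NAME: read kernel-log text on stdin and print the lines in
# which the OOM killer terminated a process whose name is exactly NAME.
# Hypothetical helper; NAME is placed inside an extended regex, so pass a
# plain process name such as "containerd" or "containerd-shim".
oom_kills_for() {
  local name="$1"
  # Matches both "Out of memory: Killed process 812 (containerd) ..." and the
  # "Memory cgroup out of memory: ..." variant. "|| true" keeps an empty
  # result from being reported as an error.
  grep -E "Killed process [0-9]+ \(${name}\)" || true
}

# Usage on a node:
#   sudo journalctl -k --no-pager | oom_kills_for containerd
#   sudo journalctl -k --no-pager | oom_kills_for containerd-shim
```

Matching the exact name in parentheses keeps a search for containerd from also returning its shim processes, which are worth checking separately.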

Fix:

  • Increase Node Resources: If possible, scale up your node’s CPU or memory.
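  • Identify Resource Hogs: Use crictl stats to see per-container CPU and memory usage for the containers Kubernetes runs through containerd, and top or htop for everything else. docker stats only covers containers started by Docker itself, so it will not show Kubernetes workloads on a containerd node. The culprit could be containerd or its containerd-shim processes, a rogue application container, or a system daemon.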
  • Limit Workloads, Not the Daemon: /etc/containerd/config.toml has no settings that cap the CPU or memory of the containerd daemon itself (the [plugins."io.containerd.grpc.v1.cri".registry] section, for instance, only configures image registries). Cap the containers it runs instead, through Kubernetes resource requests and limits, and reserve capacity for system daemons with the kubelet's systemReserved and kubeReserved settings.
    
    If the daemon itself is being OOM-killed, treat it as a node-level resource shortage or a possible containerd bug. The upstream containerd.service unit sets OOMScoreAdjust=-999 so the kernel prefers other victims, which means containerd being killed usually indicates severe memory exhaustion.
  • Restart containerd: After addressing resource issues, restart the service.
    sudo systemctl restart containerd
    
    This allows containerd to re-initialize with available resources.
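If neither htop nor ps is installed on a minimal node image, the "who is using the memory" question can be answered straight from /proc. This is a minimal sketch: top_rss is a hypothetical helper name, and it reports resident memory (VmRSS, in kB) only, not CPU.

```shell
#!/usr/bin/env bash
# top_rss [N]: print the N processes (default 10) using the most resident
# memory, as "<VmRSS in kB> <process name>", read directly from /proc.
# Kernel threads have no VmRSS line and are skipped.
top_rss() {
  local n="${1:-10}" f
  for f in /proc/[0-9]*/status; do
    # A process may exit between the glob and the read; ignore those errors.
    awk '/^Name:/  { sub(/^Name:[[:space:]]*/, ""); name = $0 }
         /^VmRSS:/ { rss = $2 }
         END       { if (rss != "") print rss, name }' "$f" 2>/dev/null || true
  done | sort -rn | head -n "$n"
}

# Usage: top_rss 15   # look for containerd, containerd-shim, or a runaway app
```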

2. Corrupted containerd State or Configuration

Diagnosis: Examine the containerd logs for specific errors related to state files or configuration parsing.

sudo journalctl -u containerd -n 500 --no-pager

Look for errors like "failed to load state," "invalid configuration," or file access permission issues.

Fix:

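  • Reset containerd State (Use with Caution): If you suspect state corruption, first drain the node so Kubernetes moves its pods elsewhere (for example, kubectl drain <NODE_NAME> --ignore-daemonsets). Then stop containerd, move its data directory aside as a backup, and start it again.
    sudo systemctl stop containerd
    sudo mv /var/lib/containerd /var/lib/containerd.bak_$(date +%Y%m%d_%H%M%S)
    sudo systemctl start containerd
    
    containerd then starts with an empty content store and metadata database: every image must be re-pulled, and any containers still on the node lose their metadata and must be recreated by Kubernetes. Once the node is healthy, run kubectl uncordon <NODE_NAME> and delete the backup.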
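  • Validate containerd Configuration: Ensure /etc/containerd/config.toml is syntactically correct and follows the expected schema. containerd config dump loads the configuration and prints the parsed, merged result, and fails if the file cannot be parsed. To check an edited copy before installing it, pass its path with the global --config flag (sudo containerd --config /path/to/candidate.toml config dump).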
    sudo containerd config dump
    
    If there are errors, correct the config.toml file. A common mistake is incorrect TOML syntax or invalid plugin configurations.
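  • Check File Permissions: containerd normally runs as root, so there is no dedicated containerd user. Ensure /var/lib/containerd and /run/containerd (also reachable as /var/run/containerd) are owned by root and writable, and that the filesystem has not been remounted read-only after disk errors.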

3. Disk Space Full on /var/lib/containerd or /var/lib/docker (if used)

Diagnosis: Check available disk space on the partitions where containerd stores its data and where images are pulled.

df -h /var/lib/containerd
df -h /var/lib/docker # If you are using Docker as the runtime alongside containerd or transitioning

If these partitions are at 100% usage, or have run out of inodes (check with df -i), containerd cannot write new state or download new images.
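Because an exhausted inode table blocks writes just as a full disk does, it can help to check both at once. This is a minimal sketch: check_disk_usage is a hypothetical helper, and it assumes GNU df (for -P combined with -i). It prints only the paths whose filesystem is at or above the given percentage.

```shell
#!/usr/bin/env bash
# check_disk_usage THRESHOLD PATH...: for each existing PATH, print a line if
# the filesystem holding it has block usage or inode usage >= THRESHOLD percent.
check_disk_usage() {
  local threshold="$1"
  shift
  local path blocks inodes
  for path in "$@"; do
    if [ ! -e "$path" ]; then
      echo "skip: $path does not exist" >&2
      continue
    fi
    # df -P prints a header line, then one line per filesystem;
    # field 5 is Use% (or IUse% with -i).
    blocks=$(df -P "$path" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
    inodes=$(df -P -i "$path" 2>/dev/null | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
    # Some filesystems report "-" for inode usage; treat non-numbers as 0.
    case "$blocks" in '' | *[!0-9]*) blocks=0 ;; esac
    case "$inodes" in '' | *[!0-9]*) inodes=0 ;; esac
    if [ "$blocks" -ge "$threshold" ] || [ "$inodes" -ge "$threshold" ]; then
      echo "$path: blocks ${blocks}% used, inodes ${inodes}% used"
    fi
  done
}

# Usage: check_disk_usage 90 /var/lib/containerd /var/lib/docker /var/log
```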

Fix:

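  • Clean Up Unused Images and Containers: On a Kubernetes node, use crictl, which talks to containerd over the CRI socket and sees the images and containers Kubernetes created. Those live in containerd's k8s.io namespace, so plain ctr commands will not show them unless you pass -n k8s.io.
    sudo crictl rmi --prune
    sudo crictl rm $(sudo crictl ps -a -q --state exited)
    
    If Docker is also installed, its storage under /var/lib/docker is separate and can be reclaimed with the command below (note that --volumes also deletes unused volumes):
    sudo docker system prune -a --volumes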
    
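  • Remove Old Log Files: Check /var/log and other system directories for large, old log files that can be safely deleted or rotated. If the systemd journal is large, sudo journalctl --vacuum-size=500M removes the oldest archived journal files until they fit within that size.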
  • Expand Disk/Partition: If cleanup isn’t enough, you’ll need to resize the disk or partition.

4. Incompatible containerd Version or Kernel Mismatch

Diagnosis: Check the containerd version and compare it with the Kubernetes version requirements. Also, check kernel version and containerd compatibility.

containerd --version
uname -r

Refer to the Kubernetes and containerd documentation for version compatibility matrices. Kernel features matter too: for example, nodes that boot with cgroup v2 (stat -fc %T /sys/fs/cgroup prints cgroup2fs) need containerd 1.4 or later.
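One concrete rule on the Kubernetes side: Kubernetes 1.26 and later require containerd 1.6.0 or newer, because older containerd releases only serve the removed v1alpha2 CRI API. The sketch below compares versions in the shell. containerd_version and version_ge are hypothetical helpers; the sketch assumes containerd --version prints the version as its third field (upstream builds print something like "containerd github.com/containerd/containerd v1.7.12 <commit>"), possibly followed by a packaging suffix, and it relies on GNU sort -V.

```shell
#!/usr/bin/env bash
# containerd_version: read `containerd --version` output on stdin and print
# the bare version number, e.g. "v1.7.12" -> "1.7.12". Packaging suffixes
# starting with "-", "+" or "~" after the number are dropped.
containerd_version() {
  awk '{ print $3 }' | sed -E 's/^v//; s/[-+~].*$//'
}

# version_ge A B: succeed if dotted version A >= B, using GNU sort -V ordering.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage on a node (Kubernetes 1.26+ needs containerd >= 1.6.0):
#   v=$(containerd --version | containerd_version)
#   version_ge "$v" 1.6.0 || echo "containerd $v is too old for Kubernetes 1.26+"
```

sort -V is used instead of a plain string comparison so that 1.10.0 correctly ranks above 1.6.0.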

Fix:

  • Upgrade/Downgrade containerd: If an incompatibility is found, upgrade or downgrade containerd to a version that is compatible with your Kubernetes version and kernel. Follow the official installation guides for your distribution.
  • Upgrade Kernel: If containerd requires a newer kernel feature, consider upgrading your node’s kernel.

5. Network Issues Preventing Communication with the Kubernetes API Server

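Diagnosis: containerd itself never talks to the Kubernetes API server; the kubelet does, and it drives containerd over the CRI gRPC socket. If containerd is running but the node reports NotReady and new pods are not scheduled to it, check the kubelet logs for API server errors:

sudo journalctl -u kubelet -n 500 --no-pager

Look for messages like "connection refused," "i/o timeout," or TLS handshake and x509 certificate errors when the kubelet contacts the API server. Network errors in the containerd log, by contrast, usually come from image pulls and point to connectivity between the node and the container registry.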

Fix:

  • Check Node Network Connectivity: Ensure the node can reach the Kubernetes API server IP and port (usually 6443).
    curl -k https://<KUBERNETES_API_SERVER_IP>:6443/version
    
  • Verify Firewall Rules: Ensure no firewalls (node-level iptables/firewalld or network firewalls) are blocking traffic from the node to the API server.
  • Check the Runtime Endpoint: The API server address is not part of /etc/containerd/config.toml; it lives in the kubelet's kubeconfig (/etc/kubernetes/kubelet.conf on kubeadm nodes). On the containerd side, what matters is that the kubelet's container runtime endpoint matches containerd's gRPC socket, typically the Unix socket at /run/containerd/containerd.sock. Running sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock info confirms that containerd answers on that socket.
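When curl is not installed, or to tell a TCP-level block (routing, firewall) apart from a TLS or certificate problem, bash's /dev/tcp redirection can test the raw connection. This is a minimal sketch: tcp_reachable is a hypothetical helper, and it assumes bash built with /dev/tcp support (as on common distributions) and the coreutils timeout command.

```shell
#!/usr/bin/env bash
# tcp_reachable HOST PORT [SECONDS]: succeed if a TCP connection to HOST:PORT
# opens within SECONDS (default 5). HOST and PORT are passed as positional
# arguments to the inner bash so they are never re-parsed as shell code.
tcp_reachable() {
  local host="$1" port="$2" secs="${3:-5}"
  timeout "$secs" bash -c ': > "/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Usage (replace the placeholder with your API server address):
#   if tcp_reachable <KUBERNETES_API_SERVER_IP> 6443; then
#     echo "TCP OK: if curl still fails, look at TLS or certificates"
#   else
#     echo "TCP refused or timed out: look at routing and firewalls"
#   fi
```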

6. containerd Plugin or Runtime Issues (e.g., runc)

Diagnosis: containerd relies on runtimes like runc to create containers. If runc is misconfigured or corrupted, containerd might fail to start or create containers.

sudo journalctl -u containerd -n 500 --no-pager

Look for errors mentioning runc, failed to create shim task, or similar low-level runtime errors.

Fix:

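  • Reinstall or Update runc: If runc is the problem, try reinstalling or updating it. If containerd came from Docker's containerd.io package, runc is bundled inside that package, so reinstall containerd.io instead of a separate runc package.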
    # For Debian/Ubuntu
    sudo apt-get update
    sudo apt-get install --reinstall runc
    
    # For CentOS/RHEL/Fedora
    sudo yum reinstall runc # or dnf reinstall runc
    
    Then restart containerd.
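  • Check containerd Configuration for Runtime: In a version 2 /etc/containerd/config.toml, the runc runtime is defined under [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc], normally with runtime_type = "io.containerd.runc.v2". A custom path to the runc executable goes in BinaryName under [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]; when it is empty or unset, runc is looked up on the PATH. The default is usually fine.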
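The same runc options table also holds SystemdCgroup, and Kubernetes requires containerd's cgroup driver to match the kubelet's cgroupDriver: SystemdCgroup = true pairs with cgroupDriver: systemd, and false pairs with cgroupfs. The sketch below compares the two. check_cgroup_driver is a hypothetical helper; it assumes the kubelet config file is at /var/lib/kubelet/config.yaml (as on kubeadm nodes), that only the runc runtime sets SystemdCgroup in the dumped config, and that an absent key means false for containerd and cgroupfs for the kubelet.

```shell
#!/usr/bin/env bash
# check_cgroup_driver CONTAINERD_DUMP KUBELET_CONFIG: compare containerd's runc
# SystemdCgroup setting with the kubelet's cgroupDriver and report whether
# they agree. Returns 1 on a mismatch.
check_cgroup_driver() {
  local dump="$1" kubelet_cfg="$2" systemd_cgroup driver
  # First "SystemdCgroup = true|false" line in the dumped TOML.
  systemd_cgroup=$( { grep -E '^[[:space:]]*SystemdCgroup[[:space:]]*=' "$dump" || true; } \
    | head -n 1 | sed -E 's/.*=[[:space:]]*//; s/[[:space:]]+$//')
  # Top-level "cgroupDriver: systemd" in the KubeletConfiguration YAML.
  driver=$(awk '/^cgroupDriver:/ { gsub(/"/, "", $2); print $2 }' "$kubelet_cfg")
  # Assumed defaults when a key is absent: containerd false, kubelet cgroupfs.
  : "${systemd_cgroup:=false}" "${driver:=cgroupfs}"
  if { [ "$systemd_cgroup" = true ] && [ "$driver" = systemd ]; } ||
     { [ "$systemd_cgroup" = false ] && [ "$driver" = cgroupfs ]; }; then
    echo "OK: SystemdCgroup=$systemd_cgroup, cgroupDriver=$driver"
  else
    echo "MISMATCH: SystemdCgroup=$systemd_cgroup, cgroupDriver=$driver"
    return 1
  fi
}

# Usage on a kubeadm node:
#   check_cgroup_driver <(sudo containerd config dump) /var/lib/kubelet/config.yaml
```

Reading the output of containerd config dump, rather than the raw config.toml, means the check sees the effective value including defaults that the file does not spell out.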

After resolving these issues and restarting containerd, you might encounter issues with the kubelet service if it was also affected by the containerd failure or if its configuration is now out of sync.

sudo systemctl status kubelet
