The containerd daemon, essential for orchestrating containers on your Kubernetes nodes, has stopped responding, preventing new pods from starting and existing ones from being managed.
Here are the most common reasons this happens and how to fix them:
1. containerd Service Crashed Due to Resource Exhaustion
Diagnosis: Check the system journal for containerd errors.
sudo journalctl -u containerd -n 500 --no-pager
Look for messages indicating memory or CPU spikes, or out-of-memory (OOM) killer events. Also, check overall system resource usage:
top -bn1 | grep "Cpu(s)\|Mem"
If CPU is consistently near 100% or memory is nearly full, this is likely the culprit.
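To make the OOM check repeatable, the log scan can be wrapped in a small helper. This is a minimal sketch, not a definitive detector: the function name count_oom_events is made up for illustration, and the patterns are an assumption based on the wording the Linux kernel typically uses for OOM-killer messages.

```shell
# Count lines that look like kernel OOM-killer activity.
# Reads log text on stdin and prints the number of matching lines.
count_oom_events() {
    # grep -c prints the match count; "|| true" keeps a zero count from
    # being reported as a failure (grep exits 1 when nothing matches).
    grep -ciE 'out of memory|oom-kill' || true
}
```

OOM-killer messages come from the kernel rather than from the containerd unit, so feed it the kernel log, for example: sudo journalctl -k -n 500 --no-pager | count_oom_events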
Fix:
- Increase Node Resources: If possible, scale up your node’s CPU or memory.
- Identify Resource Hogs: Use htop (or docker stats, if Docker is also installed and running) to find processes consuming excessive resources. This could be other containerd processes, a rogue application container, or even system daemons.
- Tune containerd Configuration: /etc/containerd/config.toml has no simple setting that caps the CPU or memory of the daemon itself; you can only limit the resources available to the containers it runs. If the daemon itself is being OOM-killed, it’s usually a system-level resource issue or a bug in containerd.
- Restart containerd: After addressing resource issues, restart the service.
sudo systemctl restart containerd
This allows containerd to re-initialize with available resources.
2. Corrupted containerd State or Configuration
Diagnosis: Examine the containerd log file for specific errors related to state files or configuration parsing.
sudo journalctl -u containerd -n 500 --no-pager
Look for errors like "failed to load state," "invalid configuration," or file access permission issues.
Fix:
- Reset containerd State (Use with Caution): If you suspect state corruption, you can try stopping containerd, moving its state directory aside as a backup, and then restarting.
sudo systemctl stop containerd
sudo mv /var/lib/containerd /var/lib/containerd.bak_$(date +%Y%m%d_%H%M%S)
sudo systemctl start containerd
This will cause containerd to rebuild its state from scratch. Any running containers will likely be terminated and will need to be recreated by Kubernetes, and images will have to be pulled again.
- Validate containerd Configuration: Ensure /etc/containerd/config.toml is syntactically correct and follows the expected schema. You can use containerd config dump to see the parsed configuration.
sudo containerd config dump
If there are errors, correct the config.toml file. A common mistake is incorrect TOML syntax or invalid plugin configurations.
- Check File Permissions: Ensure the user that containerd runs as (root by default) has read/write access to /var/lib/containerd and /var/run/containerd.
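The move-aside step is easy to get wrong under pressure, so it can help to wrap it in a function that refuses to run when the directory is missing. The following is a sketch under stated assumptions: backup_dir is a hypothetical helper name, and it only renames the directory; stopping and starting containerd remains a separate step.

```shell
# Rename a directory to <dir>.bak_<timestamp> and print the new path.
# Returns non-zero without touching anything if the directory is missing.
backup_dir() {
    dir=${1%/}                       # strip a trailing slash, if any
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    dest="${dir}.bak_$(date +%Y%m%d_%H%M%S)"
    mv -- "$dir" "$dest" && printf '%s\n' "$dest"
}
```

Run as root with containerd stopped, backup_dir /var/lib/containerd prints the backup location, which you can keep until the node is confirmed healthy.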
3. Disk Space Full on /var/lib/containerd or /var/lib/docker (if used)
Diagnosis: Check available disk space on the partitions where containerd stores its data and where images are pulled.
df -h /var/lib/containerd
df -h /var/lib/docker # If you are using Docker as the runtime alongside containerd or transitioning
If these partitions are at 100% usage, containerd cannot write new state or download new images.
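If you want to check this from a script rather than by eye, the df output can be reduced to a single number. This is a minimal sketch: the function names and the default threshold of 90 are illustrative assumptions, not values taken from containerd or Kubernetes.

```shell
# Print the usage percentage (integer, no % sign) of the filesystem
# that holds the given path.
disk_usage_pct() {
    # df -P forces the portable one-line-per-filesystem format;
    # column 5 is the capacity, e.g. "87%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when usage is at or above the threshold (default 90).
# Returns 2 when the usage could not be determined.
disk_nearly_full() {
    pct=$(disk_usage_pct "$1" 2>/dev/null)
    [ -n "$pct" ] || return 2          # path missing or df failed
    [ "$pct" -ge "${2:-90}" ]
}
```

For example, disk_nearly_full /var/lib/containerd 95 && echo "clean up before restarting containerd" flags the node only once usage reaches 95%.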
Fix:
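- Clean Up Unused Images/Containers: On a Kubernetes node, use crictl to remove images that no container references and containers that have exited. ctr (containerd’s native client) can also prune images, but it must be pointed at the k8s.io namespace, which is where the kubelet’s images live.
sudo crictl rmi --prune
sudo crictl rm $(sudo crictl ps -a -q --state exited)
sudo ctr -n k8s.io images prune --all # only available in newer ctr releases
If you have Docker installed, you might also use:
sudo docker system prune -a --volumes
- Remove Old Log Files: Check /var/log and other system directories for large, old log files that can be safely deleted or rotated.
- Expand Disk/Partition: If cleanup isn’t enough, you’ll need to resize the disk or partition.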
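To find the log files worth rotating or deleting, a short find wrapper is usually enough. This is a sketch, assuming a find that accepts size suffixes such as M and G (GNU and BSD find both do); large_files and its defaults are hypothetical names chosen for illustration.

```shell
# List regular files under a directory that exceed a size threshold,
# largest first. Output is "<size in KiB><TAB><path>" per line.
large_files() {
    # -xdev keeps the search on one filesystem; permission errors
    # from find are discarded.
    find "${1:-/var/log}" -xdev -type f -size "+${2:-100M}" \
        -exec du -k {} + 2>/dev/null | sort -rn
}
```

Running sudo sh -c with the function sourced, or calling large_files /var/log 50M as root, lists candidates; review each file before removing it.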
4. Incompatible containerd Version or Kernel Mismatch
Diagnosis: Check the containerd version and compare it with the Kubernetes version requirements. Also, check kernel version and containerd compatibility.
containerd --version
uname -r
Refer to the Kubernetes and containerd documentation for version compatibility matrices. Sometimes, a very new kernel feature might not be supported by an older containerd, or vice-versa.
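Once you know the minimum version you need, the comparison can be scripted. The sketch below assumes GNU sort (for the -V version sort) and that containerd --version prints its fields in the order "containerd, package, version, revision"; the function names are illustrative, and any minimum version you pass in must come from the compatibility matrices, not from this example.

```shell
# Extract the version from `containerd --version` output read on stdin.
# Assumed form: containerd <package> <version> <revision>
containerd_version() {
    awk '{ v = $3; sub(/^v/, "", v); print v }'
}

# Succeed when version $1 is greater than or equal to version $2.
version_ge() {
    # After a version sort, the smaller (or equal) version comes first.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

A typical call would be version_ge "$(containerd --version | containerd_version)" 1.6.0 || echo "containerd too old", where 1.6.0 is only a placeholder.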
Fix:
- Upgrade/Downgrade containerd: If an incompatibility is found, upgrade or downgrade containerd to a version that is compatible with your Kubernetes version and kernel. Follow the official installation guides for your distribution.
- Upgrade Kernel: If containerd requires a newer kernel feature, consider upgrading your node’s kernel.
5. Network Issues Preventing Communication with the Kubernetes API Server
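Diagnosis: containerd itself does not talk to the Kubernetes API server; the kubelet does, and it drives containerd over the local CRI socket. If containerd is running but the kubelet can’t register the node with the API server, pods won’t be scheduled. Check the kubelet logs for API server connection errors and the containerd logs for gRPC (CRI) errors.
sudo journalctl -u kubelet -n 500 --no-pager
sudo journalctl -u containerd -n 500 --no-pager
Look for messages about failing to dial the API server, "connection refused," or TLS handshake errors.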
Fix:
- Check Node Network Connectivity: Ensure the node can reach the Kubernetes API server IP and port (usually 6443).
curl -k https://<KUBERNETES_API_SERVER_IP>:6443/version
- Verify Firewall Rules: Ensure no firewalls (node-level iptables/firewalld or network firewalls) are blocking traffic from the node to the API server.
- Check containerd CRI Configuration: containerd’s configuration (/etc/containerd/config.toml) does not reference the Kubernetes API endpoint, so there is nothing there to point at the API server. What matters is that the CRI plugin is not listed under disabled_plugins and that the node’s kubelet can communicate with containerd’s gRPC endpoint, typically a Unix socket at /run/containerd/containerd.sock. Installers such as kubeadm normally set the kubelet’s runtime endpoint to this socket.
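Before digging into the kubelet configuration, it is worth confirming that the socket exists at all. This is a minimal sketch: check_cri_socket is a hypothetical helper, and the default path is the typical location mentioned in this section, which your installation may override.

```shell
# Succeed when the given path exists and is a Unix domain socket.
# Defaults to the typical containerd socket location.
check_cri_socket() {
    sock=${1:-/run/containerd/containerd.sock}
    if [ -S "$sock" ]; then
        echo "ok: $sock is a socket"
    else
        echo "missing or not a socket: $sock" >&2
        return 1
    fi
}
```

If the socket is present, sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock info asks containerd for its status through the same CRI interface the kubelet uses.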
6. containerd Plugin or Runtime Issues (e.g., runc)
Diagnosis: containerd relies on runtimes like runc to create containers. If runc is misconfigured or corrupted, containerd might fail to start or create containers.
sudo journalctl -u containerd -n 500 --no-pager
Look for errors mentioning runc, failed to create shim task, or similar low-level runtime errors.
Fix:
- Reinstall or Update runc: If runc is the problem, try reinstalling or updating it.
# For Debian/Ubuntu
sudo apt-get update
sudo apt-get install --reinstall runc
# For CentOS/RHEL/Fedora
sudo yum reinstall runc # or dnf reinstall runc
Then restart containerd.
- Check containerd Configuration for Runtime: Ensure /etc/containerd/config.toml correctly specifies the path to the runc executable under [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]. The default is usually fine.
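A quick sanity check before reinstalling anything is to confirm that the runtime binaries are actually present and executable. The helper below is a sketch; check_binary is an illustrative name, and it only inspects PATH, so a runtime configured with an explicit path outside PATH needs to be checked by hand.

```shell
# Print the full path of a binary if it is on PATH and executable;
# otherwise explain the problem on stderr and return non-zero.
check_binary() {
    bin_path=$(command -v "$1") || { echo "$1: not found on PATH" >&2; return 1; }
    [ -x "$bin_path" ] || { echo "$bin_path: not executable" >&2; return 1; }
    printf '%s\n' "$bin_path"
}
```

For example, check_binary runc && check_binary containerd-shim-runc-v2 verifies both the runtime and the shim that containerd launches for it.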
After resolving these issues and restarting containerd, you might encounter issues with the kubelet service if it was also affected by the containerd failure or if its configuration is now out of sync.
sudo systemctl status kubelet
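Services can take a few seconds to settle after a restart, so a single status check may be misleading. One way to handle that is a small retry loop; this is a generic sketch, with wait_until as a made-up helper name and the attempt count and delay left for you to choose.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
    attempts=$1; delay=$2; shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        # Sleep between attempts, but not after the last one.
        [ "$i" -lt "$attempts" ] && sleep "$delay"
        i=$((i + 1))
    done
    return 1
}
```

For example, wait_until 10 3 systemctl is-active --quiet containerd && wait_until 10 3 systemctl is-active --quiet kubelet waits up to roughly 30 seconds for each service to report active.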