The containerd daemon is failing to accept new connections from the Kubernetes kubelet, causing nodes to become unhealthy.

Common Causes & Fixes

  1. Outdated containerd Version with Known Bugs:

    • Diagnosis: Check the installed containerd version on your nodes.
      sudo ctr version
      
      Compare this to the release notes of newer containerd versions for critical bug fixes related to API stability or resource handling.
    • Fix: Upgrade containerd to a stable, supported version. For example, to upgrade to v1.6.24:
      # Download the binary
      wget https://github.com/containerd/containerd/releases/download/v1.6.24/containerd-1.6.24.linux-amd64.tar.gz
      # Extract the binaries into /usr/local/bin
      sudo tar -C /usr/local -xzf containerd-1.6.24.linux-amd64.tar.gz
      # Restart containerd
      sudo systemctl restart containerd
      
    • Why it works: Newer versions often contain patches for race conditions or memory leaks in containerd's gRPC server, which keeps the daemon from crashing or becoming unresponsive.
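To decide whether a node needs the upgrade, the installed version can be compared against a minimum. The following is a minimal sketch, assuming GNU sort (for -V) and assuming that containerd --version prints the version as its third field. version_at_least is a hypothetical helper name, and 1.6.24 is only the example target used above.

```shell
# version_at_least is a hypothetical helper, not part of containerd.
# It succeeds when version $1 is greater than or equal to version $2.
version_at_least() {
  # sort -V orders version strings, so the smaller of the two comes first.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare the installed version with the example target from this guide.
# Assumption: `containerd --version` prints "containerd <repo> v<version> <commit>".
if command -v containerd >/dev/null 2>&1; then
  installed=$(containerd --version | awk '{print $3}' | sed 's/^v//')
  if version_at_least "$installed" "1.6.24"; then
    echo "containerd $installed is at or above 1.6.24"
  else
    echo "containerd $installed is older than 1.6.24"
  fi
fi
```

Pre-release suffixes such as -rc.1 may not sort the way you expect, so treat the result as a hint and confirm against the release notes.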
  2. Insufficient File Descriptors:

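    • Diagnosis: Count the open file descriptors of the containerd daemon. Use pgrep -x so that the containerd-shim processes are not matched as well.
      sudo ls /proc/$(pgrep -x containerd)/fd | wc -l
      
      Compare the result with the per-process limit (the "Max open files" row in /proc/<pid>/limits). If the count is close to that limit, or the node is near the system-wide ceiling in /proc/sys/fs/file-max, containerd might not be able to open new connections.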
    • Fix: Increase the file descriptor limit for the containerd service. Edit /etc/systemd/system/containerd.service.d/override.conf (create if it doesn’t exist):
      [Service]
      LimitNOFILE=65536
      
      Then reload systemd and restart containerd:
      sudo systemctl daemon-reload
      sudo systemctl restart containerd
      
    • Why it works: containerd uses file descriptors for network sockets and file operations. Exceeding the limit prevents it from establishing new connections or accessing necessary files.
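The comparison between open descriptors and the limit can be scripted. This is a rough sketch that reads the Linux /proc filesystem; fd_usage is a hypothetical helper name, and the optional second argument (an alternative proc root) exists only so the function can be exercised against a fake directory.

```shell
# fd_usage is a hypothetical helper, not part of containerd.
# Usage: fd_usage PID [PROC_ROOT]
# Prints the number of open descriptors and the soft "Max open files" limit.
fd_usage() {
  pid="$1"
  proc_root="${2:-/proc}"
  # Each entry under <proc_root>/<pid>/fd is one open descriptor.
  open=$(ls "${proc_root}/${pid}/fd" | wc -l)
  # In the "Max open files" row, field 4 is the soft limit.
  limit=$(awk '/^Max open files/ {print $4}' "${proc_root}/${pid}/limits")
  # $((open)) strips any padding that wc may add.
  echo "open=$((open)) soft_limit=${limit}"
}
```

On a node, run it as root so the daemon's fd directory is readable, for example from a root shell with fd_usage "$(pgrep -x containerd)". If open is a large fraction of soft_limit, raising LimitNOFILE as described above is worth trying.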
  3. Configured Max Concurrent API Calls Exceeded:

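    • Diagnosis: Examine containerd's configuration for limits on concurrent API requests. The setting to look for is max_concurrent_streams in the [grpc] table of /etc/containerd/config.toml. Confirm that your containerd version recognizes this key, for example by checking the output of containerd config default, before relying on it.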
    • Fix: Increase the max_concurrent_streams value. For instance, if it’s set to 1024, you might increase it to 2048:
      [grpc]
        max_concurrent_streams = 2048
      
      After editing config.toml, restart containerd:
      sudo systemctl restart containerd
      
    • Why it works: This limit directly controls how many requests containerd’s gRPC server can handle simultaneously. Increasing it allows more concurrent requests from kubelet and other clients.
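Before changing a value, it helps to see what the [grpc] table currently contains. The sketch below prints that table from a TOML file; grpc_section is a hypothetical helper name, and the awk filter is a simple line-based approach that assumes table headers sit on their own lines (it is not a TOML parser).

```shell
# grpc_section is a hypothetical helper, not part of containerd.
# Usage: grpc_section FILE
# Prints the [grpc] header and every line up to the next table header.
grpc_section() {
  awk '
    # A line that starts with "[" opens a new table; remember whether it is [grpc].
    /^[[:space:]]*\[/ { in_grpc = ($0 ~ /^[[:space:]]*\[grpc\][[:space:]]*$/) }
    in_grpc
  ' "$1"
}
```

For example, grpc_section /etc/containerd/config.toml shows the values set on disk. Running it against the saved output of containerd config dump shows the merged configuration the daemon would use, which makes it easy to check whether a given key is present at all.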
  4. Resource Starvation (CPU/Memory):

    • Diagnosis: Monitor containerd’s resource usage using tools like top, htop, or cAdvisor. Look for high CPU utilization or memory consumption that might be causing it to become unresponsive.
      # One-shot snapshot of CPU and memory for the containerd daemon
      top -b -n 1 -p "$(pgrep -x containerd)"
      # With cAdvisor deployed, open its UI and filter by the container runtime processes
      
    • Fix: Allocate more resources to the node or optimize workloads. If the node is resource-constrained, consider moving to a larger instance or lowering the resource requests/limits of the pods scheduled on it. If systemd imposes limits on containerd itself (for example MemoryMax or CPUQuota), check the unit file /etc/systemd/system/containerd.service and any drop-in files.
    • Why it works: When containerd is starved of CPU or memory, its processes can slow down, become unresponsive, or even be OOM-killed, leading to connection failures.
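For a quick number to track over time, the daemon's resident memory can be read straight from /proc. This is a minimal sketch for Linux; proc_rss_kib is a hypothetical helper name, and the optional second argument (an alternative proc root) is there only to make the function testable.

```shell
# proc_rss_kib is a hypothetical helper, not part of containerd.
# Usage: proc_rss_kib PID [PROC_ROOT]
# Prints the resident set size of the process in KiB.
proc_rss_kib() {
  pid="$1"
  proc_root="${2:-/proc}"
  # The VmRSS row of /proc/<pid>/status looks like "VmRSS:    51234 kB".
  awk '/^VmRSS:/ {print $2}' "${proc_root}/${pid}/status"
}
```

Calling proc_rss_kib "$(pgrep -x containerd)" every few minutes gives a rough view of memory growth. If the value climbs steadily until the process restarts, look in the kernel log (for example with journalctl -k) for out-of-memory kills.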
  5. Network Connectivity Issues Between Kubelet and Containerd:

    • Diagnosis: Verify that the kubelet can reach the containerd API socket. The default socket is /run/containerd/containerd.sock.
      # From the node where kubelet is running, attempt to connect using ctr
      sudo ctr --address /run/containerd/containerd.sock version
      
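      If this fails, confirm that containerd is running and that the socket file exists, then check the socket's permissions and any SELinux/AppArmor policies. Firewall rules do not apply to Unix domain sockets.
    • Fix: Ensure the user that runs kubelet has read/write permissions on the containerd socket. The kubelet usually runs as root; if it does not, add its user to a group that has access, or adjust the ownership of the socket (the uid and gid keys under [grpc] in config.toml). Also confirm that kubelet's --container-runtime-endpoint points at the same socket path.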
    • Why it works: Kubelet communicates with containerd via a Unix domain socket. If permissions are incorrect or the socket is inaccessible, kubelet cannot send commands or receive status updates.
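The file-level part of that check can be separated from the API call. The sketch below classifies a socket path from the point of view of the current user; check_socket is a hypothetical helper name. It inspects only the file, so a stale socket with no listener still reports ok, which is why the ctr command above remains the real test.

```shell
# check_socket is a hypothetical helper, not part of containerd.
# Usage: check_socket PATH
# Prints one of: missing, not-a-socket, permission-denied, ok.
check_socket() {
  sock="$1"
  if [ ! -e "$sock" ]; then
    echo "missing"            # containerd is not running, or it uses another path
  elif [ ! -S "$sock" ]; then
    echo "not-a-socket"       # something else was created at this path
  elif [ -r "$sock" ] && [ -w "$sock" ]; then
    echo "ok"                 # the current user may connect
  else
    echo "permission-denied"  # the current user lacks read/write access
  fi
}
```

Run check_socket /run/containerd/containerd.sock as the same user that runs kubelet; the result tells you whether to look at the daemon (missing) or at ownership and permissions (permission-denied).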
  6. Corrupted containerd State Files:

    • Diagnosis: Inspect containerd’s state directories, typically /var/lib/containerd/. Look for unusual file sizes, permission errors, or evidence of partial writes.
    • Fix: While risky, in extreme cases, a corrupted state can be resolved by stopping containerd, backing up the state directory, and then removing it before restarting containerd. Note: This will cause containerd to lose track of all running containers and images, requiring a full re-pull and restart of workloads.
      sudo systemctl stop containerd
      sudo mv /var/lib/containerd /var/lib/containerd.bak_$(date +%s)
      sudo systemctl start containerd
      
    • Why it works: Corrupted internal data structures prevent containerd from initializing correctly or accessing its managed resources, leading to API failures.

After fixing these issues, you might encounter: Failed to get a kubelet client: dial-out error from kubelet: dial tcp 127.0.0.1:10255: connect: connection refused.
