NodeNotReady means your Kubernetes worker node is registered with the EKS control plane but isn’t healthy enough to run pods. The control plane can’t schedule new pods onto it, and existing pods might be evicted.
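A quick way to see which nodes are affected is to filter the STATUS column of `kubectl get nodes`. The helper below is a minimal sketch; the function name `not_ready_nodes` is made up for illustration, and it assumes the default column layout (NAME first, STATUS second).

```shell
# Print the names of nodes whose STATUS column does not start with "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin.
# A cordoned but healthy node ("Ready,SchedulingDisabled") is not reported.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

For any node it prints, `kubectl describe node <node-name>` shows the node conditions and recent events.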
Common Causes and Fixes:
- Kubelet Not Running or Unresponsive: The kubelet is the primary agent on each node that ensures containers are running in a Pod. If it crashes or becomes unresponsive, the node can’t report its status.
  - Diagnosis: SSH into the node and check the kubelet service status: `sudo systemctl status kubelet`. Look for recent errors in the logs: `sudo journalctl -u kubelet -f`.
  - Fix: If kubelet is inactive or exited, try restarting it: `sudo systemctl restart kubelet`. If it consistently fails, investigate the logs for specific errors (e.g., disk pressure, network issues, configuration problems). Often, a node reboot can resolve transient kubelet issues.
  - Why it works: The kubelet is the node’s voice to the control plane. Restarting it allows it to re-establish communication and report a healthy status if the underlying issue is resolved.
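The kubelet check above can be turned into a small decision helper. This is only a sketch: `kubelet_next_step` is a hypothetical name, and the states it handles are the ones `systemctl is-active` normally prints.

```shell
# Suggest a next step for a kubelet unit state, as printed by
# `systemctl is-active kubelet` (active, inactive, failed, activating, ...).
kubelet_next_step() {
  case "$1" in
    active)          echo "running: check recent logs for errors" ;;
    inactive|failed) echo "stopped: restart the kubelet, then read its logs" ;;
    activating)      echo "starting: wait a few seconds and check again" ;;
    *)               echo "unrecognized state: $1" ;;
  esac
}

# Usage on the node:
#   kubelet_next_step "$(systemctl is-active kubelet)"
#   sudo journalctl -u kubelet --since "1 hour ago" --no-pager | tail -n 50
```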
- Network Connectivity Issues: The node needs to communicate with the EKS control plane (API server) and other nodes/pods in the cluster. Firewalls, security groups, or routing problems can block this.
  - Diagnosis: From the node, try `curl -v https://<your-eks-cluster-endpoint>`. Check your node’s security group to ensure it allows outbound traffic to the EKS control plane’s endpoint and port (typically 443). Verify network ACLs and route tables for your VPC.
  - Fix: Adjust security group rules to allow necessary outbound traffic. Ensure your VPC route tables are correctly configured to reach the EKS endpoint.
  - Why it works: EKS nodes must be able to reach the Kubernetes API server to register, receive pod definitions, and report their status.
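For a connectivity test, what matters is whether any HTTP response comes back at all, not whether the request is authorized. The sketch below wraps the `curl` check accordingly; `probe_endpoint` and `interpret_probe` are illustrative names, and the 5-second timeout is an arbitrary choice.

```shell
# Ask curl for just the HTTP status code of the endpoint.
# -k skips TLS verification because only reachability is being tested here.
# curl prints "000" when it never received an HTTP response.
probe_endpoint() {
  curl -sk --max-time 5 -o /dev/null -w '%{http_code}' "$1"
}

# Interpret the status printed by probe_endpoint.
# Any real status, including 401 or 403, proves the network path works.
interpret_probe() {
  case "$1" in
    000)             echo "unreachable" ;;
    [1-5][0-9][0-9]) echo "reachable" ;;
    *)               echo "unexpected: $1" ;;
  esac
}

# Usage on the node:
#   interpret_probe "$(probe_endpoint https://<your-eks-cluster-endpoint>)"
```

An "unreachable" result points at DNS, routing, security groups, or network ACLs rather than at Kubernetes authentication.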
- Insufficient Node Resources (CPU/Memory/Disk): If a node runs out of critical resources, the kubelet or other essential system processes can fail, leading to `NodeNotReady`.
  - Diagnosis: SSH into the node and check resource utilization: `top`, `htop`, or `free -h`. For disk space, use `df -h`.
  - Fix:
    - CPU/Memory: Scale up your EC2 instance type or add more nodes to the EKS node group.
    - Disk: Clean up unused files (e.g., old container images, logs). On Docker-based nodes, `docker system prune -a` removes all unused images and stopped containers (use with caution). On containerd-based nodes, which is the default runtime on current EKS AMIs, `sudo crictl rmi --prune` removes unused images. Avoid running `sudo rm -rf /var/lib/docker/overlay2/*` or `sudo rm -rf /var/lib/kubelet/pods/*` while the runtime or kubelet is active: those directories hold the layers and volumes of live containers, so deleting them breaks running pods. If a node is too full to recover cleanly, draining and replacing it is usually safer.
  - Why it works: Kubelet and container runtimes require sufficient resources to operate. Releasing these resources allows them to function correctly.
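To make the disk check repeatable, the output of `df` can be filtered for filesystems above a usage threshold. The following is a sketch: `full_filesystems` is a made-up helper, the 85% default is an arbitrary assumption, and it expects the POSIX output format of `df -P` with mount points that contain no spaces.

```shell
# Print "<mount point> <use%>" for every filesystem at or above a threshold.
# Reads `df -P` output on stdin; the threshold defaults to 85 (percent).
full_filesystems() {
  awk -v limit="${1:-85}" 'NR > 1 {
    use = $5
    sub(/%/, "", use)                      # "95%" -> "95"
    if (use + 0 >= limit + 0) print $6, use "%"
  }'
}

# Usage on the node:
#   df -P | full_filesystems 85
```

Node-level pressure is also visible from the cluster side: `kubectl describe node <node-name>` lists the `MemoryPressure`, `DiskPressure`, and `PIDPressure` conditions.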
- Security Group/IAM Policy Issues: The EC2 instance profile role associated with your EKS nodes needs specific IAM permissions to interact with EKS and other AWS services. Incorrect policies will prevent the kubelet from registering or communicating properly.
  - Diagnosis: Check the IAM role attached to your EC2 instances. Ensure it has the `AmazonEKSWorkerNodePolicy` and `AmazonEC2ContainerRegistryReadOnly` managed policies attached (plus `AmazonEKS_CNI_Policy` if the VPC CNI plugin uses the node role), and a trust relationship allowing `ec2.amazonaws.com` to assume the role. The `eks.amazonaws.com` principal belongs on the cluster role, not the node role. Also confirm that the cluster recognizes the node role (an `aws-auth` ConfigMap mapping or an EKS access entry) and that the cluster security group allows traffic from your node’s security group.
  - Fix: Attach the necessary managed policies to the instance role and update the trust relationship if needed. Add the missing role mapping, and adjust the cluster security group rules so the node security group is allowed.
  - Why it works: The IAM role grants the node the necessary permissions to act as a worker node in EKS, and the role mapping and security group rules allow the control plane to trust and reach it.
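The policy check can be scripted with the AWS CLI. The helper below is a sketch that compares the attached policy names against the two managed policies named above; `missing_policies` is a hypothetical name and `<node-role-name>` is a placeholder for your own role.

```shell
# Print each required managed policy that is NOT in the list on stdin.
# Input: policy names separated by spaces, tabs, or newlines.
missing_policies() {
  local attached policy
  attached=$(tr -s '[:space:]' '\n')
  for policy in AmazonEKSWorkerNodePolicy AmazonEC2ContainerRegistryReadOnly; do
    printf '%s\n' "$attached" | grep -qxF "$policy" || echo "$policy"
  done
}

# Usage (requires AWS CLI credentials with IAM read access):
#   aws iam list-attached-role-policies --role-name <node-role-name> \
#     --query 'AttachedPolicies[].PolicyName' --output text | missing_policies
```

Empty output means both policies are attached. The trust relationship can be inspected separately with `aws iam get-role --role-name <node-role-name>`.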
- Corrupted `kubelet` Configuration or Certificates: Incorrectly configured `kubelet` flags or expired/invalid certificates can prevent it from starting or communicating.
  - Diagnosis: Examine the kubelet configuration and kubeconfig files and look for misconfigurations. On kubeadm-style installs these are `/var/lib/kubelet/config.yaml` and `/etc/kubernetes/kubelet.conf`; on the Amazon Linux 2 EKS-optimized AMI they are typically `/etc/kubernetes/kubelet/kubelet-config.json` and `/var/lib/kubelet/kubeconfig`. Check certificate expiry with `sudo openssl x509 -in <certificate-file> -noout -enddate`, for example against the cluster CA at `/etc/kubernetes/pki/ca.crt`. Note that `/etc/kubernetes/pki/apiserver.crt` is a control plane certificate and is not present on EKS worker nodes. (The exact paths vary with the AMI and installation method.)
  - Fix: Correct any errors in the kubelet configuration or kubeconfig. If certificates are expired, they typically need to be reissued by the EKS control plane, which might involve recreating the node group or instance.
  - Why it works: Kubelet relies on valid configuration and secure communication channels (certificates) to establish and maintain its connection with the API server.
- EKS Control Plane Issues: While less common, the EKS control plane itself might be experiencing issues that prevent nodes from registering or staying ready.
  - Diagnosis: Check the AWS Health Dashboard for EKS service issues in your region. Look at EKS cluster logs if enabled.
  - Fix: If it’s a control plane issue, you’ll likely need to wait for AWS to resolve it.
  - Why it works: The control plane is the brain of EKS; if it’s unhealthy, worker nodes cannot function correctly within the cluster.
After fixing these, you might next encounter ImagePullBackOff if your pods can’t pull their container images.
CrashLoopBackOff indicates that a container in your pod is repeatedly crashing and restarting. Kubernetes is trying to restart it but is backing off between attempts to avoid overwhelming the system.
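The back-off itself follows a simple pattern. Assuming the default kubelet behavior described in the Kubernetes documentation (a delay that starts at 10 seconds, doubles after each crash, and is capped at 5 minutes), the wait before a given restart can be sketched as follows. The function name `backoff_delay` is illustrative.

```shell
# Approximate delay in seconds before the Nth restart of a crashing container,
# assuming a 10s initial delay that doubles each time, capped at 300s.
backoff_delay() {
  local restarts=$1 delay=10 i=1
  while [ "$i" -lt "$restarts" ] && [ "$delay" -lt 300 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt 300 ]; then delay=300; fi
  echo "$delay"
}

# Usage:
#   backoff_delay 3   # third restart
```

This is why a pod with many restarts can sit in CrashLoopBackOff for several minutes between attempts even after the underlying problem has been fixed.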
Common Causes and Fixes:
- Application Errors/Bugs: The most frequent cause is an unhandled exception or error within your application code that causes it to exit.
  - Diagnosis: Get the logs for the crashing pod: `kubectl logs <pod-name> -c <container-name>`. If the pod is constantly restarting, you might need to use `kubectl logs <pod-name> -c <container-name> --previous` to see logs from the last time it ran.
  - Fix: Debug your application code based on the error messages in the logs. Fix the bug and redeploy the container image.
  - Why it works: Eliminating the bug prevents the application from crashing, allowing it to run successfully.
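Before pulling logs, it helps to list which pods are actually crash-looping. The helper below is a sketch that filters `kubectl get pods` output; `crashlooping_pods` is a made-up name, and it assumes the single-namespace column layout (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Print "<pod name> <restart count>" for pods in CrashLoopBackOff.
# Reads `kubectl get pods --no-headers` output on stdin.
# With --all-namespaces the columns shift by one, so this filter would
# need $4 and $5 instead of $3 and $4.
crashlooping_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1, $4 }'
}

# Usage against a live cluster:
#   kubectl get pods --no-headers | crashlooping_pods
#   kubectl logs <pod-name> -c <container-name> --previous --tail=50
```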
- Missing or Incorrect Configuration: Your application might be failing because it can’t find a required configuration file, environment variable, or secret, or these are misconfigured.
  - Diagnosis: Check the pod’s events (`kubectl describe pod <pod-name>`) and logs. Verify that all necessary `ConfigMaps`, `Secrets`, and environment variables are correctly defined in your pod spec and that the application is reading them as expected.
  - Fix: Ensure `ConfigMaps` and `Secrets` are correctly populated and mounted into the pod. Verify environment variable names and values.
  - Why it works: Providing the application with the correct configuration allows it to start up and operate as intended.
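One way to make missing configuration obvious in the logs is to have the container entrypoint check its required environment variables before starting the application. This is a sketch of that idea, not something the platform does for you; `require_env` and the variable names in the usage lines are hypothetical.

```shell
# Return non-zero and name every variable that is unset or empty
# in the process environment.
require_env() {
  local missing=0 name
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "missing required environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage at the top of an entrypoint script (variable names are examples):
#   require_env DATABASE_URL API_TOKEN || exit 1
#   exec ./my-app
```

With this in place, `kubectl logs <pod-name> --previous` shows exactly which setting was absent instead of a less direct application stack trace.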
- Health Check Failures (Liveness/Readiness Probes): If your liveness probe fails repeatedly, Kubernetes will kill and restart the container. A failing readiness probe, by contrast, only removes the pod from Service endpoints and does not restart the container.
  - Diagnosis: Run `kubectl describe pod <pod-name>`. Look at the "Events" section for messages related to liveness probe failures. Check the probe configuration (`livenessProbe` in your pod spec) and ensure the endpoint/command it uses is actually working and responding within the `timeoutSeconds`.
  - Fix: Adjust the `livenessProbe` settings (e.g., increase `initialDelaySeconds`, `timeoutSeconds`, `periodSeconds`) or fix the application endpoint/command that the probe is checking.
  - Why it works: A healthy liveness probe confirms to Kubernetes that the application is running correctly, preventing unnecessary restarts.
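Probe failures appear in the pod events with a recognizable message, so they can be filtered out of the `kubectl describe` output. The helper below is a small sketch; `probe_failures` is an illustrative name, and the pattern assumes the usual "Liveness probe failed: ..." event wording.

```shell
# Keep only probe-failure events from `kubectl describe pod` output on stdin.
# `|| true` keeps the exit status at zero when nothing matches.
probe_failures() {
  grep -E '(Liveness|Readiness|Startup) probe failed' || true
}

# Usage against a live cluster:
#   kubectl describe pod <pod-name> | probe_failures
#   kubectl get events --field-selector involvedObject.name=<pod-name>,reason=Unhealthy
```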
- Insufficient Resources (CPU/Memory Limits): If your container exceeds its memory limit, the kernel’s OOM killer terminates it and the container status shows OOMKilled. Exceeding the CPU limit does not kill the container; it is throttled instead, which can slow startup enough to fail liveness probes.
  - Diagnosis: Check `kubectl describe pod <pod-name>` for `Reason: OOMKilled` (Out Of Memory) under the container’s "Last State". You can also check node-level logs (`journalctl -u kubelet` on the node) for resource-related terminations.
  - Fix: Increase `resources.limits.memory` in your pod specification, and raise `resources.limits.cpu` if throttling is the problem. Alternatively, optimize your application to use fewer resources.
  - Why it works: Allowing the container to consume more resources (or use fewer) prevents the system from killing it due to resource starvation.
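The exit code of the last terminated container is a useful companion to the OOMKilled reason. By the common shell convention, an exit code above 128 means the process was killed by signal number (code - 128), so 137 corresponds to SIGKILL. The sketch below decodes that; `describe_exit_code` is a made-up helper, and an application can in principle choose such codes itself.

```shell
# Describe a container exit code using the 128 + signal convention.
describe_exit_code() {
  local code=$1
  if [ "$code" -eq 137 ]; then
    echo "SIGKILL (signal 9): check whether the reason is OOMKilled"
  elif [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "process exited on its own with status $code"
  fi
}

# Usage against a live cluster:
#   kubectl get pod <pod-name> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
#   describe_exit_code "$(kubectl get pod <pod-name> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```

If memory is confirmed as the cause, `kubectl set resources deployment <deployment-name> --limits=memory=<new-limit>` is one way to raise the limit without editing the manifest by hand.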
- Incorrect Command or Entrypoint: The `command` or `args` specified in your container definition might be incorrect, leading to an immediate exit upon startup.
  - Diagnosis: Examine the `command` and `args` in your pod spec. Try running the container locally with the same command to see if it works. Check `kubectl logs <pod-name> --previous` for any immediate startup errors.
  - Fix: Correct the `command` or `args` in your deployment/pod spec to accurately reflect how your application should be started.
  - Why it works: Ensures the container executes the correct process upon starting.
- Permissions Issues (e.g., File Access): The user running inside the container might not have the necessary permissions to access files or directories it needs to start or operate.
  - Diagnosis: Check application logs for "Permission denied" errors. If your container runs as a non-root user, ensure that user has read/write access to necessary volumes or files.
  - Fix: Adjust file permissions within your container image or modify the `securityContext` in your pod spec to grant appropriate user/group IDs or permissions.
  - Why it works: Allows the application process to perform necessary file operations.
- Image Issues: The container image itself might be corrupt, or the entrypoint script within the image might have an error.
  - Diagnosis: Try pulling the image manually on the node (`docker pull <image-name>:<tag>`, or `sudo crictl pull <image-name>:<tag>` on containerd-based nodes). Check the Dockerfile for the image.
  - Fix: Rebuild the container image, ensuring the Dockerfile is correct and the entrypoint script executes without errors.
  - Why it works: A valid and correctly built image is fundamental for a container to start.
Once CrashLoopBackOff is resolved, you’ll likely see ImagePullBackOff if you’ve recently updated an image and the registry is inaccessible or the image doesn’t exist.