When a Kubernetes workflow pod finishes its task, it doesn’t just vanish into thin air. Its containers stop, but the Pod object stays in the API in a Succeeded or Failed phase until something deletes it. If that cleanup gets stuck, you end up with a graveyard of completed pods cluttering your cluster, in some network setups holding on to IP addresses, and potentially causing issues for other services. The core problem is that the control plane is not learning about, or not acting on, the completion of these pods quickly enough to delete them.
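Before chasing causes, it helps to measure how many finished pods are actually piling up and where. The following is a minimal sketch, assuming you have the parsed JSON from kubectl get pods -A -o json; the function names and the sample data are made up for illustration.

```python
from collections import Counter

# Pod phases that mean all of the pod's containers have exited.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def terminal_pods(pod_list):
    """Return (namespace, name, phase) for each pod in a terminal phase.

    pod_list is the parsed JSON from `kubectl get pods -A -o json`.
    """
    found = []
    for pod in pod_list.get("items", []):
        phase = pod.get("status", {}).get("phase")
        if phase in TERMINAL_PHASES:
            meta = pod.get("metadata", {})
            found.append((meta.get("namespace"), meta.get("name"), phase))
    return found


def count_by_namespace(pods):
    """Count terminal pods per namespace to see where they pile up."""
    return Counter(namespace for namespace, _name, _phase in pods)


# Hypothetical sample shaped like the kubectl output; the names are made up.
sample = {
    "items": [
        {"metadata": {"namespace": "ci", "name": "build-1"}, "status": {"phase": "Succeeded"}},
        {"metadata": {"namespace": "ci", "name": "build-2"}, "status": {"phase": "Failed"}},
        {"metadata": {"namespace": "web", "name": "api-0"}, "status": {"phase": "Running"}},
    ]
}
print(count_by_namespace(terminal_pods(sample)))
```

Against a live cluster, you would replace the sample with json.load(sys.stdin) and pipe the kubectl output into the script.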
The most common culprit is a backlog in the API server’s etcd writes. When a pod transitions to a Succeeded or Failed state, the kubelet on the node reports this back to the API server, which then writes the new status to etcd. If etcd is slow or overloaded, these writes are delayed or time out, so the terminal phase is not persisted and the controllers watching for finished pods never see it.
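Diagnosis: Check the etcd pods for high CPU or memory usage, and look in the etcd logs for slow writes or network issues between etcd members. If you scrape etcd metrics, etcd_disk_wal_fsync_duration_seconds and etcd_disk_backend_commit_duration_seconds show disk latency directly. You can also check the API server logs for messages about etcd timeouts or high latency.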
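Fix: Increase etcd’s performance. This usually means moving etcd’s data directory to faster, dedicated disks (SSDs), or increasing the resources (CPU/RAM) allocated to the existing etcd pods. Adding more etcd members improves availability but not write speed, because every write still has to be replicated to a quorum. Tuning heartbeat-interval and election-timeout helps when members are separated by high network latency and leader elections are unstable; it does not speed up writes by itself. For example, if you’re running etcd in static pods, you might edit its manifest to set resources: {limits: {cpu: "2000m", memory: "4Gi"}, requests: {cpu: "1000m", memory: "2Gi"}}.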
Why it works: A faster etcd can process the state updates from the API server more quickly, allowing the control plane to acknowledge the pod’s completion and initiate garbage collection.
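To make the etcd log check less of an eyeball exercise, you can count the slow-write warnings. This is a rough sketch: the two message strings are warnings that etcd 3.x is known to log, but treat the exact wording as an assumption and adjust it to whatever your etcd version actually prints. The function name is made up.

```python
import json

# Warning texts assumed to appear in etcd logs when applies or disk syncs are slow.
SLOW_MARKERS = ("apply request took too long", "slow fdatasync")


def count_slow_etcd_events(log_lines):
    """Count log lines that contain one of the slow-write markers.

    Handles both JSON-formatted log lines (message in the "msg" field)
    and plain-text lines.
    """
    counts = {marker: 0 for marker in SLOW_MARKERS}
    for line in log_lines:
        try:
            message = str(json.loads(line).get("msg", ""))
        except (ValueError, AttributeError):
            message = line  # not a JSON object, treat as plain text
        for marker in SLOW_MARKERS:
            if marker in message:
                counts[marker] += 1
    return counts


# Hypothetical log excerpt for illustration only.
example_lines = [
    '{"level":"warn","msg":"apply request took too long","took":"1.2s"}',
    '{"level":"info","msg":"compacted"}',
    "WARN slow fdatasync took 2s",
]
print(count_slow_etcd_events(example_lines))
```

A steadily growing count over a fixed time window is a stronger signal than a handful of isolated warnings.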
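Another frequent issue is the kube-controller-manager getting overwhelmed. This component runs many control loops, including the ones that manage pod lifecycles and garbage collection. If it’s too busy, it might not be picking up completed pods for deletion efficiently. Keep in mind that its built-in pod garbage collector only starts deleting terminated pods once their count exceeds --terminated-pod-gc-threshold (12500 by default), so a pile of completed pods below that number is expected behavior rather than a sign of overload.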
Diagnosis: Monitor the kube-controller-manager pod for high CPU utilization. Check its logs for warnings or errors related to leader election or slow reconciliation loops.
Fix: Give the kube-controller-manager more resources. Adding replicas does not help, because only the elected leader does any work; the others stand by. If it’s running as a static pod, increase its resource requests and limits in its manifest file. For instance, change resources: {limits: {cpu: "1000m", memory: "500Mi"}, requests: {cpu: "500m", memory: "250Mi"}} to resources: {limits: {cpu: "2000m", memory: "1Gi"}, requests: {cpu: "1000m", memory: "500Mi"}}.
Why it works: Giving the controller manager more processing power allows it to keep up with the rate of pod state changes and execute its garbage collection duties promptly.
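To judge whether the controller manager is actually pressing against its limit, compare the CPU figure from kubectl top pod with the limit in the manifest. The helper below is a small sketch that assumes only the two common CPU quantity forms (millicores such as "950m" and plain cores such as "2" or "0.5"); other suffixes are not handled, and the function names are made up.

```python
def cpu_millicores(quantity):
    """Convert a CPU quantity like "2000m", "2" or "0.5" to millicores."""
    text = str(quantity).strip()
    if text.endswith("m"):
        return int(text[:-1])
    # Plain cores: round to avoid float artifacts such as 299.999...
    return round(float(text) * 1000)


def cpu_utilization(used, limit):
    """Return the fraction of the CPU limit currently in use."""
    return cpu_millicores(used) / cpu_millicores(limit)


# Hypothetical reading: 950m used against the 1000m limit from the manifest.
print(cpu_utilization("950m", "1000m"))
```

A value that stays close to 1.0 suggests the process is being throttled and that raising the limit is worth trying.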
Network instability between the kubelet and the API server can also cause problems. If the kubelet can’t reliably report a pod’s terminal state (Succeeded or Failed), the API server won’t know to start the cleanup process.
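Diagnosis: Examine kubelet logs on the nodes where the completed pods ran for network errors, timeouts, or connection refused messages when trying to communicate with the API server. Use ping or traceroute from the node to the API server’s IP to check basic routing, but remember that these do not prove the API server’s TCP port is reachable, so also test a connection to that port (commonly 6443).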
Fix: Troubleshoot network connectivity. This could involve checking firewall rules, ensuring proper routing between nodes and the control plane, or restarting network components on the nodes. A simple fix might be ensuring the kubelet configuration points to the correct API server endpoint.
Why it works: Reliable network communication ensures that the API server receives timely updates on pod status, enabling it to initiate garbage collection.
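Because ping only exercises ICMP, a TCP-level probe of the API server port is a more direct check. Here is a minimal sketch using only the Python standard library; the host and port you pass in are assumptions about your environment, and the function name is made up. It only tests that a TCP connection can be opened, not that TLS or authentication succeed.

```python
import socket
import time


def tcp_reachable(host, port, timeout=3.0):
    """Try to open a TCP connection.

    Returns (True, seconds_taken) on success or (False, error_text) on failure.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError as exc:  # covers refused connections, timeouts, DNS failures
        return False, str(exc)
```

Run from an affected node, something like tcp_reachable("10.0.0.1", 6443), with the address replaced by your own API server endpoint, tells you whether the port is open and roughly how long the connection took.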
Resource exhaustion on the nodes themselves can prevent the kubelet from properly signaling pod completion. If a node is out of CPU or memory, the kubelet process might become unresponsive or unable to perform its duties.
Diagnosis: Check node resource utilization using kubectl top nodes (this requires metrics-server to be installed) or by SSHing into the nodes and running top or htop. Look for nodes with sustained high CPU or memory usage.
Fix: Relieve the pressure on the affected nodes. Either give them more capacity (e.g., by upgrading hardware), add more nodes to the cluster so the load spreads out, or reduce the resource usage of other pods running on those nodes.
Why it works: Sufficient node resources ensure the kubelet can operate correctly and report pod status updates without interruption.
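On a cluster with many nodes, scanning the kubectl top nodes table by eye gets tedious. The sketch below parses that table and flags nodes above a threshold. It assumes the usual five-column layout (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%); the 85 percent threshold, the function name, and the sample text are arbitrary choices for illustration.

```python
def hot_nodes(top_output, threshold_pct=85):
    """Return (name, cpu_pct, mem_pct) for nodes at or above the threshold.

    top_output is the text printed by `kubectl top nodes`.
    """
    hot = []
    for line in top_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        name, cpu_text, mem_text = fields[0], fields[2], fields[4]
        try:
            cpu = int(cpu_text.rstrip("%"))
            mem = int(mem_text.rstrip("%"))
        except ValueError:
            continue  # metrics not available for this node
        if cpu >= threshold_pct or mem >= threshold_pct:
            hot.append((name, cpu, mem))
    return hot


# Hypothetical output for illustration; node names and numbers are made up.
sample_top = """\
NAME     CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
node-a   250m         12%    1024Mi          27%
node-b   1900m        95%    7000Mi          91%
node-c   <unknown>    <unknown>   <unknown>   <unknown>
"""
print(hot_nodes(sample_top))
```

A single snapshot can mislead, so it is worth running this a few times over several minutes to confirm the usage is sustained.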
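Sometimes, the issue isn’t with the core components but with the configuration of the workflow pod itself. terminationGracePeriodSeconds is the maximum time the kubelet waits between sending SIGTERM and sending SIGKILL once the pod’s deletion has been requested. If it is set extremely high and a container (a sidecar, for example) ignores SIGTERM or never exits on its own, the pod lingers in a Terminating state for up to that duration, making it appear to the system as if it’s still running. A container that exits promptly is not held back by a long grace period.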
Diagnosis: Inspect the YAML definition of the problematic workflow pods. Look for a terminationGracePeriodSeconds value that is unusually high (e.g., days or weeks).
Fix: Set terminationGracePeriodSeconds to a reasonable value, typically 30 seconds (the Kubernetes default) or less, unless there’s a very specific, documented reason for a longer period. For example, terminationGracePeriodSeconds: 30 in the pod spec.
Why it works: A shorter grace period allows the kubelet to forcefully terminate the pod and report its completion to the API server much faster, thus expediting garbage collection.
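Rather than opening pod manifests one by one, you can scan the pod list for unusually long grace periods. This is a minimal sketch, assuming the parsed JSON from kubectl get pods -A -o json; the 300-second cutoff, the function name, and the sample data are arbitrary illustrations.

```python
def long_grace_pods(pod_list, max_seconds=300):
    """Return pods whose terminationGracePeriodSeconds exceeds max_seconds.

    Each result is (namespace, name, grace_seconds, is_terminating), where
    is_terminating is True if the pod already has a deletionTimestamp.
    """
    flagged = []
    for pod in pod_list.get("items", []):
        grace = pod.get("spec", {}).get("terminationGracePeriodSeconds")
        if isinstance(grace, int) and grace > max_seconds:
            meta = pod.get("metadata", {})
            terminating = "deletionTimestamp" in meta
            flagged.append((meta.get("namespace"), meta.get("name"), grace, terminating))
    return flagged


# Hypothetical sample shaped like the kubectl output; the names are made up.
sample_pods = {
    "items": [
        {"metadata": {"namespace": "ci", "name": "step-1"},
         "spec": {"terminationGracePeriodSeconds": 30}},
        {"metadata": {"namespace": "ci", "name": "step-2",
                      "deletionTimestamp": "2024-01-01T00:00:00Z"},
         "spec": {"terminationGracePeriodSeconds": 604800}},
    ]
}
print(long_grace_pods(sample_pods))
```

Pods that are flagged and already terminating are the ones most likely to be stuck waiting out the grace period.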
Finally, a misconfiguration or bug in the specific workflow controller or operator managing these pods can lead to orphaned pods. For instance, a controller might fail to update the pod’s status correctly or might incorrectly believe the pod is still active.
Diagnosis: Review the logs of the custom controller or operator responsible for managing these workflow pods. Look for errors in status updates or reconciliation loops.
Fix: Update or reconfigure the custom controller/operator. This might involve applying a patch, correcting its reconciliation logic, or ensuring it correctly uses Kubernetes APIs to manage pod lifecycles.
Why it works: A correctly functioning controller ensures that pod statuses are accurately reflected and that the control plane is aware when a pod has genuinely completed its work.
Once these completed pods are finally garbage collected, you might then encounter Too many open files errors in your application logs if the underlying system or application was trying to manage a very large number of file descriptors related to those pods.