Cilium’s node health checks aren’t just about whether a node is up; they’re about whether your Kubernetes cluster can reliably route traffic between its nodes.
Let’s see this in action. Imagine you have two nodes, node-1 and node-2, and you’re running a simple application on each.
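# Pod pinned to node-1 (kubectl run on its own lets the scheduler pick the node):
kubectl run nginx-1 --image=nginx --port=80 --expose --overrides='{"apiVersion":"v1","spec":{"nodeName":"node-1"}}'
# Pod pinned to node-2:
kubectl run nginx-2 --image=nginx --port=80 --expose --overrides='{"apiVersion":"v1","spec":{"nodeName":"node-2"}}'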
Now, from node-1, you want to curl the service exposed by nginx-2. If the network path between the two nodes is broken, this curl might hang until it times out or return connection refused, even if kubectl get nodes shows both nodes as Ready. Cilium’s node health checks are where that kind of breakage becomes visible.
# On node-1:
kubectl get svc nginx-2
# Let's say the ClusterIP is 10.100.20.30
curl http://10.100.20.30
# This *should* work if node-to-node connectivity is healthy.
# If it fails, check what Cilium's node health checks report.
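The two failure modes mentioned above point at different problems: a refused connection means something answered at the IP level, while a hang usually means packets are being dropped somewhere along the path. The following is a minimal sketch, not part of Cilium, that tells them apart with a plain TCP connect. The function name `classify_connect` and the result strings are illustrative choices.

```python
import socket


def classify_connect(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connection and describe how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The peer (or something in front of it) replied with a reset.
        return "refused"
    except socket.timeout:
        # No reply within the timeout: packets are likely being dropped.
        return "timed out"
    except OSError as exc:
        # Anything else, for example "no route to host".
        return f"error: {exc}"
```

Calling `classify_connect("10.100.20.30", 80)` from node-1 would give a quicker answer than waiting on a hanging curl.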
Cilium’s node health checks work by having every node probe every other node. The agent on each node launches a cilium-health component, which periodically sends two kinds of probes to its peers: ICMP echo requests and HTTP requests over TCP to the health API port (4240 by default). It probes both the node IP and the IP of a dedicated health endpoint on each node, so a failure can be narrowed down to the host network path or the pod datapath. When probes to a node fail, Cilium reports that node as unreachable in its status output and metrics. This is distinct from Kubernetes’ own node status, which reflects the kubelet’s heartbeats and conditions as seen by the API server, not whether nodes can reach each other.
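To make the probing idea concrete, here is a minimal sketch of a TCP reachability probe run against a set of peer nodes. It is not Cilium’s implementation. The port constant assumes the cilium-health default of 4240, and the function names and the peer dictionary are hypothetical.

```python
import socket
import time
from typing import Dict, Optional

HEALTH_PORT = 4240  # assumption: default cilium-health port; adjust for your cluster


def tcp_probe(host: str, port: int = HEALTH_PORT, timeout: float = 2.0) -> Optional[float]:
    """Return the TCP connect latency in seconds, or None if the connect failed."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return None


def probe_peers(peers: Dict[str, str], port: int = HEALTH_PORT,
                timeout: float = 2.0) -> Dict[str, Optional[float]]:
    """Probe each peer (name -> IP) once and map its name to latency or None."""
    return {name: tcp_probe(ip, port, timeout) for name, ip in peers.items()}
```

Running `probe_peers({"node-2": "<node-2-ip>"})` from node-1 gives a rough, one-shot view of the same reachability question the built-in checks answer continuously.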
Health checking is switched on and off in the Cilium agent’s ConfigMap, typically named cilium-config in the kube-system namespace. The relevant keys are:
- enable-health-checking: When true, the agent runs the node-to-node health probes.
- enable-endpoint-health-checking: When true, the agent also creates a health endpoint on each node and probes it, which exercises the pod datapath as well as the host network path.
To illustrate, here’s the relevant part of a typical cilium-config ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  enable-health-checking: "true"
  enable-endpoint-health-checking: "true"
  # ... other Cilium configurations
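Before changing anything, it can help to confirm what the live ConfigMap actually contains. The sketch below reads the JSON produced by `kubectl -n kube-system get configmap cilium-config -o json` and reports the values of whichever keys you ask about. The helper names are hypothetical, and the key names you pass in should be checked against the Cilium version you run.

```python
import json
from typing import Dict, Iterable, Optional


def configmap_values(configmap_json: str, keys: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each requested key to its ConfigMap value, or None if the key is absent."""
    data = json.loads(configmap_json).get("data") or {}
    return {key: data.get(key) for key in keys}


def is_enabled(value: Optional[str]) -> bool:
    """ConfigMap values are strings; treat only "true" (any case) as enabled."""
    return isinstance(value, str) and value.strip().lower() == "true"
```

A missing key comes back as None rather than as "false", which keeps "not set" and "explicitly disabled" distinguishable.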
Applying this configuration:
kubectl apply -f cilium-config.yaml
The agent reads the ConfigMap at startup, so restart the Cilium DaemonSet for the change to take effect. You can then see how a given node views its peers by running cilium-health status inside that node’s agent pod.
# Restart the agents so they pick up the new configuration:
kubectl -n kube-system rollout restart daemonset/cilium
# Inside the Cilium agent pod on the node you are interested in:
cilium-health status
The output lists each node the agent knows about, along with the result and round-trip time of its ICMP and HTTP probes, giving you direct insight into what Cilium sees.
The most surprising thing about Cilium’s node health checks is that they operate independently of Kubernetes’ own node status reporting. A node can be Ready in kubectl get nodes but still be reported unreachable by Cilium if the underlying network path (e.g., BGP, VXLAN tunnel endpoint, direct IP reachability) is compromised. This separation is crucial for maintaining robust pod-to-pod communication, as Cilium is responsible for enforcing network policies and routing, not just basic node availability. It effectively adds a layer of network-centric health validation that Kubernetes itself doesn’t provide out of the box for inter-node communication.
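The gap between the two views can be made explicit by cross-checking them. The sketch below takes the JSON from `kubectl get nodes -o json` and a set of node names that Cilium reports as unreachable, then lists the nodes that are Ready yet unreachable. The set is supplied by the caller (this sketch does not parse any Cilium output), and the function names are hypothetical.

```python
import json
from typing import Iterable, List, Set


def ready_nodes(nodes_json: str) -> Set[str]:
    """Return the names of nodes whose Ready condition has status "True"."""
    ready = set()
    for item in json.loads(nodes_json).get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        if any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions):
            ready.add(item["metadata"]["name"])
    return ready


def ready_but_unreachable(nodes_json: str, unreachable: Iterable[str]) -> List[str]:
    """Nodes Kubernetes considers Ready that the network checks cannot reach."""
    return sorted(ready_nodes(nodes_json) & set(unreachable))
```

Any name this returns is a node where kubectl get nodes alone would have given a misleading picture.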
When troubleshooting, don’t just look at kubectl get nodes. Dig into the Cilium agent logs on the affected nodes and run cilium-health status there to see the probe results directly. The health of your cluster’s network fabric is often a more granular concern than the basic health of the Kubernetes control plane components on a given node.
The next hurdle will be understanding how these node health checks interact with service routing and endpoint discovery when a node does become unhealthy.