Cilium’s datapath isn’t just a black box of eBPF; it’s a sophisticated system where network packets are intercepted, inspected, and rewritten on the fly, making traditional debugging tools like tcpdump only a partial solution.
Let’s say you’ve got pods that can’t talk to each other, and you’ve ruled out the obvious like incorrect service definitions or network policies. The problem is likely in how Cilium is programming the kernel’s network stack via eBPF.
Common Causes and Fixes for Datapath Connectivity Issues
-
Incorrect eBPF Program Attachment:
- Diagnosis: Cilium attaches its programs to several kernel network hooks. If a program is missing from a hook, traffic passing through that hook is not processed correctly. List the loaded programs with `bpftool prog list` and look for Cilium's programs (recent releases use names prefixed with `cil_`, such as `cil_from_container`; exact names vary by version). Then verify their attach points, for example `tc` or `xdp` on an interface (`bpftool net show`) or a cgroup hook (`bpftool cgroup tree`).
- Fix: A missing attachment is usually a symptom of a larger Cilium agent issue. Restarting the agent pod on the affected node (`kubectl delete pod -n kube-system <cilium-pod-name>`) often resolves it by forcing the agent to re-initialize and re-attach its eBPF programs. Agent pod names carry a generated suffix rather than the node name; `kubectl get pods -n kube-system -l k8s-app=cilium -o wide` shows which pod runs on which node.
- Why it works: The agent is responsible for loading and attaching eBPF programs to the correct kernel hooks. A restart ensures a clean slate for this process.
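To make the program check repeatable, the listing can be filtered down to Cilium's own programs. The following is a minimal sketch, assuming the usual `bpftool prog list` header format (`<id>: <type>  name <name>  tag ...`) and the `cil_` name prefix used by recent Cilium releases; the function name `list_cilium_progs` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: filter `bpftool prog list` output down to Cilium-owned programs.
# Assumption: program names start with "cil_" (recent Cilium releases);
# pass a different prefix as $1 if your version names them differently.

# Reads `bpftool prog list` output on stdin and prints "<type> <name>"
# for every program whose name starts with the given prefix.
list_cilium_progs() {
  local prefix="${1:-cil_}"
  awk -v p="$prefix" '
    # Header lines look like: "42: sched_cls  name cil_from_container  tag ..."
    $1 ~ /^[0-9]+:$/ {
      for (i = 2; i < NF; i++)
        if ($i == "name" && index($(i + 1), p) == 1)
          print $2, $(i + 1)
    }'
}

# Usage on the node (or inside the agent pod), as root:
#   bpftool prog list | list_cilium_progs
#   bpftool net show    # shows which tc/XDP hooks have programs attached
```

An empty result on a node that should be running Cilium is a strong hint that the agent never loaded its programs, which points back at the agent logs.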
-
Datapath Synchronization Delays or Failures:
- Diagnosis: Cilium agents need to synchronize their desired datapath state (e.g., service translations, endpoint IPs, policy rules) with the eBPF programs and maps actually present in the kernel. Check the agent logs for messages indicating synchronization problems, often mentioning `Stuck` or `Failed`. You can also query the agent's status: `kubectl exec -n kube-system <cilium-pod-name> -- cilium status`. Confirm that the agent reports itself healthy and that no controllers are listed as failing (`cilium status --verbose` shows them in detail).
- Fix: Ensure your Kubernetes API server is reachable and healthy from the Cilium agent pods. If there are network issues between the agent and the API server, synchronization will fail. Increasing the agent's `sync-interval` (e.g., `spec.initContainers[0].env` for `CILIUM_EXTRA_ARGS` to include `--sync-interval 60s`) can sometimes help if the network is intermittently flaky, but confirm that your Cilium version supports this option before relying on it, and treat it as a workaround, not a fix for the underlying connectivity.
- Why it works: A stable connection to the API server allows the agent to continuously receive updates and program the datapath correctly.
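Scanning the agent log by eye is error-prone, so a small filter helps. This is a sketch only: the keywords `stuck` and `failed` come from the diagnosis above, the function name `filter_sync_errors` is hypothetical, and the container name `cilium-agent` is assumed to match a standard installation.

```shell
#!/usr/bin/env bash
# Sketch: surface log lines that hint at datapath synchronization trouble.

# Reads agent log lines on stdin and prints those mentioning "stuck" or
# "failed" (case-insensitive). Returns success even when nothing matches.
filter_sync_errors() {
  grep -iE 'stuck|failed' || true
}

# Usage (pod name is a placeholder):
#   kubectl logs -n kube-system <cilium-pod-name> -c cilium-agent --since=10m \
#     | filter_sync_errors
```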
-
IP Address Management (IPAM) Conflicts or Exhaustion:
- Diagnosis: Cilium assigns IPs to pods. If the IPAM configuration is incorrect or the allocated IP pool is exhausted, new pods might not get IPs, or existing ones might have incorrect ones, breaking connectivity. Check the `IPAM:` line of `cilium status`, which reports how many addresses are allocated out of the node's pool. Also inspect pod IPs using `kubectl get pods -o wide`.
- Fix: Verify the IPAM settings in your Cilium Helm values or the `cilium-config` ConfigMap, in particular the IPAM mode and the pod CIDR. If a pool is exhausted, you'll need to expand it or reconfigure. For example, in `cluster-pool` mode, ensure `cluster-pool-ipv4-cidr` is sufficiently large; in `kubernetes` mode, pod CIDRs come from each node's `spec.podCIDR`, so the range Kubernetes hands out must be large enough instead.
- Why it works: Proper IPAM ensures each pod gets a unique, routable IP address that the datapath can use for forwarding decisions.
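Pool exhaustion is easier to spot as a percentage. The sketch below assumes the `IPAM:` line of `cilium status` has the form `IPAM: IPv4: <used>/<total> allocated from <cidr>`; that format may differ between versions, and the function name `ipam_usage` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compute IPv4 pool usage from `cilium status` output.
# Assumption: the status contains a line such as
#   "IPAM:   IPv4: 250/254 allocated from 10.0.1.0/24,"

# Reads `cilium status` output on stdin and prints
# "<used> <total> <percent>" (percent truncated to an integer).
ipam_usage() {
  sed -nE 's/^IPAM:.*IPv4: *([0-9]+)\/([0-9]+) allocated.*/\1 \2/p' |
    awk '$2 > 0 { printf "%d %d %d\n", $1, $2, ($1 * 100) / $2 }'
}

# Usage (pod name is a placeholder):
#   kubectl exec -n kube-system <cilium-pod-name> -- cilium status | ipam_usage
```

A node whose percentage stays near 100 while pods sit in `ContainerCreating` is a likely candidate for pool exhaustion.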
-
BGP Control Plane Issues (if using BGP):
- Diagnosis: If you're using Cilium's BGP capabilities to advertise pod CIDRs to your network, BGP peering issues can cause external connectivity problems. Check the BGP status from the Cilium agent: `kubectl exec -n kube-system <cilium-pod-name> -- cilium bgp peers`. Look for established sessions and successful route advertisements.
- Fix: Ensure your BGP router configuration matches the `CiliumBGPPeeringPolicy` defined in your cluster, including AS numbers and neighbor IPs. Verify network reachability between the Cilium nodes and your BGP peers.
- Why it works: BGP is how Cilium tells your physical network how to route traffic to your pods. If BGP isn't working, external routers won't know where to send traffic.
-
eBPF Map Corruption or Incorrect State:
- Diagnosis: eBPF maps are key-value stores used by eBPF programs to maintain state (e.g., service backend translations, connection tracking). If a map gets corrupted or contains stale data, traffic can be misrouted. Use `bpftool map list` to see available maps and `bpftool map dump id <map_id>` to inspect their contents. Look for maps such as `cilium_lb4_services_v2` (service translations) or `cilium_ct4_global` (connection tracking); `bpftool` may display these names truncated.
- Fix: Restarting the Cilium agent pod (`kubectl delete pod -n kube-system <cilium-pod-name>`) is the most common way to get these maps resynchronized. In rare, persistent cases, you might need to manually clear specific entries using `bpftool map delete ...`, but this is advanced and risky.
- Why it works: Restarting the agent forces it to reconcile the maps with the current state from the Kubernetes API.
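Finding the right map ID by hand gets tedious when a node holds dozens of maps. The following sketch resolves IDs by name prefix, assuming the usual `bpftool map list` header format (`<id>: <type>  name <name>  flags ...`). The kernel typically stores only the first 15 characters of a map name, so a prefix match is safer than an exact one. The function name `map_ids_by_prefix` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: resolve eBPF map IDs by name prefix from `bpftool map list` output.

# Reads `bpftool map list` output on stdin and prints the numeric ID of
# every map whose name starts with the prefix given as $1.
map_ids_by_prefix() {
  local prefix="$1"
  awk -v p="$prefix" '
    # Header lines look like: "7: hash  name cilium_ct4_glob  flags 0x0"
    $1 ~ /^[0-9]+:$/ {
      for (i = 2; i < NF; i++)
        if ($i == "name" && index($(i + 1), p) == 1) {
          id = $1
          sub(/:$/, "", id)   # strip the trailing colon from the ID
          print id
        }
    }'
}

# Usage on the node (or inside the agent pod), as root:
#   for id in $(bpftool map list | map_ids_by_prefix cilium_ct4); do
#     bpftool map dump id "$id" | head -n 20
#   done
```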
-
Network Policy Enforcement Misconfiguration:
- Diagnosis: While not strictly a "datapath failure," an overly restrictive or incorrectly configured network policy can appear as a connectivity issue. Inside the agent pod, use `cilium endpoint list` to find the pod's endpoint ID and its policy enforcement status, and `cilium policy get` to view the rules loaded on that node. Then run `cilium monitor --type drop --related-to <endpoint-id>` to see if traffic is being dropped by policy enforcement.
- Fix: Review your `CiliumNetworkPolicy` or `CiliumClusterwideNetworkPolicy` resources. Ensure the selectors match the intended pods and that the egress/ingress rules accurately reflect the required communication. For example, if pod A needs to reach pod B on port 80, ensure there's an ingress rule on pod B allowing traffic from pod A on port 80 and, if pod A is subject to egress policy, an egress rule on pod A allowing traffic to pod B on port 80.
- Why it works: Network policies are enforced by eBPF programs that act as gatekeepers for traffic. If a policy denies traffic, the eBPF program drops the packet.
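The pod A to pod B example can be written out as a pair of policies. This is an illustrative sketch, not a drop-in manifest: the labels `app: pod-a` and `app: pod-b`, the policy names, and the function name `render_a_to_b_policy` are all hypothetical and need to be adapted to your workloads.

```shell
#!/usr/bin/env bash
# Sketch: print two CiliumNetworkPolicy manifests that allow pod A to reach
# pod B on TCP port 80. Labels and names below are placeholders.

render_a_to_b_policy() {
  cat <<'EOF'
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-ingress-from-pod-a
spec:
  endpointSelector:
    matchLabels:
      app: pod-b
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: pod-a
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
---
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-egress-to-pod-b
spec:
  endpointSelector:
    matchLabels:
      app: pod-a
  egress:
  - toEndpoints:
    - matchLabels:
        app: pod-b
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
EOF
}

# Usage (namespace is a placeholder):
#   render_a_to_b_policy | kubectl apply -n <namespace> -f -
```

Note that once an egress policy selects pod A, other egress traffic from it (including DNS lookups) is denied unless separately allowed.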
After fixing these, the next problem you're likely to hit is DNS resolution, either because your CNI isn't correctly configured to handle DNS traffic or because CoreDNS itself is having problems.