Cilium’s fundamental unit of network policy enforcement isn’t pods, but individual network endpoints, and understanding the lifecycle of these endpoints is the key to debugging connectivity issues.
Let’s see how this plays out with a simple example. Imagine we have a frontend deployment and a backend deployment, and we want to allow the frontend to talk to the backend on TCP port 8080.
Here’s a simplified frontend pod spec:
    apiVersion: v1
    kind: Pod
    metadata:
      name: frontend-pod-abcde
      labels:
        app: frontend
    spec:
      containers:
      - name: frontend
        image: nginx:latest
        ports:
        - containerPort: 80
And a similarly simplified backend pod spec:
    apiVersion: v1
    kind: Pod
    metadata:
      name: backend-pod-fghij
      labels:
        app: backend
    spec:
      containers:
      - name: backend
        image: httpd:latest  # Listens on port 80 by default; this example assumes it is configured for 8080
        ports:
        - containerPort: 8080
Now, let’s apply a basic Cilium Network Policy:
    apiVersion: cilium.io/v2
    kind: CiliumNetworkPolicy
    metadata:
      name: frontend-to-backend
    spec:
      endpointSelector:
        matchLabels:
          app: frontend
      egress:
      - toEndpoints:
        - matchLabels:
            app: backend
        toPorts:
        - ports:
          - port: "8080"
            protocol: TCP
When frontend-pod-abcde starts, Cilium doesn’t just see a pod. It sees a network endpoint. Each endpoint gets an identity, a numeric ID that Cilium uses for policy enforcement. You can see this by running cilium endpoint list inside the Cilium agent pod on the node (output simplified here):

    ID   Identity   k8sNamespace   k8sPodName           Labels
    1    1000       default        frontend-pod-abcde   app=frontend
    2    2000       default        backend-pod-fghij    app=backend
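Here, frontend-pod-abcde has identity 1000 and backend-pod-fghij has identity 2000. Identities are derived from an endpoint’s security-relevant labels, so every pod with the same label set shares the same identity. The policy we wrote therefore translates to: "Allow egress from endpoints with identity 1000 to endpoints with identity 2000 on TCP port 8080."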
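To check the pod-to-identity mapping without entering the agent pod, one option is to read the CiliumEndpoint objects that Cilium creates for each managed pod. The sketch below assumes those objects are enabled in your cluster; pod_identities is a hypothetical helper name, not a Cilium command.

```shell
# Print "<pod name><TAB><identity id>" for every Cilium-managed pod in a namespace.
# Assumes CiliumEndpoint objects exist (one per managed pod, named after the pod).
pod_identities() {
  local ns="${1:-default}"
  kubectl get ciliumendpoints -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.identity.id}{"\n"}{end}'
}
```

Calling pod_identities default should list both example pods. A pod that is missing from the output, or listed with an empty identity column, is the first thing to investigate.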
The real magic happens when you want to debug connectivity. When frontend-pod-abcde tries to reach backend-pod-fghij and it fails, you’re not just looking at IP addresses and ports. You’re looking at whether the endpoint identities are correctly mapped and whether the policy rules associated with those identities are being hit.
A common failure point is the endpoint’s identity. If a pod starts and Cilium hasn’t yet assigned it an identity, or if it gets an unexpected identity because its labels are wrong, policies won’t apply as intended.
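To see whether traffic is actually being dropped, you can stream drop notifications from the Cilium agent on the node in question. This is a minimal sketch: watch_drops is a hypothetical wrapper, and the CILIUM_BIN variable is an assumption added here because newer Cilium releases name the in-agent CLI cilium-dbg rather than cilium.

```shell
# Stream packet-drop events from one Cilium agent pod until interrupted (Ctrl-C).
# $1: agent pod name, e.g. pod/cilium-abc (hypothetical name).
# CILIUM_BIN: in-agent CLI binary; defaults to "cilium", set to "cilium-dbg" on newer releases.
watch_drops() {
  local agent="$1" cilium_bin="${CILIUM_BIN:-cilium}"
  kubectl -n kube-system exec "$agent" -- "$cilium_bin" monitor --type drop
}
```

Each drop event names the source and destination identities, which lets you tie a failed connection back to the identities shown by cilium endpoint list.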
Troubleshooting Connectivity: Common Pitfalls and Fixes
When frontend-pod-abcde can’t connect to backend-pod-fghij on TCP port 8080, here’s where to look:
Endpoint Identity Missing or Incorrect:
- Diagnosis: Run cilium endpoint list. Check that both frontend-pod-abcde and backend-pod-fghij appear, and verify their assigned identities. If an endpoint is missing, or its labels are wrong, it won’t have the correct identity.
- Cause: The Cilium agent (DaemonSet) on the node where the pod is running might be unhealthy, or there’s a race condition during pod startup where the agent hasn’t processed the pod’s labels yet. Sometimes, incorrect or conflicting k8s:io.kubernetes.pod.name labels can cause issues.
- Fix:
  - If the Cilium agent is unhealthy, check its logs: kubectl logs <cilium-agent-pod> -n kube-system. Restart the agent pod if necessary: kubectl delete pod <cilium-agent-pod> -n kube-system.
  - If labels are the issue, ensure your pod specs have consistent and correct labels that match your CiliumNetworkPolicy selectors. For example, if app: frontend is missing from the frontend pod, Cilium won’t be able to select it.
  - If an identity is missing, give the pod a few seconds after startup and check again. If the problem persists, running kubectl delete pod <pod-name> and letting the Deployment controller recreate the pod can resolve transient identity assignment issues.
- Why it works: Cilium uses these identities to map network traffic to policy rules. Without a correct identity, the traffic is essentially "unknown" to Cilium and might be dropped by default.
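The agent commands above need the name of the Cilium agent pod on the same node as the affected workload. A rough way to look it up is sketched below. It assumes the agent pods carry the label k8s-app=cilium and run in kube-system, which is common but depends on how Cilium was installed; agent_for_pod is a hypothetical helper name.

```shell
# Print the Cilium agent pod (as pod/<name>) running on the same node as a workload pod.
# $1: workload pod name, $2: its namespace (defaults to "default").
# Assumes agent pods are labelled k8s-app=cilium in the kube-system namespace.
agent_for_pod() {
  local pod="$1" ns="${2:-default}" node
  node="$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeName}')" || return 1
  if [ -z "$node" ]; then
    echo "pod $ns/$pod is not scheduled on a node yet" >&2
    return 1
  fi
  kubectl get pods -n kube-system -l k8s-app=cilium \
    --field-selector "spec.nodeName=$node" -o name
}
```

The result can be passed straight to kubectl exec, for example: kubectl -n kube-system exec "$(agent_for_pod frontend-pod-abcde)" -- cilium endpoint list.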
Policy Selectors Not Matching Pod Labels:
- Diagnosis: Run kubectl get ciliumnetworkpolicy frontend-to-backend -o yaml (cnp works as a short name) and compare the endpointSelector and toEndpoints labels against the actual labels on your pods (kubectl get pod <pod-name> --show-labels).
- Cause: Typos, case mismatches, or missing labels in either the CiliumNetworkPolicy or the pod’s metadata.
- Fix: Correct the labels in your CiliumNetworkPolicy or pod definitions so that they match exactly. Label keys and values are case-sensitive, so app: frontend in the policy does not match app: Frontend on the pod.
- Why it works: Policy enforcement relies on matching selectors. If the selector app: backend in the toEndpoints section doesn’t match any pods, the rule permitting egress to the backend never applies to any traffic.
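A quick way to test a selector is to hand the same label expression to kubectl and see which pods come back. The sketch below does that for a simple key=value selector; selector_matches is a hypothetical helper, and it only covers matchLabels-style selectors, not matchExpressions.

```shell
# List the pods in a namespace that a label selector matches.
# $1: selector such as app=backend, $2: namespace (defaults to "default").
# Returns 1 (with a message on stderr) when nothing matches, 2 if kubectl itself fails.
selector_matches() {
  local selector="$1" ns="${2:-default}" pods
  pods="$(kubectl get pods -n "$ns" -l "$selector" -o name)" || return 2
  if [ -z "$pods" ]; then
    echo "no pods in namespace $ns match '$selector'" >&2
    return 1
  fi
  printf '%s\n' "$pods"
}
```

Running selector_matches app=backend and selector_matches app=frontend should each list the expected pod. An empty result for either one points at a label mismatch.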
Egress Rule Missing or Incorrect Port/Protocol:
- Diagnosis: Examine the egress section of your CiliumNetworkPolicy. Ensure the toPorts block correctly specifies the destination port (8080) and protocol (TCP). Run cilium policy get inside the agent pod to see the policy as Cilium has imported it.
- Cause: The policy might be written for a different port, or the protocol is specified incorrectly (e.g., UDP instead of TCP). Also, if you have a default-deny policy in place, even a pod with a valid identity won’t be able to communicate without an explicit allow rule.
- Fix: Update the toPorts section of your CiliumNetworkPolicy to accurately reflect the destination port and protocol. For example:

      egress:
      - toEndpoints:
        - matchLabels:
            app: backend
        toPorts:
        - ports:
          - port: "8080"    # Ensure this is the correct port
            protocol: TCP   # Ensure this is the correct protocol

- Why it works: Cilium matches this rule on the layer 4 port and protocol. A rule written for a different port or protocol simply won’t match the traffic.
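The port check can also be scripted by pulling the port and protocol pairs out of the policy object. This is a sketch only: both function names are hypothetical, it reads numeric ports as written in the policy, and it treats an omitted protocol or ANY as matching every protocol.

```shell
# Print every egress "port/protocol" pair declared in a CiliumNetworkPolicy, one per line.
# $1: policy name, $2: namespace (defaults to "default").
policy_egress_ports() {
  local policy="$1" ns="${2:-default}"
  kubectl get ciliumnetworkpolicy "$policy" -n "$ns" \
    -o jsonpath='{range .spec.egress[*].toPorts[*].ports[*]}{.port}/{.protocol}{"\n"}{end}'
}

# Succeed if the policy has an egress entry for the given port and protocol.
# $1: policy name, $2: port, $3: protocol (TCP or UDP), $4: namespace (optional).
policy_allows_port() {
  local policy="$1" port="$2" proto="$3" ns="${4:-default}"
  policy_egress_ports "$policy" "$ns" | grep -qxE "${port}/(${proto}|ANY)?"
}
```

For the example policy, policy_allows_port frontend-to-backend 8080 TCP should succeed, while the same call with UDP or with port 80 should fail.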
Service Discovery Issues (DNS or ClusterIP):
- Diagnosis: If your frontend is trying to reach a Kubernetes Service for the backend (e.g., http://backend-service:8080), check that DNS resolves correctly (for example, nslookup backend-service from inside the frontend pod, if the image ships it). Check that the backend-service ClusterIP is reachable and that the pods behind it are healthy. Use cilium service list to see how Cilium manages Kubernetes Services.
- Cause: DNS resolution might be failing, or the Service definition itself is misconfigured. Keep in mind that once an egress policy selects the frontend, all egress it doesn’t explicitly allow is denied, and that includes DNS lookups to the cluster DNS service. If you use DNS-aware rules, Cilium’s DNS proxy might also not be functioning correctly. If the Service has no healthy endpoints, connections to it will fail.
- Fix:
  - Ensure your DNS service (like CoreDNS) is running and healthy, and that the frontend’s egress policy allows traffic to it.
  - Check your backend-service definition: kubectl get svc backend-service -o yaml. Ensure targetPort and port are correct, and that selector matches the backend pod labels.
  - Verify backend pods are ready and listening on the correct port.
  - If using Cilium’s Hubble, you can visualize DNS requests and Service lookups.
- Why it works: Cilium integrates with Kubernetes Services. If the Service abstraction is broken (e.g., incorrect selectors, no healthy backend pods, DNS resolution failure), the traffic will never reach the intended destination, regardless of endpoint policies.
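Whether a Service currently has any ready backends can be read from its Endpoints object. The sketch below assumes the Service is called backend-service, as in the example above; service_backends is a hypothetical helper name.

```shell
# Print the ready backend IPs behind a Service, one per line.
# $1: Service name, $2: namespace (defaults to "default").
# Returns 1 (with a message on stderr) when the Service has no ready backends.
service_backends() {
  local svc="$1" ns="${2:-default}" ips
  ips="$(kubectl get endpoints "$svc" -n "$ns" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')" || return 2
  if [ -z "$ips" ]; then
    echo "service $ns/$svc has no ready backends" >&2
    return 1
  fi
  printf '%s\n' "$ips" | tr ' ' '\n'
}
```

An empty result usually means the Service selector matches no pods or the matching pods are failing their readiness checks, so the problem sits in the Service definition rather than in the network policy.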
Network Policy Scope (Namespace):
- Diagnosis: If your frontend and backend pods are in different namespaces, check whether the toEndpoints selector in your CiliumNetworkPolicy names the backend’s namespace.
- Cause: A CiliumNetworkPolicy is namespaced, and by default its toEndpoints selectors only match pods in the policy’s own namespace. If pods are in different namespaces, a policy without cross-namespace awareness will not permit traffic.
- Fix: Modify your policy to include the target namespace. A CiliumNetworkPolicy endpoint selector has no separate namespace or namespaceSelector field; the namespace is matched through the k8s:io.kubernetes.pod.namespace label that Cilium attaches to every pod. For example, if backend is in the prod namespace:

      spec:
        endpointSelector:
          matchLabels:
            app: frontend
        egress:
        - toEndpoints:
          - matchLabels:
              app: backend
              # Explicitly specify the namespace if different
              k8s:io.kubernetes.pod.namespace: prod
          toPorts:
          - ports:
            - port: "8080"
              protocol: TCP

  Or, to select namespaces by label rather than by name, recent Cilium versions expose namespace labels under the io.cilium.k8s.namespace.labels. prefix. The env label here is an assumed custom label on the namespace:

      spec:
        endpointSelector:
          matchLabels:
            app: frontend
        egress:
        - toEndpoints:
          - matchLabels:
              app: backend
              io.cilium.k8s.namespace.labels.env: prod  # Assumes the namespace is labelled env=prod
          toPorts:
          - ports:
            - port: "8080"
              protocol: TCP

- Why it works: Cilium treats the namespace as one more label on the endpoint. Matching on that label explicitly extends the rule across namespace boundaries.
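To find out whether the pods you are trying to reach live outside the policy’s namespace, you can search all namespaces for the label and filter out the local one. This is a small sketch with a hypothetical helper name, pods_outside_namespace.

```shell
# Print "<namespace>/<pod>" for pods matching a selector that are NOT in the given namespace.
# $1: selector such as app=backend, $2: the policy's namespace.
# Exit status is 0 when such pods exist and non-zero when none are found.
pods_outside_namespace() {
  local selector="$1" ns="$2"
  kubectl get pods --all-namespaces -l "$selector" \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\n"}{end}' \
    | grep -v "^${ns}/"
}
```

If pods_outside_namespace app=backend default prints anything, those pods need a namespace-aware selector in toEndpoints before the frontend can reach them.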
Underlying Network/Node Issues:
- Diagnosis: Can pods on the same node communicate? Can pods on different nodes communicate? Use ping or curl from within the pods. Check node-level network connectivity (e.g., ip a, ip route, iptables rules on the node).
- Cause: Network misconfiguration on the host nodes, issues with the CNI overlay network (e.g., VXLAN, Geneve), firewall rules on the nodes, or even physical network problems.
- Fix: Troubleshoot standard network connectivity issues at the node level. Ensure your overlay network is functioning correctly. Check node firewalls.
- Why it works: Cilium builds upon the host’s network stack. If the fundamental network layer between nodes is broken, Cilium policies cannot bridge that gap.
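Cilium also ships a node-to-node connectivity probe, cilium-health, inside the agent container, which can help separate policy problems from broken node networking. The wrapper below is a sketch with a hypothetical name, node_health, and assumes the agent runs in kube-system.

```shell
# Show the node-to-node connectivity status as seen by one Cilium agent.
# $1: agent pod name, e.g. pod/cilium-abc (hypothetical name).
node_health() {
  local agent="$1"
  kubectl -n kube-system exec "$agent" -- cilium-health status
}
```

If the probe reports other nodes as unreachable, the fault lies in the overlay or the host network, and no change to a CiliumNetworkPolicy will fix it.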
Once all these are resolved, the next error you’ll likely encounter is a "connection refused" on the destination port if the application on the backend pod isn’t actually listening, or a timeout if the network path is still somehow blocked at a lower level (less likely with Cilium properly configured).