Troubleshoot Cilium Connectivity: Endpoints, Policies, DNS (2026)

Cilium enforces network policy on endpoints rather than on pods as such, and understanding the lifecycle of these endpoints is the key to debugging connectivity issues.

Let’s see how this plays out with a simple example. Imagine we have a frontend deployment and a backend deployment, and we want to allow the frontend to talk to the backend on TCP port 8080.

Here’s a simplified frontend pod spec:

```
apiVersion: v1
kind: Pod
metadata:
  name: frontend-pod-abcde
  labels:
    app: frontend
spec:
  containers:
  - name: frontend
    image: nginx:latest
    ports:
    - containerPort: 80
```

And a similarly simplified backend pod spec:

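```
apiVersion: v1
kind: Pod
metadata:
  name: backend-pod-fghij
  labels:
    app: backend
spec:
  containers:
  - name: backend
    image: httpd:latest
    # Assumes the server is configured to listen on 8080
    # (the stock httpd image listens on 80 by default).
    ports:
    - containerPort: 8080
```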

Now, let’s apply a basic Cilium Network Policy:

```
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: frontend-to-backend
spec:
  endpointSelector:
    matchLabels:
      app: frontend
  egress:
  - toEndpoints:
    - matchLabels:
        app: backend
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
```

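When frontend-pod-abcde starts, Cilium doesn’t just see a pod. It sees a network endpoint. This endpoint gets a security identity, a numeric ID derived from the pod’s labels that Cilium uses for policy enforcement. Endpoints with identical labels share the same identity. You can see this by running cilium endpoint list inside the Cilium agent pod (in recent releases the in-agent binary is named cilium-dbg). The output below is simplified: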

```
ID   Identity   k8sNamespace   k8sPodName           Labels
1    1000       default        frontend-pod-abcde   app=frontend
2    2000       default        backend-pod-fghij    app=backend
```

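Here, frontend-pod-abcde has identity 1000 and backend-pod-fghij has identity 2000. The policy we wrote translates to: "Allow egress from endpoints with identity 1000 to endpoints with identity 2000 on TCP port 8080." Because the policy selects the frontend for egress, all other egress from the frontend, including DNS lookups, is denied unless another rule allows it.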
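To make the label-to-identity relationship concrete, here is a toy model in Python. It is only a sketch: the class name ToyIdentityAllocator and the numbering scheme (1000, 2000, ...) are invented to mirror the example output above, and Cilium’s real allocator works differently. The one property it illustrates is that the identity follows the label set, so two replicas with the same labels share an identity.

```python
from typing import Dict, FrozenSet, Tuple

LabelSet = FrozenSet[Tuple[str, str]]


class ToyIdentityAllocator:
    """Toy model: one numeric identity per distinct label set (hypothetical numbering)."""

    def __init__(self, first_id: int = 1000, step: int = 1000) -> None:
        self._next_id = first_id
        self._step = step
        self._by_labels: Dict[LabelSet, int] = {}

    def identity_for(self, labels: Dict[str, str]) -> int:
        # The key is the full label set, so any label change means a new identity.
        key = frozenset(labels.items())
        if key not in self._by_labels:
            self._by_labels[key] = self._next_id
            self._next_id += self._step
        return self._by_labels[key]


allocator = ToyIdentityAllocator()
frontend_a = allocator.identity_for({"app": "frontend"})
frontend_b = allocator.identity_for({"app": "frontend"})  # a second replica
backend = allocator.identity_for({"app": "backend"})

print(frontend_a, frontend_b, backend)  # 1000 1000 2000
```

In this model, a pod whose labels differ from what the policy selects (say, an extra or misspelled label) ends up with a different identity, which is why label problems surface as policy problems.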

The real magic happens when you want to debug connectivity. When frontend-pod-abcde tries to reach backend-pod-fghij and it fails, you’re not just looking at IP addresses and ports. You’re looking at whether the endpoint identities are correctly mapped and whether the policy rules associated with those identities are being hit.

A common failure point is the endpoint’s identity. If a pod starts and Cilium hasn’t assigned it an identity yet, or if its labels produce a different identity than the policy expects, policies won’t apply correctly.

Troubleshooting Connectivity: Common Pitfalls and Fixes

When frontend-pod-abcde can’t connect to backend-pod-fghij on TCP port 8080, here’s where to look:

Endpoint Identity Missing or Incorrect:
- Diagnosis: Run cilium endpoint list inside the Cilium agent pod on the node hosting each pod. Check that both frontend-pod-abcde and backend-pod-fghij appear, that their state is ready, and that their identities and labels are what you expect. If an endpoint is missing, or its labels are wrong, it won’t have the correct identity.
- Cause: The Cilium agent (DaemonSet) on the node where the pod is running might be unhealthy, or the pod started before the agent finished processing its labels. Because the identity is derived from the pod’s labels, plus the automatically added k8s:io.kubernetes.pod.namespace label, a wrong or missing label yields a different identity than the one your policy selects.
- Fix:
  - If the Cilium agent is unhealthy, check its logs: kubectl logs <cilium-agent-pod> -n kube-system. Restart the agent pod if necessary: kubectl delete pod <cilium-agent-pod> -n kube-system.
  - If labels are the issue, ensure your pod specs have consistent and correct labels that match your CiliumNetworkPolicy selectors. For example, if app: frontend is missing from the frontend pod, Cilium won’t be able to select it.
  - If an identity is missing, give the agent a few seconds to finish creating the endpoint. An endpoint that stays in a state such as waiting-for-identity or not-ready points back at the agent. For transient identity assignment issues, running kubectl delete pod <pod-name> and letting the Deployment controller recreate the pod is often enough.
- Why it works: Cilium uses these identities to map network traffic to policy rules. Without the correct identity, traffic from or to the pod matches none of the allow rules written for it and is dropped wherever policy enforcement is active.

Policy Selectors Not Matching Pod Labels:
- Diagnosis: Run kubectl get cnp frontend-to-backend -o yaml and compare the endpointSelector and toEndpoints labels against the actual labels on your pods (kubectl get pod <pod-name> --show-labels).
- Cause: Typos, case mismatches, or missing labels in either the CiliumNetworkPolicy or the pod’s metadata.
- Fix: Correct the labels in your CiliumNetworkPolicy or pod definitions to ensure they match exactly. For example, ensure app: frontend in the policy matches app: frontend on the pod.
- Why it works: Policy enforcement relies on matching selectors. If the selector app: backend in the toEndpoints section matches no pods, the rule permitting egress to the backend allows nothing, and because the frontend is still selected by the policy, its egress traffic is dropped.
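The exact-match behaviour of matchLabels is easy to underestimate, so here is a minimal Python sketch of it. The helper selector_matches is a hypothetical name, and it covers only plain matchLabels equality (no matchExpressions, no Cilium label-source prefixes).

```python
from typing import Dict


def selector_matches(match_labels: Dict[str, str], pod_labels: Dict[str, str]) -> bool:
    """True only if every selector key/value pair is present, exactly, on the pod."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


selector = {"app": "backend"}

print(selector_matches(selector, {"app": "backend", "tier": "api"}))  # True: extra pod labels are fine
print(selector_matches(selector, {"app": "Backend"}))                 # False: value case mismatch
print(selector_matches(selector, {"App": "backend"}))                 # False: key case mismatch
print(selector_matches(selector, {"application": "backend"}))         # False: different key
```

Comparing the output of kubectl get pod <pod-name> --show-labels against the policy with this mental model usually exposes the mismatch quickly.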

Egress Rule Missing or Incorrect Port/Protocol:
- Diagnosis: Examine the egress section of your CiliumNetworkPolicy. Ensure the toPorts block correctly specifies the destination port (8080) and protocol (TCP). Run cilium policy get inside the agent pod to see the rules as the agent has imported them.
- Cause: The policy might be written for a different port, or the protocol is specified incorrectly (e.g., UDP instead of TCP). Also, as soon as any policy selects a pod for egress (or you run a default-deny policy), that pod can only send traffic that an explicit allow rule matches, even if it has a valid identity.
- Fix: Update the toPorts section of your CiliumNetworkPolicy to accurately reflect the destination port and protocol. For example:
```
egress:
- toEndpoints:
  - matchLabels:
      app: backend
  toPorts:
  - ports:
    - port: "8080" # Ensure this is the correct port
      protocol: TCP # Ensure this is the correct protocol
```
- Why it works: Cilium matches traffic against policy at L3/L4, using the peer identity plus the destination port and protocol. An incorrect specification means the rule simply won’t match the traffic.
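As a rough illustration of how a toPorts entry is compared with a flow, consider the following Python sketch. PortRule and flow_allowed are hypothetical names, and the model is deliberately simplified: it handles numeric ports and the TCP, UDP, and ANY protocol values, and ignores named ports and port ranges.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PortRule:
    port: str      # CiliumNetworkPolicy writes ports as strings, e.g. "8080"
    protocol: str  # "TCP", "UDP", or "ANY"


def flow_allowed(rules: List[PortRule], dst_port: int, protocol: str) -> bool:
    """Return True if any rule matches both the destination port and the protocol."""
    for rule in rules:
        port_ok = rule.port == str(dst_port)
        proto_ok = rule.protocol.upper() in ("ANY", protocol.upper())
        if port_ok and proto_ok:
            return True
    return False


rules = [PortRule(port="8080", protocol="TCP")]

print(flow_allowed(rules, 8080, "TCP"))  # True
print(flow_allowed(rules, 8080, "UDP"))  # False: protocol mismatch
print(flow_allowed(rules, 80, "TCP"))    # False: port mismatch
```

Note that the port in the rule is the port the backend pod listens on, which is the number to double-check when a Service maps port to a different targetPort.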
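
Service Discovery Issues (DNS or ClusterIP):
- Diagnosis: If your frontend is trying to reach a Kubernetes Service for the backend (e.g., http://backend-service:8080), check from inside the frontend pod whether DNS resolves (nslookup backend-service). Check if the backend-service ClusterIP is reachable and if the pods behind it are healthy. Run cilium service list inside the agent pod to see how Cilium has programmed Kubernetes Services.
- Cause: DNS lookups may be blocked by policy: an egress policy like frontend-to-backend allows only the backend on port 8080, so queries to the cluster DNS service on port 53 are dropped unless a separate rule allows them. If you use DNS-aware rules, Cilium’s DNS proxy might not be functioning correctly. The Service definition itself may also be misconfigured. If the Service is pointing to no healthy endpoints, traffic will be dropped.
- Fix:
  - Ensure your DNS service (like CoreDNS) is running and healthy, and that the frontend’s egress policy allows traffic to it on port 53 (UDP and TCP).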
  - Check your backend-service definition: kubectl get svc backend-service -o yaml. Ensure targetPort and port are correct, and that selector matches backend pod labels.
  - Verify backend pods are ready and listening on the correct port.
  - If using Cilium’s Hubble, you can inspect flows, including dropped DNS requests and traffic to Services, with hubble observe (for example, hubble observe --verdict DROPPED).
- Why it works: Cilium integrates with Kubernetes Services. If the Service abstraction is broken (e.g., incorrect selectors, no healthy backend pods, DNS resolution failure), the traffic will never reach the intended destination, regardless of endpoint policies.

Network Policy Scope (Namespace):
- Diagnosis: If your frontend and backend pods are in different namespaces, check whether the toEndpoints selector in your CiliumNetworkPolicy names the backend’s namespace. CiliumNetworkPolicy selectors have no namespaceSelector or namespace field; the namespace is matched through the k8s:io.kubernetes.pod.namespace label.
- Cause: A CiliumNetworkPolicy is namespaced, and its toEndpoints selectors match only pods in the policy’s own namespace unless they specify a namespace label. If the pods are in different namespaces, a policy without that label will not permit the traffic.
- Fix: Modify your policy to include the target namespace. For example, if backend is in the prod namespace:
```
spec:
  endpointSelector:
    matchLabels:
      app: frontend
  egress:
  - toEndpoints:
    - matchLabels:
        app: backend
        # Match the backend's namespace explicitly
        k8s:io.kubernetes.pod.namespace: prod
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
```
  Or, to select the namespace by one of its labels instead:
```
spec:
  endpointSelector:
    matchLabels:
      app: frontend
  egress:
  - toEndpoints:
    - matchLabels:
        app: backend
        # Namespace labels are exposed to selectors under this prefix;
        # a custom label on the namespace works the same way
        io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name: prod
    toPorts:
    - ports:
      - port: "8080"
        protocol: TCP
```
- Why it works: Cilium policies are namespace-aware. Explicitly matching the target namespace, by name or via namespace labels, lets the rule apply across namespace boundaries.
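The implicit namespace scoping can be pictured with a small Python model. This is a simplification under stated assumptions: effective_selector and matches are hypothetical helpers, and the model considers only the k8s:io.kubernetes.pod.namespace label, which Cilium attaches to every pod’s endpoint.

```python
from typing import Dict

NAMESPACE_LABEL = "k8s:io.kubernetes.pod.namespace"


def effective_selector(policy_namespace: str, match_labels: Dict[str, str]) -> Dict[str, str]:
    """Scope a toEndpoints selector to the policy's namespace unless it names one itself."""
    scoped = dict(match_labels)
    scoped.setdefault(NAMESPACE_LABEL, policy_namespace)
    return scoped


def matches(selector: Dict[str, str], endpoint_labels: Dict[str, str]) -> bool:
    return all(endpoint_labels.get(k) == v for k, v in selector.items())


# Labels of a backend endpoint running in the prod namespace.
backend_in_prod = {"app": "backend", NAMESPACE_LABEL: "prod"}

# The policy lives in "default"; the first selector relies on implicit scoping.
implicit = effective_selector("default", {"app": "backend"})
explicit = effective_selector("default", {"app": "backend", NAMESPACE_LABEL: "prod"})

print(matches(implicit, backend_in_prod))  # False: scoped to "default"
print(matches(explicit, backend_in_prod))  # True: namespace named explicitly
```

The first selector looks correct in the YAML, which is why this failure is easy to miss: the namespace restriction is applied without appearing in the policy text.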

Underlying Network/Node Issues:
- Diagnosis: Can pods on the same node communicate? Can pods on different nodes communicate? Use ping or curl from within the pods, keeping in mind that ping (ICMP) can itself be dropped by a policy that only allows specific TCP ports. Check node-level network connectivity (e.g., ip a, ip route, iptables rules on the node).
- Cause: Network misconfiguration on the host nodes, issues with the CNI overlay network (e.g., VXLAN, Geneve), firewall rules on the nodes, or even physical network problems.
- Fix: Troubleshoot standard network connectivity issues at the node level. Ensure your overlay network is functioning correctly. Check node firewalls.
- Why it works: Cilium builds upon the host’s network stack. If the fundamental network layer between nodes is broken, Cilium policies cannot bridge that gap.

Once all these are resolved, the next error you’ll likely encounter is a "connection refused" on the destination port if the application on the backend pod isn’t actually listening, or a timeout if the network path is still somehow blocked at a lower level (less likely with Cilium properly configured).
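Telling these outcomes apart is easier with a small probe. The Python sketch below classifies a TCP connection attempt using only the standard library; the function name probe and its return strings are illustrative choices, and it assumes you run it from a pod or debug container that has Python available. A "refused" result means the packet reached the destination but nothing was listening, while a "timeout" is what a silently dropped packet (for example, a policy drop) typically looks like from the client side.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt: ok, dns_failure, refused, timeout, or error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.gaierror:
        # Name resolution failed before any packet was sent to the backend.
        return "dns_failure"
    except ConnectionRefusedError:
        # The destination answered with a reset: reachable, but nothing listening.
        return "refused"
    except socket.timeout:
        # No answer at all: consistent with a dropped packet somewhere on the path.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


# Example (run inside the frontend pod's network namespace):
#   print(probe("backend-service", 8080))
```

A "dns_failure" points back at the Service discovery checks, "timeout" at policy or node networking, and "refused" at the backend application itself.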