Cilium’s observability isn’t just about seeing network traffic; it’s about understanding the distributed system’s health and performance through the lens of its network policy enforcement.
Let’s see what this looks like in practice. Imagine you’ve got a Kubernetes cluster running Cilium, and you want to visualize its metrics.

    # First, ensure Prometheus is deployed and configured to scrape Cilium's metrics.
    # Example Prometheus Operator ServiceMonitor configuration:
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: cilium-metrics
      namespace: kube-system  # Or wherever Cilium is installed
    spec:
      selector:
        matchLabels:
          # Labels on the Service that fronts Cilium's metrics endpoint;
          # adjust them to match your installation.
          app.kubernetes.io/name: cilium
          k8s-app: cilium-metrics
      namespaceSelector:
        matchNames:
          - kube-system  # Namespace where Cilium pods are running
      endpoints:
        - port: metrics  # Named Service port on which Cilium exposes metrics
          interval: 30s
          scheme: http

With this ServiceMonitor in place, and with Cilium’s Prometheus metrics enabled (for example via the Helm value prometheus.enabled=true), Prometheus will automatically discover and scrape the metrics that Cilium exposes. You can then query them in Prometheus itself, for example cilium_drop_count_total or cilium_forward_count_total.
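To check from a script that the scrape is working, one option is the Prometheus HTTP API. The following is a minimal sketch, not a definitive client: the Prometheus address is an assumption (for instance a local `kubectl port-forward`), and the helper names `build_query_url`, `extract_samples`, and `query` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable at this address, e.g. through a port-forward.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API (/api/v1/query)."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def extract_samples(response: dict) -> list[tuple[dict, float]]:
    """Turn an instant-vector API response into (labels, value) pairs."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    return [
        (series["metric"], float(series["value"][1]))
        for series in response["data"]["result"]
    ]


def query(promql: str, base_url: str = PROMETHEUS_URL) -> list[tuple[dict, float]]:
    """Run an instant query against a live Prometheus server."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return extract_samples(json.load(resp))
```

Calling `query("up")` against a running server returns one pair per scraped target; swap in whichever Cilium metric name your version exports.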
Now, let’s get this into Grafana. First add Prometheus as a data source in Grafana. Once that’s done, you can import a pre-built Grafana dashboard for Cilium from grafana.com. A commonly cited dashboard ID is 10678 for "Cilium"; confirm that the dashboard you import matches your Cilium version.
This dashboard will show you:
- Cilium Agent Status: Health and uptime of your Cilium agents.
- Network Policy Enforcement: Metrics on how many policies are being applied, dropped packets due to policy, etc.
- Datapath Performance: Packet rates, byte rates, and potential drops at the BPF datapath level.
- Service Load Balancing: Metrics related to kube-proxy replacement (if enabled) and service routing.
- Envoy Proxy Metrics: If you’re using Cilium’s integrated Envoy for L7 policy, you’ll see its request rates, latency, and error counts.
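Adding Prometheus as a data source, the first Grafana step above, can also be scripted through Grafana's HTTP API (`POST /api/datasources`) instead of the UI. The sketch below only builds the request; the Grafana and Prometheus URLs, the token, and the helper names `datasource_payload` and `build_request` are hypothetical placeholders.

```python
import json
import urllib.request


def datasource_payload(name: str, prometheus_url: str) -> dict:
    """JSON body describing a Prometheus data source for Grafana."""
    return {
        "name": name,
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # Grafana's backend, not the browser, queries Prometheus
        "isDefault": False,
    }


def build_request(grafana_url: str, api_token: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request that creates the data source."""
    return urllib.request.Request(
        url=f"{grafana_url.rstrip('/')}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )
```

Passing the prepared request to `urllib.request.urlopen` sends it; importing the dashboard itself is usually simpler through the Grafana UI.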
The fundamental problem Cilium monitoring solves is bridging the gap between Kubernetes object-level events and the actual network packet flow and policy enforcement happening underneath. Without this, debugging network issues often involves guesswork, correlating pod restarts with network outages without understanding why packets are being dropped or how services are being routed.
Internally, the Cilium agent exposes its metrics via an HTTP endpoint, by default on port 9962 (configurable; the operator defaults to 9963), behind a Service port named metrics. This endpoint serves data in the Prometheus exposition format. The key components contributing to these metrics are:
- Cilium Agent: The main agent running on each node, responsible for BPF program management, policy enforcement, and L3/L4 networking.
- Cilium Operator: Manages cluster-wide aspects like IP address management (IPAM) and CRD lifecycle.
- Cilium Network Policies (CNPs) / Kubernetes Network Policies (KNPs): Enforcing these policies generates metrics for allowed, denied, and forwarded traffic.
- eBPF Programs: The core of Cilium. These programs, attached to network interfaces and kernel events, are instrumented to emit counters and gauges.
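The Prometheus exposition format served by the metrics endpoint is plain text, one sample per line. To make that concrete, here is a minimal parsing sketch. The sample text is illustrative rather than captured from a real agent, the function name `parse_exposition` is hypothetical, and the parser skips parts of the format (escape handling in label values, for example).

```python
import re

# name{label="value",...} value [timestamp]
SAMPLE_LINE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
LABEL_PAIR = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Illustrative input only; real values and label sets will differ.
SAMPLE = """\
# HELP cilium_drop_count_total Total dropped packets
# TYPE cilium_drop_count_total counter
cilium_drop_count_total{direction="INGRESS",reason="Policy denied"} 42
cilium_forward_count_total{direction="EGRESS"} 1027
"""


def parse_exposition(text: str) -> list[tuple[str, dict, float]]:
    """Parse exposition text into (metric name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_PAIR.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

For real workloads, a maintained parser such as the one in the official Prometheus client library is a safer choice than a hand-rolled regular expression.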
The cilium_drop_count_total metric, for instance, is a counter that the eBPF datapath increments every time it drops a packet, labeled by drop reason (such as a policy denial) and direction (ingress or egress). Its counterpart for forwarded packets is cilium_forward_count_total.
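Because such a metric is a counter, its raw value matters less than how fast it grows. The sketch below shows the arithmetic on hypothetical (timestamp, value) samples, including the reset that occurs when an agent restarts. It is a simplification: PromQL's `rate()` additionally extrapolates to the edges of the query window.

```python
def counter_increase(samples: list[tuple[float, float]]) -> float:
    """Total increase across (timestamp, value) samples, tolerating resets."""
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # A counter never decreases on its own, so a lower value means it
            # restarted from zero; count everything accumulated since then.
            increase += curr
    return increase


def per_second_rate(samples: list[tuple[float, float]]) -> float:
    """Average per-second growth between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    return counter_increase(samples) / elapsed


# Hypothetical scrape results taken 30 seconds apart, with a reset before t=60.
scrapes = [(0.0, 100.0), (30.0, 160.0), (60.0, 20.0), (90.0, 50.0)]
```

With these samples the increase is 60 + 20 + 30 = 110 packets over 90 seconds, roughly 1.22 packets per second.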
A common point of confusion is understanding the difference between metrics from the Cilium agent itself versus metrics from Envoy when used for L7 policies. Cilium agent metrics typically focus on L3/L4 connectivity, BPF operations, and general agent health. Envoy metrics, on the other hand, are specific to HTTP/gRPC requests, load balancing decisions within Envoy, and L7 policy matches. You’ll find distinct metric prefixes for each, like cilium_ for the agent and envoy_ for Envoy.
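When browsing a long metric list, the prefix convention makes it easy to separate the two sources. A small sketch, assuming only the `cilium_` and `envoy_` prefixes mentioned above; the function name and the example metric names are illustrative.

```python
from collections import defaultdict

# Prefix-to-source mapping taken from the convention described above.
PREFIX_SOURCES = {"cilium_": "agent", "envoy_": "envoy"}


def group_by_source(metric_names: list[str]) -> dict[str, list[str]]:
    """Group metric names by the component their prefix points to."""
    groups = defaultdict(list)
    for name in metric_names:
        source = next(
            (src for prefix, src in PREFIX_SOURCES.items() if name.startswith(prefix)),
            "other",  # e.g. Go runtime or process metrics
        )
        groups[source].append(name)
    return dict(groups)
```

For example, `group_by_source(["cilium_drop_count_total", "envoy_cluster_upstream_rq_total", "go_goroutines"])` places one name each under `agent`, `envoy`, and `other`.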
The next step is to start correlating these network metrics with application-level metrics to pinpoint performance bottlenecks or policy misconfigurations.