Cilium’s metrics are designed to be scraped by Prometheus, but Prometheus can’t magically find them without explicit configuration.
Here’s how it works:
Cilium components, like the agent (cilium-agent) and the operator (cilium-operator), expose metrics on specific HTTP endpoints. Prometheus, a time-series database and monitoring system, needs to be told where to find these endpoints and how often to collect data from them.
Let’s see Cilium’s metrics in action. Imagine you have a Kubernetes cluster with Cilium installed. The cilium-agent pod, running on each node, is where most of the action happens. It enforces network policies, handles IP address management, and maintains the underlying network connectivity. Its metrics port is configurable: the examples here use 9090, while recent Cilium Helm charts default to 9962, so check the value in your own installation.
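In Helm-based installs the metrics endpoints generally have to be switched on explicitly before there is anything to scrape. A minimal values sketch, assuming Cilium is managed with its Helm chart; the port numbers are set to match the ones used in this article and are not necessarily the chart defaults:

```yaml
# values.yaml fragment for the Cilium Helm chart (sketch, adjust to your install)
prometheus:
  enabled: true   # serve cilium-agent metrics
  port: 9090      # assumption: chosen to match this article's examples
operator:
  prometheus:
    enabled: true # serve cilium-operator metrics
    port: 9091    # assumption: chosen to match this article's examples
```

After changing these values, the agent pods need to be rolled for the new port to take effect.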
Here’s an illustrative snippet of what you might see if you curl that endpoint directly (assuming you have port-forwarded or have network access). Exact metric names and labels vary between Cilium versions:

```text
# HELP cilium_agent_cgroup_memory_bytes_total cgroup memory usage in bytes
# TYPE cilium_agent_cgroup_memory_bytes_total gauge
cilium_agent_cgroup_memory_bytes_total{cgroup="kubepods",namespace="default",pod="my-app-pod",container="my-app-container"} 1.2345e+08
# HELP cilium_bpf_map_entry_count number of entries in a BPF map
# TYPE cilium_bpf_map_entry_count gauge
cilium_bpf_map_entry_count{mapname="cilium_ipcache",node="worker-node-1"} 1500
# HELP cilium_network_policy_violations_total total number of network policy violations
# TYPE cilium_network_policy_violations_total counter
cilium_network_policy_violations_total{direction="ingress",from_identity="123",to_identity="456",rule_name="deny-all"} 5
```
Prometheus needs to be configured to scrape these. This is typically done via a Service and ServiceMonitor (if you’re using the Prometheus Operator) or a direct scrape_configs entry in your Prometheus configuration.
The ServiceMonitor Approach (with Prometheus Operator):
This is the most common and Kubernetes-native way.
- Create a Service for Cilium’s metrics: You need a Kubernetes Service that points to the Cilium agent pods and exposes the metrics port. The selector must match every cilium-agent pod rather than a single pod name, so select on a label the agent pods share, such as k8s-app: cilium.

  ```yaml
  apiVersion: v1
  kind: Service
  metadata:
    name: cilium-metrics
    namespace: kube-system # Or wherever Cilium is installed
    labels:
      app: cilium # The ServiceMonitor selects the Service by this label
  spec:
    selector:
      k8s-app: cilium # This label is usually present on Cilium agent pods
    ports:
      - name: metrics
        port: 9090 # The port the Cilium agent exposes metrics on
        targetPort: 9090
  ```
- Create a ServiceMonitor: This tells the Prometheus Operator which Service to monitor.

  ```yaml
  apiVersion: monitoring.coreos.com/v1
  kind: ServiceMonitor
  metadata:
    name: cilium
    namespace: monitoring # The namespace where Prometheus is running
    labels:
      release: prometheus # Label used by Prometheus Operator to discover ServiceMonitors
  spec:
    selector:
      matchLabels:
        app: cilium # This must match the labels on your cilium-agent Service
    namespaceSelector:
      matchNames:
        - kube-system # The namespace where your Cilium agent Service is
    endpoints:
      - port: metrics # Matches the 'name' in your Service's ports
        interval: 30s # How often to scrape
        path: /metrics # The default metrics path for Cilium
  ```
The Direct Prometheus Configuration Approach:
If you’re not using the Prometheus Operator, you’ll manually add scrape configurations to your prometheus.yml.
- Find Cilium agent pods: You need to dynamically discover the Cilium agent pods. Prometheus’s Kubernetes service discovery (kubernetes_sd_configs) handles this.
- Add a scrape job to prometheus.yml: The job below keeps only pods labeled k8s-app: cilium, then rewrites the target address to the pod IP plus the metrics port. Build __address__ from the pod IP, not from the pod name and namespace, which do not form a resolvable address. A prometheus.io/port annotation, if present, overrides the default of 9090, and a prometheus.io/path annotation overrides the default /metrics path.

  ```yaml
  scrape_configs:
    - job_name: 'cilium-agent'
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        # Keep only pods that have the 'k8s-app: cilium' label
        - source_labels: [__meta_kubernetes_pod_label_k8s_app]
          action: keep
          regex: cilium
        # Default: scrape the pod IP on the Cilium metrics port (9090 here)
        - source_labels: [__meta_kubernetes_pod_ip]
          action: replace
          target_label: __address__
          regex: (.+)
          replacement: ${1}:9090
        # If the pod has a prometheus.io/port annotation, use that port instead.
        # The two source labels are joined with ';' before the regex is applied.
        - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: (.+);(\d+)
          replacement: ${1}:${2}
        # Use an annotated metrics path if present; otherwise /metrics applies
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        # Use pod name as instance label for clarity
        - source_labels: [__meta_kubernetes_pod_name]
          target_label: instance
  ```
What problem does this solve? This configuration allows Prometheus to discover and collect detailed operational metrics from your Cilium network components. These metrics provide insights into network policy enforcement, BPF map usage, performance, errors, and the overall health of your cluster’s networking layer.
How it works internally:
Cilium components are built with Prometheus client libraries. They expose an HTTP endpoint (usually /metrics) that serves metrics in the Prometheus text format. Prometheus, acting as a client, periodically polls this endpoint. The ServiceMonitor or scrape_configs tell Prometheus which endpoints to poll (based on Kubernetes labels and service discovery) and how often.
The levers you control:
- Scrape Interval: How frequently Prometheus fetches metrics (interval in ServiceMonitor, or scrape_interval in Prometheus config). Shorter intervals give more granular data but increase load.
- Metrics Path: The HTTP path where metrics are exposed (path in ServiceMonitor, or __metrics_path__ relabeling). Cilium defaults to /metrics.
- Port: The network port the metrics are served on (port in Service, or prometheus.io/port annotation).
- Selector Labels: The Kubernetes labels used to identify Cilium components for scraping. Consistency here is key.
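To make the levers concrete, here is a sketch of where each one sits in a plain Prometheus scrape job. It assumes the agent pods carry the k8s-app: cilium label, run in kube-system, and serve metrics on port 9090 as elsewhere in this article; adjust all three to your cluster.

```yaml
scrape_configs:
  - job_name: 'cilium-agent'
    scrape_interval: 30s        # Lever: scrape interval (overrides the global value)
    metrics_path: /metrics      # Lever: metrics path
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [kube-system]  # Assumption: namespace where Cilium runs
    relabel_configs:
      # Lever: selector labels
      - source_labels: [__meta_kubernetes_pod_label_k8s_app]
        action: keep
        regex: cilium
      # Lever: port
      - source_labels: [__meta_kubernetes_pod_ip]
        target_label: __address__
        regex: (.+)
        replacement: ${1}:9090
```

Setting the interval and path at the job level keeps them visible in one place and avoids relabeling for values that never change per pod.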
Cilium also exposes metrics for its operator, which is easy to overlook. If you’re using the Prometheus Operator, create a separate Service and ServiceMonitor for the cilium-operator pods. This article assumes the operator serves metrics on port 9091; recent Helm charts default to 9963, so confirm the port in your deployment.
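A sketch of what that operator pair could look like, mirroring the agent setup. The resource names and the app: cilium-operator label are hypothetical, and the io.cilium/app: operator pod selector is an assumption about how the operator pods are labeled; verify it with kubectl get pods -n kube-system --show-labels before relying on it.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: cilium-operator-metrics   # Hypothetical name
  namespace: kube-system
  labels:
    app: cilium-operator          # Hypothetical label, matched by the ServiceMonitor below
spec:
  selector:
    io.cilium/app: operator       # Assumption: label on the cilium-operator pods
  ports:
    - name: metrics
      port: 9091                  # Operator metrics port used in this article
      targetPort: 9091
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cilium-operator           # Hypothetical name
  namespace: monitoring
  labels:
    release: prometheus           # Must match your Prometheus Operator's discovery label
spec:
  selector:
    matchLabels:
      app: cilium-operator
  namespaceSelector:
    matchNames:
      - kube-system
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```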
The next step is often configuring alerting rules in Prometheus based on these collected Cilium metrics.
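As a starting point, an alerting rule might look like the sketch below. It assumes the agent exports the cilium_drop_count_total counter with a reason label; the alert name, the threshold of 10 drops per second, and the 10-minute window are illustrative placeholders to tune against your own baseline.

```yaml
groups:
  - name: cilium
    rules:
      - alert: CiliumHighDropRate   # Hypothetical alert name
        # Per-instance, per-reason packet drop rate over the last 5 minutes
        expr: sum by (instance, reason) (rate(cilium_drop_count_total[5m])) > 10
        for: 10m                    # Must hold for 10 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "Cilium is dropping packets on {{ $labels.instance }} (reason: {{ $labels.reason }})"
```

With the Prometheus Operator, the same groups section would go under spec in a PrometheusRule resource instead of a rule file.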