Prometheus and Grafana don’t just show you your system’s metrics; they fundamentally change how you think about its health and performance.
Let’s get them running on Azure Kubernetes Service (AKS). We’ll use Helm, the Kubernetes package manager, for this.
First, add the Prometheus community Helm repository:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Now, let’s install Prometheus. We’ll create a prometheus-values.yaml file to customize the installation.
# prometheus-values.yaml
server:
  persistentVolume:
    enabled: true
    storageClass: "azurefile-csi" # Or your preferred AKS storage class
    size: 10Gi
alertmanager:
  persistence: # Recent chart versions use "persistence" here; older ones used "persistentVolume"
    enabled: true
    storageClass: "azurefile-csi" # Or your preferred AKS storage class
    size: 2Gi
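The storageClass determines what backs the persistent volumes. On AKS, azurefile-csi is a common choice for general-purpose persistent volumes, but the Prometheus documentation states that NFS and other non-POSIX-compliant filesystems are not supported for its local storage, so an Azure Disk-backed class is the safer option for the server volume. The built-in disk classes on AKS are managed-csi and managed-csi-premium (the latter uses Premium_LRS disks if you need higher performance).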
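If you go with an Azure Disk-backed class for the Prometheus server, the override might look like the sketch below. The file name prometheus-values-disk.yaml is made up for illustration, and managed-csi-premium is assumed to exist in your cluster (kubectl get storageclass will tell you).

```shell
# Write an alternative values file that puts the Prometheus server's data
# on a premium Azure Disk instead of an Azure Files share.
# Assumption: the cluster has a storage class named "managed-csi-premium".
cat > prometheus-values-disk.yaml <<'EOF'
server:
  persistentVolume:
    enabled: true
    storageClass: "managed-csi-premium"
    size: 10Gi
EOF
```

Helm merges multiple -f files from left to right, so passing this file after prometheus-values.yaml would override only the server's storage class.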
Install Prometheus with your custom values:
helm install prometheus prometheus-community/prometheus -f prometheus-values.yaml -n monitoring --create-namespace
This command installs Prometheus into a monitoring namespace. It sets up the Prometheus server and Alertmanager, ensuring their data persists across pod restarts using Azure File shares.
Wait for the pods to be ready:
kubectl get pods -n monitoring
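You should see a prometheus-server pod (with its config-reloader sidecar) and prometheus-alertmanager-0 running, alongside the kube-state-metrics, node-exporter, and Pushgateway pods that the chart deploys by default. Exact pod names depend on the chart version and release name.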
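Rather than re-running kubectl get pods by hand, you could block until everything is ready. This is a small sketch built on kubectl wait; the function name wait_for_pods and the default timeout are choices made here, not part of the chart.

```shell
# Block until every pod in the namespace reports Ready, or give up after a timeout.
# $1 = namespace (defaults to "monitoring"), $2 = timeout (defaults to 300s).
wait_for_pods() {
  kubectl wait --for=condition=Ready pods --all \
    -n "${1:-monitoring}" --timeout="${2:-300s}"
}

# Example call: wait_for_pods monitoring 300s
```

A non-zero exit status means at least one pod did not become Ready in time, which is usually a hint to check kubectl describe pod for a pending volume claim.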
To access Prometheus, we’ll use kubectl port-forward.
kubectl port-forward svc/prometheus-server 9090:80 -n monitoring
Now, open your browser to http://localhost:9090. You should see the Prometheus UI.
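With the port-forward still running, you can also confirm that Prometheus answers queries from the command line. The sketch below uses the Prometheus HTTP API's instant query endpoint; the helper name prom_query and the PROM_URL variable are assumptions made for this example.

```shell
# Run an instant PromQL query against the Prometheus HTTP API.
# $1 = PromQL expression; PROM_URL defaults to the port-forwarded address.
prom_query() {
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Example call: prom_query 'up'
```

The up query returns one series per scrape target, with a value of 1 for targets Prometheus could reach and 0 for those it could not.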
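Next, Grafana. We’ll install it with Helm as well. The Grafana chart is published in Grafana’s own Helm repository rather than in prometheus-community, so add that repository first:
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install grafana grafana/grafana -n monitoring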
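This installs Grafana with default settings. Note that the chart does not enable persistence by default, so dashboards and data sources created through the UI are lost when the pod is rescheduled unless you set persistence.enabled to true.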
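To make Grafana's storage explicit instead of relying on chart defaults, a values file along these lines could be used. The file name grafana-values.yaml, the 5Gi size, and the managed-csi storage class are assumptions for illustration.

```shell
# Write a values file that asks the Grafana chart for a persistent volume.
# Assumption: the cluster has a storage class named "managed-csi".
cat > grafana-values.yaml <<'EOF'
persistence:
  enabled: true
  size: 5Gi
  storageClassName: "managed-csi"
EOF
```

Pass the file with -f grafana-values.yaml on the Grafana helm install (or helm upgrade) command.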
To get the Grafana admin password:
kubectl get secret --namespace monitoring grafana -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
Now, port-forward Grafana:
kubectl port-forward svc/grafana 3000:80 -n monitoring
Access Grafana at http://localhost:3000. Log in with admin and the password you just retrieved.
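Once logged in, you need to add Prometheus as a data source. Go to Configuration (gear icon) -> Data Sources -> Add data source. In newer Grafana versions the same page is under Connections -> Data sources.
Select "Prometheus". For the URL, enter http://prometheus-server.monitoring.svc.cluster.local. This is the cluster-internal DNS name of the prometheus-server service, which listens on port 80.
Save and test. You should see a confirmation such as "Data source is working" (the exact wording varies between Grafana versions).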
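The same data source can be provisioned at install time so it survives a fresh deployment without any clicking. The sketch below uses the Grafana chart's datasources value; the file name grafana-datasource-values.yaml is made up for this example.

```shell
# Write a values file that provisions Prometheus as Grafana's default data source.
# The URL is the in-cluster service address used in the manual steps.
cat > grafana-datasource-values.yaml <<'EOF'
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus-server.monitoring.svc.cluster.local
        isDefault: true
EOF
```

Passing this file with -f when installing or upgrading the Grafana release should make the data source appear on first login.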
Now, let’s create a dashboard. You can import pre-built dashboards or create your own. A popular one for Kubernetes is the "Kubernetes cluster monitoring (via Prometheus)" dashboard (ID 10000 or similar; dashboard IDs are easy to mix up, so search grafana.com/dashboards to confirm the ID before importing).
To import a dashboard:
- Click the "+" icon in the left sidebar -> Import.
- Paste the dashboard ID (e.g., 10000) or upload a JSON file.
- Select your Prometheus data source.
- Click "Import".
You’ll now see Kubernetes metrics like pod restarts, CPU/memory usage, and network traffic.
The most surprising thing about Prometheus and Grafana is how they encourage you to move from reactive troubleshooting ("the app is down, what’s wrong?") to proactive observation ("this metric is trending upwards, it might cause an outage soon").
Internally, Prometheus scrapes metrics from configured targets (your Kubernetes pods, nodes, etc.) at regular intervals. It stores these data points in its local time-series database (TSDB). Grafana queries this data using the PromQL query language and visualizes it in graphs and panels. Two exporters, kube-state-metrics and node-exporter (often installed as part of the Prometheus Helm chart or as separate deployments), are critical: the first exposes the state of Kubernetes objects, and the second exposes node-level metrics.
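To get your own application scraped, it needs to become one of those targets. A rough sketch, assuming the chart's default scrape configuration (which discovers pods annotated with prometheus.io/scrape) is unchanged; the patch file name, port 8080, and the /metrics path are placeholders.

```shell
# Write a patch that annotates a Deployment's pods so Prometheus's
# Kubernetes service discovery picks them up as scrape targets.
# Assumption: the app serves metrics at /metrics on port 8080.
cat > scrape-annotations-patch.yaml <<'EOF'
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
EOF
```

You would apply it with kubectl patch deployment my-app --patch-file scrape-annotations-patch.yaml, where my-app is a hypothetical Deployment name. After the pods roll, the new target should show up under Status -> Targets in the Prometheus UI.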
What most people don’t realize is that Prometheus’s scraping mechanism is pull-based by default. This means Prometheus actively asks each target’s HTTP endpoint for its metrics, and service discovery mechanisms such as Kubernetes SD tell it which targets exist. For workloads that can’t be scraped, such as short-lived batch jobs that exit before the next scrape, you’d typically push metrics to the Pushgateway, which Prometheus then scrapes like any other target. The remote_write capability serves a different purpose: it forwards samples that Prometheus has already collected to remote storage or another Prometheus-compatible endpoint.
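For a job that cannot be scraped, one common approach is to push a sample to a Pushgateway and let Prometheus scrape it from there. The sketch below follows the Pushgateway's plain-text push API; the helper name push_metric, the PUSHGATEWAY_URL variable, and the example metric are assumptions made here.

```shell
# Push a single sample to a Pushgateway so Prometheus can scrape it later.
# $1 = job name, $2 = metric name, $3 = value.
# PUSHGATEWAY_URL defaults to a locally port-forwarded Pushgateway (assumption).
push_metric() {
  echo "$2 $3" | curl -s --data-binary @- \
    "${PUSHGATEWAY_URL:-http://localhost:9091}/metrics/job/$1"
}

# Example call: push_metric nightly_backup backup_last_success_timestamp_seconds 1700000000
```

To try it from your workstation you would first port-forward the Pushgateway service on port 9091; its name depends on your release, so check kubectl get svc -n monitoring.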
The next step is configuring alerting rules in Prometheus and integrating them with Alertmanager to send notifications to Slack, PagerDuty, or email.
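As a starting point, an alerting rule can be supplied through the Prometheus chart's serverFiles value. This is a minimal sketch: the file name, alert name, and thresholds are illustrative, and it assumes kube-state-metrics is running so that kube_pod_container_status_restarts_total exists.

```shell
# Write a values file that adds one alerting rule to the Prometheus server.
# The rule fires when a container restarts more than 3 times within 15 minutes
# and the condition holds for 5 minutes.
cat > prometheus-alert-values.yaml <<'EOF'
serverFiles:
  alerting_rules.yml:
    groups:
      - name: pod-health
        rules:
          - alert: PodRestartingFrequently
            expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "A container restarted more than 3 times in 15 minutes"
EOF
```

Applying it with helm upgrade and an extra -f prometheus-alert-values.yaml should make the rule appear under Alerts in the Prometheus UI; routing it to Slack, PagerDuty, or email is then a matter of Alertmanager configuration.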