Prometheus and Grafana are the dynamic duo of modern DevOps monitoring, but their real magic isn’t just in collecting metrics; it’s in how they help you anticipate failure.

Let’s see them in action. Imagine you have a web service deployed on Kubernetes.

First, Prometheus needs to scrape metrics. Your Prometheus configuration (prometheus.yml) might look something like this, with scrape targets defined:

scrape_configs:
  - job_name: 'my-web-service'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: my-web-app
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
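      # Build the scrape address from the pod IP plus the annotated port.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Override the metrics path only when the annotation is present.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Record the pod name; labels starting with __ are dropped after relabeling.
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod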

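This configuration tells Prometheus to discover every pod through the Kubernetes API and keep only those labeled app: my-web-app that also carry the annotation prometheus.io/scrape: "true". It then takes the scrape port from the prometheus.io/port annotation and the metrics path from prometheus.io/path (usually /metrics). The prometheus.io/* annotations are a convention that these relabel rules implement, not something Prometheus understands on its own.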
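For the discovery rules above to match anything, the pods themselves have to carry the label and annotations. A minimal sketch of what that could look like follows; the Deployment name, image, and port 8080 are placeholders I have assumed, not values from the original setup.

```yaml
# Hypothetical Deployment for the example service.
# Image, replica count, and port are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-web-app
  namespace: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-web-app
  template:
    metadata:
      labels:
        app: my-web-app                  # matched by the first keep rule
      annotations:
        prometheus.io/scrape: "true"     # matched by the second keep rule
        prometheus.io/port: "8080"       # assumed metrics port
        prometheus.io/path: "/metrics"   # copied into __metrics_path__
    spec:
      containers:
        - name: web
          image: example.com/my-web-app:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
```

Annotation values must be strings, which is why "true" and "8080" are quoted.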

Once Prometheus is scraping, Grafana visualizes it. A typical Grafana dashboard might show these panels:

  • HTTP Request Rate: rate(http_requests_total[5m])
  • HTTP Error Rate (%): sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
  • CPU Usage: sum(rate(container_cpu_usage_seconds_total{namespace="my-app", pod=~"my-web-app-.*"}[5m])) by (pod)
  • Memory Usage: sum(container_memory_working_set_bytes{namespace="my-app", pod=~"my-web-app-.*"}) by (pod)
  • Network Traffic: sum(rate(container_network_receive_bytes_total{namespace="my-app", pod=~"my-web-app-.*"}[5m])) by (pod)
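Before any of these queries can render, Grafana needs a Prometheus data source. One way to set that up is a provisioning file, sketched below. The file location and the Prometheus URL are assumptions about a typical in-cluster install, so adjust both to your environment.

```yaml
# Sketch of a Grafana data source provisioning file, e.g.
# /etc/grafana/provisioning/datasources/prometheus.yaml (assumed location).
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy            # Grafana's backend issues the queries
    url: http://prometheus-server.monitoring.svc:9090   # assumed service address
    isDefault: true
```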

These panels offer a snapshot. But the real power comes from understanding the relationships and setting up alerts.

The Problem This Solves: In a distributed system, a single service can be affected by many others. A slowdown in a database, a network blip, or an upstream API error can cascade. Without a unified view, pinpointing the root cause of user-facing issues becomes a treasure hunt in the dark. Prometheus and Grafana provide that unified, time-series-based visibility, allowing you to see not just that something is wrong, but when and where it started.

How It Works Internally: Prometheus is a pull-based system. It actively scrapes HTTP endpoints on your applications or infrastructure for metrics exposed in a plain-text exposition format. These metrics are time-series data: a stream of timestamped values for each unique combination of metric name and labels. Prometheus stores this data in its own time-series database. Grafana, on the other hand, is a visualization tool. It connects to Prometheus (and other data sources) and queries the time-series data to render graphs, tables, and other visual representations. Alerting is split in two: Prometheus evaluates the alert rules defined in its rule files and sends any firing alerts to Alertmanager, a separate component that handles notification.
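The wiring between these pieces lives at the top level of prometheus.yml. The fragment below is a sketch of it; the 30s intervals and the Alertmanager address are assumptions, and rules.yml is the rule file name used later in this article.

```yaml
global:
  scrape_interval: 30s       # how often targets are scraped (assumed value)
  evaluation_interval: 30s   # how often rules are evaluated (assumed value)

# Rule files are loaded relative to this config file.
rule_files:
  - rules.yml

# Where Prometheus sends firing alerts.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager.monitoring.svc:9093']   # assumed address
```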

The Exact Levers You Control:

  1. Scraping Configuration (prometheus.yml): This dictates what Prometheus collects. You define job_name, scrape intervals, target discovery mechanisms (Kubernetes service discovery, file-based discovery, etc.), and relabeling rules to filter and modify metadata.
  2. Metric Exposure: Your applications must expose metrics in Prometheus’s format. Libraries like prometheus_client (Python), client_golang (Go), or micrometer (Java) make this straightforward. You define custom metrics (counters, gauges, histograms, summaries) and update their values as your application logic executes.
  3. Grafana Dashboard Design: This is about how you visualize. You choose which metrics to display, the time ranges, graph types (lines, bars, heatmaps), and how to combine related metrics to tell a coherent story about system health.
  4. Alerting Rules (rules.yml): These are PromQL expressions, loaded through rule_files in prometheus.yml, that fire an alert for every series they return. Examples: up{job="my-web-service"} == 0 (a target is down), rate(http_requests_total{code=~"5.."}[5m]) > 50 (a series is serving more than 50 server errors per second), container_memory_usage_bytes{pod="my-web-app-xyz"} / container_spec_memory_limit_bytes{pod="my-web-app-xyz"} > 0.9 (memory above 90% of its limit).
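Written out as a rule file, the alerting examples from the list might look like the sketch below. The alert names, the for durations, and the severity labels are my own assumptions. I have also widened the single-pod selector to a regex and used container_memory_working_set_bytes to match the Memory Usage panel, which is a substitution rather than the original expression.

```yaml
groups:
  - name: my-web-service
    rules:
      - alert: WebServiceDown            # hypothetical alert name
        expr: up{job="my-web-service"} == 0
        for: 2m                          # must hold for 2m before firing
        labels:
          severity: page
        annotations:
          summary: "Target {{ $labels.instance }} is down"

      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{job="my-web-service", code=~"5.."}[5m])) > 50
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "More than 50 server errors per second"

      - alert: MemoryNearLimit
        # The "> 0" filter skips containers with no memory limit set,
        # which would otherwise divide by zero.
        expr: |
          container_memory_working_set_bytes{namespace="my-app", pod=~"my-web-app-.*", container!=""}
            / (container_spec_memory_limit_bytes{namespace="my-app", pod=~"my-web-app-.*", container!=""} > 0)
            > 0.9
        for: 10m
        labels:
          severity: warn
        annotations:
          summary: "{{ $labels.pod }} is above 90% of its memory limit"
```

The for clause keeps a brief spike from paging anyone: the alert stays pending until the expression has returned a result for the whole duration.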

When you set up an alert rule like sum(rate(container_cpu_usage_seconds_total{namespace="my-app", container!=""}[5m])) by (pod) > 0.8 * sum(kube_pod_container_resource_limits{resource="cpu", namespace="my-app"}) by (pod), and then configure Alertmanager to send notifications, Prometheus doesn’t just passively record data. It evaluates this expression at every rule evaluation interval (evaluation_interval, one minute by default), which is configured separately from the scrape interval. The container!="" filter drops the pod-level aggregate series so CPU is not counted twice, and kube_pod_container_resource_limits comes from kube-state-metrics, so that exporter must be scraped as well. If CPU usage exceeds 80% of the limit for any pod in the my-app namespace, Prometheus fires an alert to Alertmanager, which then dedupes, groups, and routes it to Slack, PagerDuty, or email. This continuous evaluation is what turns monitoring from a rearview mirror into an early-warning system.
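On the Alertmanager side, the grouping and routing described above could be configured roughly as follows. Receiver names, the channel, the timing values, and the severity label are all assumptions for illustration, and the webhook URL and routing key are placeholders. The matchers syntax needs Alertmanager 0.22 or later.

```yaml
route:
  receiver: team-slack                    # default receiver
  group_by: ['alertname', 'namespace']    # one notification per group
  group_wait: 30s                         # wait to batch the first alerts of a group
  group_interval: 5m                      # wait before sending updates for a group
  repeat_interval: 4h                     # re-notify while still firing
  routes:
    - matchers:
        - severity="page"                 # assumes rules set a severity label
      receiver: oncall-pagerduty

receivers:
  - name: team-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME   # placeholder webhook
        channel: '#my-app-alerts'
  - name: oncall-pagerduty
    pagerduty_configs:
      - routing_key: REPLACE_ME           # placeholder Events API v2 key
```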

One of the most useful properties of Prometheus’s query language, PromQL, is that arithmetic and comparison operators work directly on whole sets of time series, so a single query can combine several series, or several different metrics. You can, for instance, calculate the fraction of requests that resulted in a 5xx error by dividing the rate of 5xx errors by the total rate of requests: sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])). This composability lets you derive metrics, such as ratios and utilization against limits, that your application code never exposes directly.
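If a derived metric like this error ratio is used in several dashboards and alerts, it can be precomputed with a recording rule. The sketch below assumes a group name and a 30s interval of my choosing. The record name follows the level:metric:operations naming convention from the Prometheus documentation.

```yaml
groups:
  - name: my-web-service-recording    # hypothetical group name
    interval: 30s                     # assumed; defaults to evaluation_interval
    rules:
      - record: job:http_requests_5xx:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
```

Dashboards and alert rules can then query job:http_requests_5xx:ratio_rate5m like any other metric.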

The next step after mastering basic dashboards and alerts is understanding distributed tracing integration with Prometheus and Grafana, which bridges the gap between metrics and the actual request flow.
