Prometheus and Grafana are not just tools for pretty graphs; together they form the bedrock of a system that warns you before things break.
Let’s see it in action. Imagine you’ve got a web service. We’ll use a simple Go app that exposes a /metrics endpoint.
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// opsCounter is registered with the default registry by promauto.
var opsCounter = promauto.NewCounter(prometheus.CounterOpts{
	Name: "myapp_operations_total",
	Help: "The total number of processed operations.",
})

func handler(w http.ResponseWriter, r *http.Request) {
	opsCounter.Inc() // Increment the counter for each request
	fmt.Fprint(w, "Hello, World!")
}

func main() {
	http.HandleFunc("/", handler)
	http.Handle("/metrics", promhttp.Handler()) // Expose Prometheus metrics
	fmt.Println("Server starting on :8080")
	// ListenAndServe only returns on failure, so surface the error.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
Once this app is running, it’s already emitting metrics. A quick curl localhost:8080/metrics shows the client library’s default Go runtime and process metrics plus our counter, which looks something like this:
# HELP myapp_operations_total The total number of processed operations.
# TYPE myapp_operations_total counter
myapp_operations_total 123
This is the raw data. Prometheus’s job is to scrape this data from your applications and store it in a time-series database. Grafana’s job is to visualize that data, turning raw numbers into meaningful charts and dashboards.
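To make the scrape step a little more concrete, here is a minimal sketch of reading the text format shown above: skip the # metadata lines and split each remaining line into a name and a value. It assumes sample lines carry no trailing timestamp and no spaces inside label values, and the helper name parseSamples is hypothetical. Prometheus uses its own, far more complete parser; this only illustrates the shape of the data.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseSamples reads Prometheus text-format output and returns the value of
// each sample line, keyed by the metric name (plus label set, if any).
// It is a deliberately small sketch: it rejects lines that do not consist of
// exactly two whitespace-separated fields.
func parseSamples(body string) (map[string]float64, error) {
	samples := make(map[string]float64)
	scanner := bufio.NewScanner(strings.NewReader(body))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		// Skip blank lines and the "# HELP" / "# TYPE" metadata lines.
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) != 2 {
			return nil, fmt.Errorf("unsupported sample line: %q", line)
		}
		value, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in %q: %w", line, err)
		}
		samples[fields[0]] = value
	}
	return samples, scanner.Err()
}

func main() {
	body := "# HELP myapp_operations_total The total number of processed operations.\n" +
		"# TYPE myapp_operations_total counter\n" +
		"myapp_operations_total 123\n"
	samples, err := parseSamples(body)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(samples["myapp_operations_total"]) // prints 123
}
```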
Here’s how Prometheus is configured to scrape our app:
# prometheus.yml
scrape_configs:
  - job_name: 'myapp'
    static_configs:
      - targets: ['localhost:8080']
Once Prometheus is running with this configuration, it requests localhost:8080/metrics at every scrape interval and stores the samples it gets back. The built-in default for scrape_interval is one minute, though many example configurations set it to 15 seconds. You then add a data source in Grafana pointing to your Prometheus server and create panels that query Prometheus for myapp_operations_total and display it as a graph.
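Behind a Grafana panel, a query is an HTTP request to Prometheus’s query API. The sketch below builds an instant query against /api/v1/query and decodes the vector result. It assumes Prometheus is listening on its default port at http://localhost:9090, which the setup above does not state, and the names queryURL and decodeVector are hypothetical helpers, not part of any library.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// queryResponse mirrors only the parts of the instant-query response used here.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			// Value is a [unix_timestamp, "sample_value"] pair.
			Value [2]interface{} `json:"value"`
		} `json:"result"`
	} `json:"data"`
}

// queryURL builds an instant-query URL for the given PromQL expression.
func queryURL(base, expr string) string {
	return base + "/api/v1/query?" + url.Values{"query": {expr}}.Encode()
}

// decodeVector extracts the sample values (as strings) from a response body.
func decodeVector(body []byte) ([]string, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Status != "success" || resp.Data.ResultType != "vector" {
		return nil, fmt.Errorf("unexpected response: status=%q type=%q",
			resp.Status, resp.Data.ResultType)
	}
	var values []string
	for _, r := range resp.Data.Result {
		if s, ok := r.Value[1].(string); ok {
			values = append(values, s)
		}
	}
	return values, nil
}

func main() {
	// Assumed address: a local Prometheus on port 9090.
	resp, err := http.Get(queryURL("http://localhost:9090", "myapp_operations_total"))
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	values, err := decodeVector(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println("current values:", values)
}
```

Grafana’s Prometheus data source does this for you; the sketch only shows what travels over the wire.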
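But what if myapp_operations_total stops increasing? Or worse, what if the entire /metrics endpoint becomes unavailable? That’s where alerting rules and Alertmanager come in. The rules are defined in Prometheus as queries with a condition and a duration, and Prometheus evaluates them; Alertmanager takes the alerts that fire and delivers the notifications. A simple rule might be: "If rate(myapp_operations_total[5m]) == 0 for 10 minutes, fire an alert." For the unreachable-endpoint case, the usual signal is the built-in up metric, which Prometheus sets to 0 for a target whenever a scrape fails.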
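Written out, a rule like the one quoted above goes in a rules file that Prometheus loads through the rule_files setting in prometheus.yml. The following is a sketch rather than a tested production rule: the file name, group name, alert names, severity labels, and annotation text are illustrative assumptions, and only the first expression and its 10-minute duration come from the text. The second rule, with an assumed 5-minute duration, covers a target that cannot be scraped at all.

```yaml
# myapp_rules.yml (hypothetical file name)
groups:
  - name: myapp
    rules:
      - alert: MyAppNoOperations
        expr: rate(myapp_operations_total[5m]) == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "myapp has processed no operations for 10 minutes"
      - alert: MyAppDown
        expr: up{job="myapp"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus cannot scrape myapp"
```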
The full mental model is this:
- Exporters/Instrumented Apps: Your services expose metrics (like the Go app above) or you use dedicated exporters (e.g., node_exporter for system metrics, mysqld_exporter for MySQL).
- Prometheus: Periodically pulls (scrapes) these metrics from targets. It stores them in its time-series database and evaluates alerting rules.
- Alertmanager: Receives alerts fired by Prometheus. It de-duplicates, groups, and routes alerts to various receivers (email, Slack, PagerDuty). It also handles silencing and inhibition of alerts.
- Grafana: Connects to Prometheus as a data source. It allows you to build rich dashboards by querying Prometheus and visualizing the data. It can also display alerts managed by Alertmanager.
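The Alertmanager step in this model, de-duplicating and grouping, can be sketched in a few lines of Go. This is a toy model and not Alertmanager’s code: the Alert type and the fingerprint and groupAlerts helpers are hypothetical, and grouping by a single label is a simplified analogue of the group_by setting in an Alertmanager route.

```go
package main

import (
	"fmt"
	"sort"
)

// Alert is a toy stand-in for an alert sent by Prometheus: just a label set.
type Alert struct {
	Labels map[string]string
}

// fingerprint builds a stable identity from all labels so that repeated
// deliveries of the same alert collapse into one.
func fingerprint(labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	fp := ""
	for _, k := range keys {
		fp += k + "=" + labels[k] + ";"
	}
	return fp
}

// groupAlerts de-duplicates alerts by fingerprint, then buckets them by the
// value of one label so each bucket could become a single notification.
func groupAlerts(alerts []Alert, groupBy string) map[string][]Alert {
	seen := make(map[string]bool)
	groups := make(map[string][]Alert)
	for _, a := range alerts {
		fp := fingerprint(a.Labels)
		if seen[fp] {
			continue // duplicate delivery of an alert we already have
		}
		seen[fp] = true
		key := a.Labels[groupBy]
		groups[key] = append(groups[key], a)
	}
	return groups
}

func main() {
	alerts := []Alert{
		{Labels: map[string]string{"alertname": "MyAppDown", "instance": "a:8080"}},
		{Labels: map[string]string{"alertname": "MyAppDown", "instance": "a:8080"}}, // duplicate
		{Labels: map[string]string{"alertname": "MyAppDown", "instance": "b:8080"}},
		{Labels: map[string]string{"alertname": "HighLatency", "instance": "a:8080"}},
	}
	// Map iteration order is not fixed, so the lines may print in any order.
	for name, group := range groupAlerts(alerts, "alertname") {
		fmt.Printf("%s: %d alert(s)\n", name, len(group))
	}
}
```

With this input, the duplicate is dropped and the two MyAppDown alerts land in one group, which is the behavior that keeps one outage from producing a flood of separate pages.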
The magic of this stack is its declarative nature and its separation of concerns. Prometheus doesn’t need to know how to alert; it just needs to know when to alert. Alertmanager doesn’t need to know how to collect data; it just needs to know what to do with alerts. Grafana doesn’t need to know how to store data; it just needs to know how to query and display it.
A common pitfall is thinking that Prometheus pushes metrics. It doesn’t. It pulls. This is a critical distinction for network design and firewall rules. If your application sits behind a firewall or in a network that Prometheus can’t reach, Prometheus won’t get its metrics. You’ll need to ensure Prometheus can open connections to the metrics endpoint of every target, which means allowing traffic from the Prometheus server to each application’s metrics port.
The next logical step is to explore service discovery, so Prometheus can automatically find and scrape new instances of your application as they come online.