Metrics Collection Systems: The Engineering Deep Dive

The most surprising truth about collecting metrics at scale is that the system you choose fundamentally dictates the kinds of questions you can ask about your distributed system’s behavior.

Imagine you’ve got a web service running across a hundred machines. Each machine independently produces metrics: CPU usage, request latency, error counts. Where do these go? You need a way to collect them, a way to move them to a central place, and somewhere to store them.

Let’s say we’re using Prometheus. Here’s a basic setup. On each of your web service instances, you’ll have a Prometheus client library (e.g., prometheus_client for Python) instrumenting your code.

from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
import random

# Define some metrics
# The 'status' label separates successful ('2xx') from failed ('5xx') requests.
http_requests_total = Counter('http_requests_total', 'Total number of HTTP requests received', ['status'])
request_latency_seconds = Histogram('request_latency_seconds', 'HTTP request latency in seconds', buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0])
active_connections = Gauge('active_connections', 'Number of active connections')

def handle_request(request):
    active_connections.inc()
    start_time = time.time()
    status = '2xx'
    try:
        # Simulate work
        time.sleep(random.uniform(0.1, 3.0))
        if random.random() < 0.05:
            raise Exception("Simulated error")
    except Exception as e:
        status = '5xx'
        print(f"Error: {e}")
    finally:
        # Record every request, failed or not, so that error rates and
        # latency cover all traffic rather than only the successes.
        http_requests_total.labels(status=status).inc()
        request_latency_seconds.observe(time.time() - start_time)
        active_connections.dec()

if __name__ == '__main__':
    # Start up the server to expose the metrics.
    start_http_server(8000)
    print("Metrics exposed on port 8000")

    # Simulate incoming requests
    while True:
        handle_request(None)  # Pass None as we are not using the request object here
        time.sleep(0.5)

This code starts an HTTP server on port 8000 that exposes the current metric values at /metrics, the path Prometheus scrapes by default. Prometheus, configured to scrape this endpoint, will periodically fetch the metrics.
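
It helps to see what a Histogram turns into once it is exposed. The following is a minimal standard-library sketch, not the client library’s implementation: it assumes the same bucket bounds as above and a handful of made-up latencies, and shows that each bucket is cumulative (it counts every observation at or below its upper bound), alongside a running sum and count.

```python
import math

# Same upper bounds as the Histogram above; a +Inf bucket always exists.
BUCKETS = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0, math.inf]

def observe_all(latencies):
    """Return (cumulative bucket counts, sum, count) for a list of observations."""
    bucket_counts = {upper: 0 for upper in BUCKETS}
    for value in latencies:
        for upper in BUCKETS:
            # Cumulative buckets: an observation counts toward every
            # bucket whose upper bound is >= the observed value.
            if value <= upper:
                bucket_counts[upper] += 1
    return bucket_counts, sum(latencies), len(latencies)

# Made-up latencies in seconds, purely for illustration.
buckets, total, count = observe_all([0.05, 0.3, 0.7, 2.5, 12.0])

for upper, n in buckets.items():
    le = "+Inf" if math.isinf(upper) else str(upper)
    print(f'request_latency_seconds_bucket{{le="{le}"}} {n}')
print(f"request_latency_seconds_sum {total}")
print(f"request_latency_seconds_count {count}")
```

The _sum and _count series are what an average-latency query divides, and the _bucket series are what quantile estimates are built from.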

A Prometheus server configuration (prometheus.yml) might look like this:

scrape_configs:
  - job_name: 'my_web_service'
    static_configs:
      - targets: ['webserver1:8000', 'webserver2:8000', 'webserver3:8000']

Prometheus then stores these metrics in its time-series database. The core of its querying power comes from PromQL. You can ask:

"What’s the average request latency over the last 5 minutes?" rate(request_latency_seconds_sum[5m]) / rate(request_latency_seconds_count[5m])
"How many requests per second are we serving?" sum(rate(http_requests_total[1m]))
"What percentage of requests are failing?" sum(rate(http_requests_total{status="5xx"}[5m])) / sum(rate(http_requests_total[5m])) * 100

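To make the counter-based queries above less magical, here is a rough sketch of what a rate calculation does with raw counter samples. It is a simplification under stated assumptions: the sample values are hypothetical, and the helper simple_rate is not Prometheus’s actual algorithm, which also extrapolates to the window boundaries. It does handle counter resets, which is the main reason you take a rate of a counter rather than averaging its raw values.

```python
def simple_rate(samples):
    """Per-second increase of a counter over a list of (timestamp, value) samples.

    Simplified: handles counter resets, but skips the boundary
    extrapolation that Prometheus's rate() performs.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again at 0,
        # so the whole current value counts as new increase.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# Hypothetical samples scraped every 15 seconds: (time in seconds, counter value).
# The process restarts between t=30 and t=45, so both counters reset.
count_samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
sum_samples = [(0, 50.0), (15, 68.0), (30, 86.0), (45, 6.0), (60, 24.0)]

requests_per_second = simple_rate(count_samples)
avg_latency = simple_rate(sum_samples) / simple_rate(count_samples)

print(f"requests/s: {requests_per_second:.2f}")
print(f"average latency: {avg_latency:.2f}s")
```

Dividing the rate of _sum by the rate of _count gives the mean latency of the requests that arrived inside the window, regardless of how long the process has been running.
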
The system solves the problem of centralizing and querying operational data from many distributed components. Prometheus uses a pull-based model: the server scrapes its targets rather than waiting for them to push. It stores data as time series, each identified by a metric name and a set of key-value pairs called labels, for example {instance="webserver1:8000", job="my_web_service", path="/api/v1"}. Prometheus attaches job and instance itself at scrape time; a label like path has to be set by your instrumentation. Labels are crucial because they let you slice and dice your metrics.
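
The data model can be sketched in a few lines of plain Python. This is an illustration of the idea only, not how Prometheus stores anything internally; the series values and the helpers series_key, select, and sum_by are made up for the example.

```python
from collections import defaultdict

def series_key(name, **labels):
    """A series is identified by its metric name plus its full label set."""
    return (name, frozenset(labels.items()))

# Hypothetical current values of a request counter, one entry per series.
series = {
    series_key("http_requests_total", instance="webserver1", path="/api/v1"): 120,
    series_key("http_requests_total", instance="webserver1", path="/health"): 30,
    series_key("http_requests_total", instance="webserver2", path="/api/v1"): 80,
    series_key("http_requests_total", instance="webserver2", path="/health"): 20,
}

def select(name, **matchers):
    """Return (labels, value) pairs for series matching every given label."""
    result = []
    for (metric, label_set), value in series.items():
        labels = dict(label_set)
        if metric == name and all(labels.get(k) == v for k, v in matchers.items()):
            result.append((labels, value))
    return result

def sum_by(name, label):
    """Group matching series by one label and sum their values."""
    totals = defaultdict(int)
    for labels, value in select(name):
        totals[labels.get(label, "")] += value
    return dict(totals)

print(select("http_requests_total", instance="webserver1"))
print(sum_by("http_requests_total", "path"))
print(sum_by("http_requests_total", "instance"))
```

Filtering corresponds to a label matcher in a selector, and sum_by corresponds to aggregating with a by clause.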

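The mental model is: data points with rich metadata. Each metric is a stream of values over time, and labels are the dimensions you can filter and group by. This is why choosing your labels carefully upfront is critical. Prometheus has no schema to migrate, but adding a label later has a similar effect: every affected series gets a new identity, so its history is split from the old series, and queries or dashboards that match or group on exact label sets may need updating.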

The real power comes from combining metrics. You can correlate request latency with CPU usage on specific instances, or track error rates per API endpoint.
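
Combining metrics boils down to pairing up values that share the same labels. The sketch below assumes two hypothetical query results keyed by instance (the numbers and the join_on_instance helper are invented for illustration) and keeps only the instances present on both sides, which mirrors how one-to-one matching in PromQL drops elements without a partner.

```python
# Hypothetical per-instance results, as two instant queries might return them.
avg_latency_seconds = {"webserver1": 0.21, "webserver2": 0.95, "webserver3": 0.24}
cpu_usage_ratio = {"webserver1": 0.35, "webserver2": 0.92, "webserver4": 0.40}

def join_on_instance(left, right):
    """Pair values sharing an instance label; unmatched instances are dropped."""
    shared = sorted(left.keys() & right.keys())
    return {instance: (left[instance], right[instance]) for instance in shared}

joined = join_on_instance(avg_latency_seconds, cpu_usage_ratio)

# Instances that are both slow and CPU-bound; the thresholds are arbitrary.
slow_and_busy = [
    instance
    for instance, (latency, cpu) in joined.items()
    if latency > 0.5 and cpu > 0.8
]

print(joined)
print(slow_and_busy)
```

This only works if both metrics carry the same label with the same values, which is one more reason to keep label names consistent across everything you instrument.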

The one thing most people don’t realize is that the resolution of your data is capped by your scrape interval. Prometheus only sees the value each metric holds at the moment of a scrape. If you scrape every minute, a gauge that spikes for 20 seconds between two scrapes leaves no trace. A counter fares better, since its increments accumulate and are never lost, but you can only tell that they happened somewhere within that minute, not when. Conversely, scraping too frequently multiplies the number of samples and can strain your network and storage.
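
The effect of the scrape interval on a gauge is easy to simulate. The sketch below assumes a hypothetical gauge that sits at 10 and jumps to 500 for 20 seconds; gauge_value and scrape are stand-ins for the application and the scraper, not real APIs.

```python
def gauge_value(t):
    """Hypothetical gauge: 10 normally, 500 during a 20-second spike."""
    return 500 if 70 <= t < 90 else 10

def scrape(interval, duration=300):
    """Sample the gauge every `interval` seconds, as a scraper would."""
    return [gauge_value(t) for t in range(0, duration, interval)]

samples_60s = scrape(60)  # scrapes at t = 0, 60, 120, 180, 240
samples_15s = scrape(15)  # scrapes at t = 0, 15, 30, ..., 285

print(f"60s interval: {len(samples_60s)} samples, max seen = {max(samples_60s)}")
print(f"15s interval: {len(samples_15s)} samples, max seen = {max(samples_15s)}")
```

The 60-second scraper never lands inside the spike, while the 15-second scraper catches it at the cost of four times as many samples to transfer and store.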

The next problem you’ll encounter is how to visualize these metrics effectively and set up alerts when things go wrong.