Real-Time Monitoring: Proactive Ops for Engineers

The most surprising thing about monitoring distributed systems is that "real-time" often means minutes, not milliseconds, and that’s usually by design.

Let’s say you’ve got a microservice architecture. Service A calls Service B, which calls Service C. You want to know if Service B is slow, or if it’s returning errors.

Here’s a simplified view of what that might look like in a Prometheus setup, which is a popular open-source monitoring and alerting system.

# prometheus.yml
scrape_configs:
  - job_name: 'service-a'
    static_configs:
      - targets: ['service-a:9090'] # Prometheus scrapes metrics from Service A's /metrics endpoint

  - job_name: 'service-b'
    static_configs:
      - targets: ['service-b:9090']

  - job_name: 'service-c'
    static_configs:
      - targets: ['service-c:9090']

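When Service A makes a request to Service B, it might increment a counter like http_requests_total{service="service-b", method="POST", status_code="200"}. Service B, upon receiving the request, might start a timer and, once it has sent the response, record the elapsed time (0.15 seconds, say) in a histogram named http_request_duration_seconds with the same labels. A histogram does not store each raw duration; it exposes cumulative bucket counters (http_request_duration_seconds_bucket) alongside _sum and _count series. Each service exposes its metrics on a /metrics endpoint, which Prometheus scrapes at a fixed interval. The port is 9090 in this example, but there is no standard port for application metrics, and 9090 is also the default port of the Prometheus server itself.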
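
To make the counter and histogram concrete, here is a minimal sketch of what a service might keep in memory and how it could render it in the Prometheus text exposition format. The Metrics class, its method names, and the bucket bounds are hypothetical and exist only for illustration; a real service would normally use an official Prometheus client library instead of hand-rolling this.

```python
from collections import defaultdict

# Hypothetical bucket upper bounds in seconds; real services tune these.
BUCKETS = (0.1, 0.25, 0.5, 1.0, float("inf"))


def format_labels(pairs):
    """Format (name, value) pairs as {name="value",...}."""
    return "{" + ",".join(f'{k}="{v}"' for k, v in pairs) + "}"


class Metrics:
    """Toy in-memory registry holding one counter and one histogram."""

    def __init__(self):
        self.requests = defaultdict(int)  # label tuple -> request count
        self.bucket_counts = defaultdict(lambda: [0] * len(BUCKETS))
        self.duration_sum = defaultdict(float)

    def record(self, service, method, status_code, seconds):
        key = (("service", service), ("method", method),
               ("status_code", status_code))
        self.requests[key] += 1
        self.duration_sum[key] += seconds
        # Buckets are cumulative: an observation is counted in every
        # bucket whose upper bound is >= the observed value.
        for i, bound in enumerate(BUCKETS):
            if seconds <= bound:
                self.bucket_counts[key][i] += 1

    def render(self):
        """Return the metrics as text, as a /metrics handler would."""
        lines = ["# TYPE http_requests_total counter"]
        for key, count in self.requests.items():
            lines.append(f"http_requests_total{format_labels(key)} {count}")
        lines.append("# TYPE http_request_duration_seconds histogram")
        name = "http_request_duration_seconds"
        for key, counts in self.bucket_counts.items():
            for bound, count in zip(BUCKETS, counts):
                le = "+Inf" if bound == float("inf") else str(bound)
                labels = format_labels(key + (("le", le),))
                lines.append(f"{name}_bucket{labels} {count}")
            lines.append(f"{name}_sum{format_labels(key)} {self.duration_sum[key]}")
            # The +Inf bucket holds every observation, so it equals the count.
            lines.append(f"{name}_count{format_labels(key)} {counts[-1]}")
        return "\n".join(lines) + "\n"


metrics = Metrics()
metrics.record("service-b", "POST", "200", 0.15)
print(metrics.render())
```

The single 0.15-second request lands in the 0.25, 0.5, 1.0 and +Inf buckets but not in the 0.1 bucket, which is what "cumulative" means here.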

Prometheus, configured to scrape these endpoints, collects these time-series data points. You can then query this data using PromQL (Prometheus Query Language). To see the error rate for Service B, you might write a query like this:

sum(rate(http_requests_total{service="service-b", status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="service-b"}[5m]))
* 100

This query calculates the percentage of requests to Service B that returned a 5xx status code over the last 5 minutes. The rate() function computes the average per-second rate of increase of a counter over the given window (here, 5 minutes). Dividing the 5xx rate by the total rate gives the error fraction, and multiplying by 100 turns it into a percentage. Both sides are aggregated down to a single series on purpose: grouping both by status_code would match each 5xx code against itself and always yield 100.
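
The arithmetic behind that query can be sketched in a few lines of Python. This is a simplified model, not Prometheus's actual implementation: it handles counter resets but skips the extrapolation to the window edges that the real rate() performs. The function names and sample data are made up for the example.

```python
def per_second_rate(samples, window_seconds, now):
    """Approximate rate(): average per-second increase of a counter.

    samples: list of (timestamp, value) pairs sorted by timestamp.
    """
    in_window = [(t, v) for t, v in samples
                 if now - window_seconds <= t <= now]
    if len(in_window) < 2:
        return None  # at least two samples are needed to see an increase
    increase = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        # A drop means the counter was reset (e.g. a process restart),
        # so the new value itself is the increase since the reset.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = in_window[-1][0] - in_window[0][0]
    return increase / elapsed


def error_percentage(error_samples, total_samples, window_seconds, now):
    """Share of requests that were errors over the window, in percent."""
    errors = per_second_rate(error_samples, window_seconds, now)
    total = per_second_rate(total_samples, window_seconds, now)
    if errors is None or not total:
        return None
    return errors / total * 100


# One scrape per minute for five minutes (timestamps in seconds).
total_samples = [(0, 1000), (60, 1100), (120, 1200),
                 (180, 1300), (240, 1400), (300, 1500)]
error_samples = [(0, 10), (60, 12), (120, 14),
                 (180, 16), (240, 18), (300, 20)]

print(error_percentage(error_samples, total_samples, 300, now=300))
```

With this data the counters grow by 500 requests and 10 errors over the window, so the result is approximately 2 percent.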

Dashboards, often built with tools like Grafana, visualize these PromQL queries. You’d create panels showing request rates, error rates, latency percentiles (e.g., 95th percentile duration), and system resource usage (CPU, memory) for each service.

// Grafana Dashboard Panel Example (simplified)
{
  "title": "Service B - Error Rate (%)",
  "type": "graph",
  "datasource": "Prometheus",
  "targets": [
    {
      "expr": "sum by (status_code) (rate(http_requests_total{service=\"service-b\", status_code=~\"5..\"}[5m])) / ignoring(status_code) group_left sum(rate(http_requests_total{service=\"service-b\"}[5m])) * 100",
      "legendFormat": "{{status_code}} Errors"
    }
  ],
  "yAxes": [
    {
      "format": "percent",
      "label": null,
      "logBase": 1,
      "max": null,
      "min": 0,
      "show": true
    }
  ]
}
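
The latency percentiles mentioned for dashboards are not stored directly; they are estimated from histogram buckets at query time (PromQL provides histogram_quantile() for this). The sketch below shows the idea under simplifying assumptions: cumulative buckets sorted by upper bound, a lowest bucket that starts at 0, and linear interpolation inside the bucket that contains the target rank. The function is a Python illustration with made-up bucket data, not Prometheus's own code.

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper
    bound, where the last upper bound is float("inf").
    """
    total = buckets[-1][1]
    if total == 0:
        return None  # no observations, nothing to estimate
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            if upper_bound == float("inf"):
                # Cannot interpolate into an unbounded bucket, so fall
                # back to the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count


# 100 requests: 50 took <= 0.1s, 80 took <= 0.25s, 95 took <= 0.5s, ...
buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 100), (float("inf"), 100)]
print(histogram_quantile(0.9, buckets))
```

Here the 90th request falls in the 0.25 to 0.5 second bucket, two thirds of the way through it, so the estimate is roughly 0.417 seconds. The accuracy of any such estimate depends on how well the bucket bounds match the real latency distribution.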

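Alerting is where you define thresholds on these metrics. Alerting rules live in a rule file that Prometheus loads (via rule_files in prometheus.yml) and evaluates itself; Alertmanager is a separate component that receives the resulting alerts. A rule might look like this:

# alert_rules.yml (simplified Prometheus alerting rule; the filename is arbitrary)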
groups:
- name: service-alerts
  rules:
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (job)
      /
      sum(rate(http_requests_total[5m])) by (job)
      * 100 > 5
    for: 5m # The condition must be true for 5 minutes before firing
    labels:
      severity: critical
    annotations:
      summary: "High error rate on {{ $labels.job }}"
      description: "Job {{ $labels.job }} has an error rate above 5% for the last 5 minutes."

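This rule states: "If the percentage of 5xx errors for any job (each job maps to a service in our Prometheus config) stays above 5% for five consecutive minutes, fire a 'HighErrorRate' alert with 'critical' severity." While the condition holds but the five minutes have not yet elapsed, the alert is pending rather than firing. Once it fires, Prometheus sends it to Alertmanager, which groups and deduplicates alerts and routes them to your chosen notification channels (Slack, PagerDuty, email).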
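
The effect of the for clause is easiest to see as a small state machine. The sketch below is a simplified Python model of that behaviour, with a hypothetical function name and made-up evaluation data: a true condition starts a clock, any false evaluation resets it, and the alert fires only once the condition has held for the full duration.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each rule evaluation.

    evaluations: list of (timestamp, condition_is_true) in time order.
    """
    states = []
    active_since = None
    for timestamp, condition in evaluations:
        if not condition:
            active_since = None  # any false evaluation resets the clock
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# Evaluate once a minute; the condition is true except at t=120.
evaluations = [(t, t != 120) for t in range(0, 540, 60)]
print(alert_states(evaluations, for_seconds=300))
# ['pending', 'pending', 'inactive', 'pending', 'pending',
#  'pending', 'pending', 'pending', 'firing']
```

The dip at t=120 restarts the clock, so the alert fires at t=480 rather than t=300. This is one concrete reason "real-time" alerting is measured in minutes: the rate window and the for duration both add deliberate delay in exchange for fewer false alarms.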

The core problem this whole system solves is observability in a distributed environment. When a request fails, it’s rarely a single point of failure. It could be a network blip between services, a downstream dependency timing out, or a resource exhaustion issue on one particular node. Without detailed metrics, tracing, and logs, pinpointing the root cause is like finding a needle in a haystack. Dashboards give you the overview, and alerts tell you when the haystack is on fire.

Most people don’t realize how little history Prometheus keeps by default. Its local storage is designed to store and query recent data efficiently, and samples older than the retention period (15 days by default, set with --storage.tsdb.retention.time) are deleted. For longer-term storage, you typically integrate Prometheus with a solution like Thanos or Cortex, or use a managed service. Without this, historical analysis is limited to the configured retention period, at a resolution no finer than the scrape interval.

The next concept you’ll grapple with is distributed tracing, which complements metrics by showing the path of a single request across multiple services, revealing latency bottlenecks at the individual hop level.