Alert Tuning: Cut Noise, Catch Bugs

Alerts are noisy, and we’ve all been there: drowning in a sea of red on our dashboards, trying to find the one signal in the chaos. The real trick isn’t just setting alerts; it’s designing them to be useful, firing only when there’s a genuine problem and staying silent otherwise.

Let’s imagine we’re monitoring a critical web service, user-auth-service. We want an alert if it’s unhealthy, but we don’t want it firing every time a single request times out if the service is otherwise recovering.

Here’s a typical Prometheus setup for our user-auth-service. We’ll use http_requests_total (a counter that increments with each request) and http_request_duration_seconds (a histogram of request latencies).

# prometheus.yml
scrape_configs:
  - job_name: 'user-auth-service'
    static_configs:
      - targets: ['user-auth-service:8080']

And here’s a basic Prometheus alert rule in alert.rules.yml:

groups:
- name: auth_service_alerts
  rules:
  - alert: HighErrorRate
    expr: |
      sum(rate(http_requests_total{job="user-auth-service", code=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{job="user-auth-service"}[5m]))
      > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate for user-auth-service"
      description: "More than 5% of requests to user-auth-service are failing (HTTP 5xx) over the last 5 minutes."

This alert fires if more than 5% of requests to user-auth-service return a 5xx error over a 5-minute window, and only after that condition has held for a further 5 minutes (for: 5m). This is a good start: it ignores transient blips. But what if the service is slow but not failing? Or what if Prometheus can’t scrape it at all, so there are no requests to count? We’re missing nuance.
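To make the arithmetic concrete, here is a rough Python sketch of what the expression computes. The sample values and the simple_rate helper are made up for illustration; PromQL’s real rate() also handles counter resets and extrapolates to the window boundaries, which this sketch ignores.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample.

    A simplification of PromQL's rate(): no counter-reset handling and no
    extrapolation to the window edges.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter readings over a 5-minute window, as (seconds, count) pairs.
all_requests = [(0, 10_000), (300, 13_000)]  # 3000 requests in 300 s
errors_5xx = [(0, 40), (300, 220)]           # 180 of them returned 5xx

error_ratio = simple_rate(errors_5xx) / simple_rate(all_requests)
condition_true = error_ratio > 0.05          # the rule's threshold

print(f"error ratio = {error_ratio:.3f}, condition true = {condition_true}")
```

With these numbers the ratio is about 6%, so the expression returns a result. The alert itself would still wait for the for: 5m period before firing.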

The real power comes from combining metrics that tell a story. Instead of just looking at error rates, let’s consider availability and latency together.

Here’s a more sophisticated alert that tries to capture true unhealthiness:

groups:
- name: auth_service_alerts
  rules:
  - alert: UserAuthServiceUnhealthy
    expr: |
      (
        # High error rate
        (sum(rate(http_requests_total{job="user-auth-service", code=~"5.."}[5m])) / sum(rate(http_requests_total{job="user-auth-service"}[5m]))) > 0.1
        or
        # High latency for a significant portion of requests
        (histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="user-auth-service"}[5m])) by (le)) > 2)
        or
        # Target could not be scraped at all
        (up{job="user-auth-service"} == 0)
      )
      # No trailing `and (vector(1))` here: it would drop the labeled up == 0 branch.
      # Persistence is handled by the 'for' clause below.
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "User-auth-service is unhealthy"
      description: |
        The user-auth-service is experiencing critical issues. At least one of
        the following has persisted for 10 minutes: a 5xx error ratio above 10%,
        a 99th percentile latency above 2s, or a failed scrape (up == 0).

        Value of the condition that fired: {{ $value }}

This alert uses or to combine three distinct failure modes, meaning any one of them will trigger the alert if it persists. The for: 10m clause ensures that the condition must be true for a full 10 minutes before firing, preventing flapping.

The first condition, (sum(rate(http_requests_total{job="user-auth-service", code=~"5.."}[5m])) / sum(rate(http_requests_total{job="user-auth-service"}[5m]))) > 0.1, is the same ratio as in our previous rule, still computed over a 5-minute rate window, but with the threshold raised from 5% to 10%. That makes it less sensitive to errors than the first rule, not more, so this branch fires only on more severe failure rates.

The second condition, (histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="user-auth-service"}[5m])) by (le)) > 2), checks whether the 99th percentile of request durations exceeds 2 seconds. While the alert is quiet, at least 99% of requests are completing within 2 seconds; once the condition is true, more than 1% are slower than that. If the 99th percentile jumps to, say, 5 seconds, requests might not be failing with 5xx errors, but they are becoming unacceptably slow for a small but significant number of users. histogram_quantile is the PromQL function that estimates quantiles from a histogram’s cumulative buckets. The rate over [5m] turns each bucket counter into a per-second rate for the last 5 minutes, and sum ... by (le) then aggregates those rates across all instances of user-auth-service while keeping the le label that histogram_quantile needs.
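The estimation step can be sketched in a few lines of Python. This is an illustration of the linear-interpolation idea behind histogram_quantile, not Prometheus’s actual code; the bucket boundaries and counts are invented, and edge cases such as empty histograms or negative bounds are left out.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies beyond the last finite bucket: report that bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts, keyed by their `le` upper bound in seconds.
buckets = [(0.5, 900), (1.0, 970), (2.0, 985), (5.0, 998), (math.inf, 1000)]

p99 = estimate_quantile(0.99, buckets)
print(f"estimated p99 = {p99:.2f}s, above 2s: {p99 > 2}")
```

Because the estimate is interpolated inside a bucket, its precision depends on where the bucket boundaries sit. A threshold such as 2 seconds is easiest to reason about when it coincides with a bucket boundary.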

The third condition, (up{job="user-auth-service"} == 0), is a simple but crucial check. The up metric is a built-in Prometheus metric that indicates whether the target was successfully scraped. If up is 0, it means Prometheus couldn’t even reach the user-auth-service to get metrics, implying the service is likely down or unreachable.

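The and (vector(1)) part deserves a warning. It is sometimes appended to expressions like this one as a supposed PromQL trick, but it does nothing for persistence: the for: 10m clause on the alert rule itself is what enforces the duration. It can also do harm. PromQL’s and keeps a left-hand sample only if the right-hand side has a sample with exactly the same labels, and vector(1) has no labels at all. The error-rate and latency conditions also produce results without labels, so they pass through, but the up == 0 result carries job and instance labels and would be silently dropped, so a failed scrape would never fire this alert. The safe form is the parenthesized or expression on its own, with for: 10m as the only stability mechanism.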
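The label-matching behavior of and is easy to check with a toy model. The sketch below represents an instant vector as a Python dictionary from label set to value; the helper names and sample values are hypothetical, and the model covers only the default matching (identical label sets, metric name ignored).

```python
def labelset(**labels):
    """Hashable stand-in for a PromQL label set."""
    return frozenset(labels.items())


def promql_and(left, right):
    """Keep left-hand samples whose label set also appears on the right."""
    return {ls: value for ls, value in left.items() if ls in right}


# sum(...) without a `by` clause yields a single sample with no labels.
error_branch = {labelset(): 0.12}
# up keeps its target labels.
up_branch = {labelset(job="user-auth-service", instance="user-auth-service:8080"): 0.0}
# vector(1) is a single sample with no labels.
vector_one = {labelset(): 1.0}

kept_error = promql_and(error_branch, vector_one)
kept_up = promql_and(up_branch, vector_one)

print("error branch survives:", bool(kept_error))
print("up == 0 branch survives:", bool(kept_up))
```

In this model the unlabeled error-rate result survives the and, while the labeled up result is filtered out, which is why combining a labeled branch with vector(1) is risky.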

The for clause is your best friend for alert stability. It requires the expression to be true continuously for the specified duration. This filters out transient spikes that would otherwise cause alert storms. A common mistake is to set for too low (e.g., 1m or 30s), leading to flapping alerts. For critical services, 5m to 15m is often a good starting point.
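As a rough mental model, the for clause behaves like a small state machine: inactive, then pending once the expression first returns a result, then firing once it has stayed true for the full duration. The Python sketch below is a simplification under assumed names (alert_states, a fixed evaluation interval) and is not how Prometheus is implemented.

```python
def alert_states(condition_results, interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_results: one boolean per evaluation, True when the
    expression returned a result at that evaluation.
    """
    states = []
    active_since = None
    for i, is_true in enumerate(condition_results):
        now = i * interval_s
        if not is_true:
            active_since = None  # any gap resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# One evaluation per minute with for: 5m. A single false evaluation at
# minute 3 restarts the clock, so the alert only fires at minute 9.
results = [True, True, True, False, True, True, True, True, True, True]
states = alert_states(results, interval_s=60, for_s=300)
print(states)
```

The reset on a single false evaluation is what filters out blips, and it is also why a longer for delays detection of real outages by the same amount.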

One of the most counterintuitive aspects of alert design is that the absence of data can be more critical than bad data. An alert on up == 0 catches total outages. But what about a service that’s partially failing or degraded in a way that doesn’t trigger error codes or high latency? You might need to alert on lack of metrics. For instance, if you expect a certain number of requests per second, and that rate drops to zero, that’s a problem.

groups:
- name: auth_service_alerts
  rules:
  - alert: UserAuthServiceMetricsDied
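    expr: |
      sum(rate(http_requests_total{job="user-auth-service"}[5m])) == 0
      or
      absent(http_requests_total{job="user-auth-service"})
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "User-auth-service is reporting no request traffic"
      description: "The user-auth-service has reported no requests, or no http_requests_total series at all, for 10 minutes. It might be down or misconfigured."

This alert fires if the total request rate (any status code) stays at zero for 10 minutes, or if the http_requests_total series are missing altogether. The first branch catches a service that still answers health checks but isn’t processing real traffic. The second is needed because == 0 can only match samples that exist: when the series disappear, sum(rate(...)) returns nothing rather than 0, and only absent() reports that case, such as a crash that prevents metrics generation.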
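One subtlety with zero-rate checks is that a comparison such as == 0 can only match samples that exist. The toy model below uses plain Python dictionaries (label set to value) rather than anything from Prometheus, and its absent helper ignores the labels the real function derives from the selector. It contrasts a series that reports zero with one that has disappeared.

```python
def equals_zero(vector):
    """Model of `expr == 0`: keep only samples whose value is 0."""
    return {ls: value for ls, value in vector.items() if value == 0}


def absent(vector):
    """Model of absent(): a single sample with value 1 only when the input is empty."""
    return {frozenset(): 1.0} if not vector else {}


idle = {frozenset(): 0.0}  # series exist, but no requests arrive
gone = {}                  # series vanished, so the query returns nothing

print("idle matched by == 0:", bool(equals_zero(idle)))
print("gone matched by == 0:", bool(equals_zero(gone)))
print("gone matched by absent():", bool(absent(gone)))
```

In this model the vanished series is invisible to the == 0 comparison and is only caught by absent(), which is the usual reason to pair the two checks.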

The next thing you’ll likely grapple with is alert deduplication and grouping. When a single underlying issue causes multiple alerts to fire (e.g., high latency on service A, which causes errors on service B), you want to see one consolidated incident, not a flood of individual alerts. Understanding how Alertmanager routes and groups Prometheus alerts will be your next step.