Circuit breakers don’t just trip; they actively prevent cascading failures by deliberately refusing to make a connection when a downstream service is unhealthy.
Let’s say you have a User Service that calls out to a Recommendation Service. If the Recommendation Service starts returning errors or timing out, the User Service’s circuit breaker for Recommendation Service will trip. This means the User Service stops sending requests to Recommendation Service for a configured period, returning a fallback (like cached data or an empty list) instead. This protects the Recommendation Service from being overwhelmed and also prevents the User Service from slowing down or crashing due to waiting for unresponsive calls.
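To make the trip-and-fallback behaviour concrete, here is a minimal sketch of a breaker in Python. It is not how Resilience4j, Polly, or Hystrix implement theirs (those use sliding windows and failure rates); the class name, the consecutive-failure threshold, and the injectable clock are all assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Tiny illustrative breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, failure_threshold=3, open_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.open_seconds = open_seconds            # how long to refuse calls once open
        self.clock = clock                          # injectable so the sketch is testable
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        # While OPEN, skip the downstream call until the wait period elapses.
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.open_seconds:
                return fallback()
            self.state = "HALF_OPEN"  # wait is over: allow one trial call

        try:
            result = func()
        except Exception:
            self._record_failure()
            return fallback()

        # Any success closes the breaker and resets the failure count.
        self.state = "CLOSED"
        self.failures = 0
        return result

    def _record_failure(self):
        self.failures += 1
        # A failed trial call, or too many consecutive failures, opens the breaker.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

In the User Service example, func would be the call to the Recommendation Service and fallback would return cached data or an empty list. Once the breaker is OPEN, func is not invoked at all until open_seconds has passed.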
Here’s how you might set up alerts for tripped circuit breakers, using Prometheus and Alertmanager as an example.
First, ensure your services are instrumented to expose metrics that Prometheus can scrape. A common pattern is to use a resilience library (like Resilience4j for Java, Polly for .NET, or Hystrix for older Java apps) that integrates with Micrometer or directly exposes Prometheus-compatible metrics.
The key metric we’re interested in is a counter that increments every time a circuit breaker transitions to an "open" state. For Resilience4j, this might look like resilience4j_circuitbreaker_state_changes{state="OPEN", name="recommendationService"}. Exact metric names, label names, and label casing vary by library and version, so check your service’s metrics endpoint before writing rules against them.
Here’s a Prometheus scrape configuration that would collect such metrics:
scrape_configs:
  - job_name: 'my-services'
    static_configs:
      - targets: ['service-a:8080', 'service-b:8080'] # Replace with your service endpoints
    # Target relabeling runs before the scrape, while __address__ is still available.
    relabel_configs:
      # Strip the port so the instance label is just the host name.
      - source_labels: [__address__]
        regex: '(.*):8080'
        target_label: instance
        replacement: '$1'
In Prometheus, you’d want to query for circuit breakers that are currently open. A first attempt might be:
resilience4j_circuitbreaker_state_changes{state="OPEN"} > 0
However, this query only tells you that a breaker has opened at some point. A counter only goes up until the process restarts, so the expression stays true long after the breaker has closed again. To alert on a currently open breaker, you need to look at the state itself. Libraries tend to expose this in one of two ways: as a single gauge whose value encodes the state (for example 0 for CLOSED, 1 for OPEN, and 2 for HALF-OPEN), or as one series per state whose value is 1 for the active state and 0 otherwise. If your library exposes the latter, something like resilience4j_circuitbreaker_state{state="OPEN", name="recommendationService"} == 1 is a direct indicator.
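To show what these two kinds of series look like on a metrics endpoint, here is a small Python sketch that renders them in the Prometheus text exposition format. The metric names circuitbreaker_state and circuitbreaker_state_changes_total and the function render_breaker_metrics are hypothetical; a real service would normally let its metrics library produce this output.

```python
STATES = ("CLOSED", "OPEN", "HALF_OPEN")


def render_breaker_metrics(name, current_state, transitions):
    """Render one breaker's metrics as Prometheus text-format lines.

    transitions maps a state to the number of transitions into that state.
    Metric names here are illustrative, not those of any specific library.
    """
    lines = ["# TYPE circuitbreaker_state gauge"]
    for state in STATES:
        # One series per state: 1 for the active state, 0 for the others.
        value = 1 if state == current_state else 0
        lines.append(f'circuitbreaker_state{{name="{name}",state="{state}"}} {value}')

    lines.append("# TYPE circuitbreaker_state_changes_total counter")
    for state in STATES:
        # Cumulative count of transitions into each state since process start.
        count = transitions.get(state, 0)
        lines.append(
            f'circuitbreaker_state_changes_total{{name="{name}",state="{state}"}} {count}'
        )
    return "\n".join(lines) + "\n"


print(render_breaker_metrics("recommendationService", "OPEN", {"OPEN": 3, "CLOSED": 2}))
```

The gauge answers "what state is it in now", while the counter answers "how many times has it entered each state", which is why only the gauge supports a direct == 1 check.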
If you only have the state change counter, you have to infer the current state: a breaker is open if its most recent transition was to OPEN rather than to CLOSED or HALF-OPEN. That ordering is awkward to express in PromQL, so a common approximation is to compare how much each counter moved over a recent window. If there were OPEN transitions in the window and no CLOSED transitions, the breaker is likely stuck OPEN.
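The inference described above can be sketched outside PromQL to see where it breaks down. The following Python function is an illustration only: infer_state and the shape of its input (a list of scraped counter snapshots, oldest first) are assumptions, and counter resets after a restart are not handled.

```python
def infer_state(scrapes):
    """Guess the current state from successive transition-counter snapshots.

    scrapes: list of dicts mapping state -> cumulative transition count,
    ordered oldest first. Returns the state most recently transitioned into,
    or None when the history cannot tell.
    """
    # Walk backwards through consecutive (previous, current) scrape pairs.
    for prev, curr in zip(reversed(scrapes[:-1]), reversed(scrapes[1:])):
        increased = [s for s in curr if curr[s] > prev.get(s, 0)]
        if len(increased) == 1:
            return increased[0]
        if len(increased) > 1:
            # Several transitions happened between two scrapes: their order is lost.
            return None
    return None  # no transition observed in the retained history


history = [
    {"OPEN": 0, "CLOSED": 0},
    {"OPEN": 1, "CLOSED": 0},  # breaker opened between the first two scrapes
    {"OPEN": 1, "CLOSED": 0},  # nothing changed since
]
print(infer_state(history))
```

The None cases are the reason a state gauge is preferable when it exists: counters lose the ordering of transitions that fall within a single scrape interval.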
A more robust Prometheus rule for alerting on an open circuit breaker, assuming a state gauge is available (e.g., resilience4j_circuitbreaker_state with state="OPEN" being 1 when open):
groups:
  - name: circuitbreaker_alerts
    rules:
      - alert: CircuitBreakerOpen
        expr: resilience4j_circuitbreaker_state{state="OPEN"} == 1
        for: 5m # Alert if the breaker has been open for at least 5 minutes
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker {{ $labels.name }} is OPEN for service {{ $labels.instance }}"
          description: "The circuit breaker named '{{ $labels.name }}' on instance '{{ $labels.instance }}' has been OPEN for 5 minutes. This indicates a downstream service is likely unavailable or degraded."
The expression resilience4j_circuitbreaker_state{state="OPEN"} == 1 checks directly whether the OPEN series of the state gauge is 1. The for: 5m clause filters out transient blips: the expression must hold at every rule evaluation for 5 minutes before the alert moves from pending to firing, and a single evaluation where the breaker is not open resets the timer.
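The effect of the for: clause is easiest to see on a timeline. This Python sketch is a simplified model of the pending and firing behaviour, not Prometheus’s actual implementation; the function alert_states and its fixed evaluation interval are assumptions for illustration.

```python
def alert_states(expr_results, interval_s, for_s):
    """Model how a `for:` duration delays an alert.

    expr_results: one boolean per rule evaluation (True = expression matched).
    interval_s:   seconds between evaluations.
    for_s:        the `for:` duration in seconds.
    Returns "inactive", "pending", or "firing" for each evaluation.
    """
    states = []
    active_since = None
    for i, is_true in enumerate(expr_results):
        t = i * interval_s
        if not is_true:
            active_since = None  # any miss resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = t
        # Fire only once the expression has held continuously for `for_s`.
        states.append("firing" if t - active_since >= for_s else "pending")
    return states


# One-minute evaluations with `for: 2m`; the breaker closes briefly at minute 2.
print(alert_states([True, True, False, True, True, True], interval_s=60, for_s=120))
# -> ['pending', 'pending', 'inactive', 'pending', 'pending', 'firing']
```

A breaker that flaps between OPEN and HALF-OPEN faster than the for: duration will therefore never fire this alert, which is worth keeping in mind when choosing the value.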
If you don’t have a state gauge and only have state change counters:
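groups:
  - name: circuitbreaker_alerts_from_changes
    rules:
      - alert: CircuitBreakerStuckOpen
        # This tries to infer if it's stuck open.
        # It looks for a recent OPEN transition and no recent CLOSED transition.
        expr: |
          sum by (name, instance) (increase(resilience4j_circuitbreaker_state_changes{state="OPEN"}[15m])) > 0
          unless
          sum by (name, instance) (increase(resilience4j_circuitbreaker_state_changes{state="CLOSED"}[15m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker {{ $labels.name }} appears stuck OPEN on {{ $labels.instance }}"
          description: "The circuit breaker '{{ $labels.name }}' on instance '{{ $labels.instance }}' has had OPEN state changes in the last 15 minutes but no CLOSED state changes. It is likely stuck OPEN."
This second rule is a heuristic. It checks whether there has been at least one OPEN transition in the last 15 minutes and no CLOSED transition in the same window; if that holds for 5 minutes, the alert fires. The lookback window is deliberately longer than the for: duration. With a window equal to the for: duration, a single OPEN transition would age out of the window just as the for: period completed, and the alert might never fire. The unless operator drops any breaker that also closed in the window, and it still works when no CLOSED series exists yet, which a comparison against == 0 would not. The sum by (name, instance) aggregates these metrics per circuit breaker name and instance. Note that a breaker that stays open for longer than the window stops matching, so this rule reports breakers that opened recently and did not recover, not every open breaker.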
Once Prometheus detects an alert, it sends it to Alertmanager. Your Alertmanager configuration would need a route to handle severity: critical alerts and send them to your desired notification channel (Slack, PagerDuty, email, etc.).
A minimal Alertmanager receiver configuration might look like this:
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        api_url: 'YOUR_SLACK_WEBHOOK_URL'
        send_resolved: true
        title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] Circuit Breaker {{ .CommonLabels.alertname }} on {{ .CommonLabels.instance }}'
        # Double-quoted so that YAML turns \n into real line breaks.
        text: "{{ range .Alerts }}*Alert:* {{ .Annotations.summary }}\n*Description:* {{ .Annotations.description }}\n*Details:*{{ range .Labels.SortedPairs }} {{ .Name }}={{ .Value }}{{ end }}\n{{ end }}"
route:
  group_by: ['alertname', 'instance', 'name']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'slack-notifications'
      continue: true # Keep evaluating sibling routes after this match; there are none in this example, so this is the final destination.
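The group_by: ['alertname', 'instance', 'name'] setting controls how alerts are batched. Alerts that share all three label values are delivered in a single notification, and follow-ups are throttled by group_interval and repeat_interval, which keeps a flapping breaker from flooding the channel. If one downstream outage trips the same breaker on many instances, consider dropping instance from group_by so the outage produces one notification rather than one per instance.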
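The grouping itself is just a bucketing of alerts by the values of the group_by labels. The Python sketch below models that idea only; group_alerts is a hypothetical helper, and Alertmanager’s real pipeline also applies timing, inhibition, and silencing.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts (dicts of labels) by the values of the group_by labels.

    A label that is missing from an alert is treated as an empty string.
    """
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(label, "") for label in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "CircuitBreakerOpen", "instance": "service-a", "name": "recommendationService"},
    {"alertname": "CircuitBreakerOpen", "instance": "service-b", "name": "recommendationService"},
]

per_instance = group_alerts(alerts, ["alertname", "instance", "name"])
per_breaker = group_alerts(alerts, ["alertname", "name"])
print(len(per_instance), len(per_breaker))
```

With instance in the key, the two alerts land in separate groups and produce two notifications; without it, they share one group and arrive together.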
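The summary and description annotations are templated by Prometheus when the rule is evaluated, so {{ $labels.name }} and {{ $labels.instance }} are already filled in by the time the alert reaches Alertmanager. Alertmanager then renders the title and text templates of the Slack receiver around them. The send_resolved: true setting means you’ll also get a notification when the alert stops firing, which for these rules means the breaker is no longer reported as OPEN.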
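The most surprising thing about circuit breaker alerts is that the absence of an alert can also be a problem. If a downstream service is completely dead, the circuit breaker might trip, but if the upstream service itself becomes unhealthy because of the downstream issue, Prometheus may fail to scrape it. The breaker’s series then go stale instead of showing OPEN, and the rules above match nothing. Pair them with an alert on scrape failures, such as up == 0 for the job, so that missing data is not mistaken for healthy breakers.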
The next thing you’ll want to tackle is setting up fallback mechanisms and how to automatically recover or reset tripped circuit breakers.