The most surprising thing about circuit breakers is that their primary function isn’t to prevent errors, but to manage them gracefully, transforming cascading failures into predictable, isolated incidents.

Let’s see this in action with a common pattern: a Java application using Resilience4j, exposing metrics to Prometheus, and visualizing them in Grafana.

Imagine a service, let’s call it product-catalog, that depends on an external inventory-service. If inventory-service becomes slow or unresponsive, product-catalog shouldn’t keep hammering it until its own threads and connections are exhausted and it fails too. Instead, a circuit breaker wrapped around the call to inventory-service detects the failures and stops sending calls for a while.

Here’s a snippet of how that might look in code (using Resilience4j):

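import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import java.time.Duration;
import java.util.List;
import java.util.function.Supplier;

// Configure the circuit breaker
CircuitBreakerConfig circuitBreakerConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(50) // Open when at least 50% of calls in the window fail
    .waitDurationInOpenState(Duration.ofSeconds(5)) // Stay open for 5 seconds
    .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(10) // Evaluate the last 10 calls
    .build();

CircuitBreakerRegistry circuitBreakerRegistry = CircuitBreakerRegistry.of(circuitBreakerConfig);
CircuitBreaker circuitBreaker = circuitBreakerRegistry.circuitBreaker("inventoryService");

// Wrap the call to the inventory service
// (inventoryService, productId, and log come from the surrounding class)
Supplier<List<Product>> inventoryCall = () -> inventoryService.getInventoryLevels(productId);
Supplier<List<Product>> decoratedInventoryCall = CircuitBreaker.decorateSupplier(circuitBreaker, inventoryCall);

try {
    List<Product> inventory = decoratedInventoryCall.get();
    // Process inventory
} catch (CallNotPermittedException e) {
    // The breaker is open: the call was rejected without reaching inventory-service
    log.warn("Inventory circuit breaker is open; using fallback", e);
} catch (Exception e) {
    // The call itself failed: handle fallback or re-throw
    log.error("Failed to get inventory", e);
}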

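Resilience4j can publish metrics for each circuit breaker through Micrometer (via the resilience4j-micrometer module, which the Spring Boot starter wires up for you), and Prometheus scrapes them from the application’s metrics endpoint. Exact metric names and label values vary with the Resilience4j and Micrometer versions; for example, the Micrometer binding tags call outcomes with kind rather than state. Check what your endpoint actually exposes before building queries. You’ll see metrics like: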

  • resilience4j_circuitbreaker_calls_total{name="inventoryService", state="SUCCESS"}: Count of successful calls.
  • resilience4j_circuitbreaker_calls_total{name="inventoryService", state="FAILURE"}: Count of failed calls.
  • resilience4j_circuitbreaker_state{name="inventoryService", state="OPEN"}: 1 if the circuit is open, 0 otherwise.
  • resilience4j_circuitbreaker_state{name="inventoryService", state="CLOSED"}: 1 if the circuit is closed, 0 otherwise.
  • resilience4j_circuitbreaker_state{name="inventoryService", state="HALF_OPEN"}: 1 if the circuit is half-open, 0 otherwise.

Now, let’s bring in Grafana. You’ll need a Prometheus data source configured in Grafana. Then, you can create panels to visualize these metrics.

A good starting point is a row for each critical circuit breaker. For the inventoryService breaker, you might have these panels:

  1. Circuit Breaker State:

    • Query: resilience4j_circuitbreaker_state{name="inventoryService"}
    • Visualization: Stat or Gauge.
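    • Settings: The query returns one series per state, each with value 1 (active) or 0. Append == 1 to keep only the active state, or use value mappings and per-series colors (e.g., Red for OPEN, Yellow for HALF_OPEN, Green for CLOSED). This gives you an immediate visual cue.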
  2. Call Success Rate:

    • Query: sum(rate(resilience4j_circuitbreaker_calls_total{name="inventoryService", state="SUCCESS"}[5m])) / sum(rate(resilience4j_circuitbreaker_calls_total{name="inventoryService"}[5m])) * 100
    • Visualization: Graph.
    • Settings: Show as percentage. This shows how often calls are succeeding when they are allowed to proceed.
  3. Failure Rate (over a 1-minute window):

    • Query: sum(rate(resilience4j_circuitbreaker_calls_total{name="inventoryService", state="FAILURE"}[1m])) / sum(rate(resilience4j_circuitbreaker_calls_total{name="inventoryService"}[1m])) * 100
    • Visualization: Graph.
    • Settings: Show as percentage. Both sides of the division are wrapped in sum() so their label sets match; without it the query returns no data. This panel approximates the failure rate the breaker evaluates. The breaker itself uses its own sliding window (the last 10 calls here), not a 1-minute time range, so the two can diverge. You’d expect this to spike shortly before the breaker state turns OPEN.
  4. Total Calls:

    • Query: sum(rate(resilience4j_circuitbreaker_calls_total{name="inventoryService"}[5m]))
    • Visualization: Graph.
    • Settings: Show as "Calls/sec". This helps understand the load on the circuit breaker.
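Since each critical breaker gets its own row, the four queries above can be generated from the breaker name rather than copied by hand. The following is a minimal sketch in plain Java; BreakerPanelQueries and its methods are hypothetical helpers, and the metric and label names are taken as-is from the queries above, so adjust them to whatever your metrics endpoint actually exposes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that builds the PromQL for one breaker's dashboard row. */
class BreakerPanelQueries {
    // Metric names as used in this article; they may differ in your setup.
    private static final String CALLS = "resilience4j_circuitbreaker_calls_total";
    private static final String STATE = "resilience4j_circuitbreaker_state";

    /** Percentage of calls with the given outcome label over the given range. */
    static String outcomeRate(String breaker, String outcome, String range) {
        // Both sides are aggregated with sum() so their label sets match.
        String matched = String.format(
            "sum(rate(%s{name=\"%s\", state=\"%s\"}[%s]))", CALLS, breaker, outcome, range);
        String all = String.format(
            "sum(rate(%s{name=\"%s\"}[%s]))", CALLS, breaker, range);
        return matched + " / " + all + " * 100";
    }

    /** Panel title -> PromQL query, in display order. */
    static Map<String, String> forBreaker(String breaker) {
        Map<String, String> panels = new LinkedHashMap<>();
        panels.put("Circuit Breaker State", String.format("%s{name=\"%s\"}", STATE, breaker));
        panels.put("Call Success Rate", outcomeRate(breaker, "SUCCESS", "5m"));
        panels.put("Failure Rate", outcomeRate(breaker, "FAILURE", "1m"));
        panels.put("Total Calls", String.format("sum(rate(%s{name=\"%s\"}[5m]))", CALLS, breaker));
        return panels;
    }

    public static void main(String[] args) {
        forBreaker("inventoryService")
            .forEach((title, query) -> System.out.println(title + ": " + query));
    }
}
```

Calling forBreaker with a different breaker name yields the same four queries for that breaker, which keeps the rows consistent across the dashboard.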

The mental model is this: the circuit breaker acts as a proxy that intercepts calls to a downstream service. Initially, it’s CLOSED, allowing all calls through while it tracks successes and failures within a defined window (e.g., the last 10 calls). If the failure rate reaches the threshold (e.g., 50%), it trips to the OPEN state. In the OPEN state, it immediately rejects calls without contacting the downstream service, returning an error or a fallback. After the configured waitDurationInOpenState (e.g., 5 seconds), it transitions to HALF_OPEN, where it lets a limited number of trial calls through (in Resilience4j, permittedNumberOfCallsInHalfOpenState, which defaults to 10). If the failure rate of those trial calls stays below the threshold, the breaker closes; otherwise it re-opens. This cycle gives the downstream service time to recover while protecting the upstream caller from constant failures.
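To make the state machine concrete, here is a deliberately simplified count-based breaker in plain Java. It is an illustration of the lifecycle only, not Resilience4j’s implementation: SimpleBreaker, its fields, and the trialCalls parameter are hypothetical, time is passed in explicitly to keep it deterministic, and it does not cap how many calls are admitted while HALF_OPEN.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified count-based breaker for illustration; not Resilience4j's implementation. */
class SimpleBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;        // calls kept in the sliding window
    private final int thresholdPercent;  // failure rate (%) at which the breaker opens
    private final long waitMillis;       // time to stay OPEN before probing
    private final int trialCalls;        // calls evaluated while HALF_OPEN
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failed call
    private State state = State.CLOSED;
    private long openedAt;

    SimpleBreaker(int windowSize, int thresholdPercent, long waitMillis, int trialCalls) {
        this.windowSize = windowSize;
        this.thresholdPercent = thresholdPercent;
        this.waitMillis = waitMillis;
        this.trialCalls = trialCalls;
    }

    State state() { return state; }

    /** Returns true if a call may proceed at time nowMillis. */
    boolean tryAcquire(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAt >= waitMillis) {
            state = State.HALF_OPEN; // wait is over: start probing
            window.clear();
        }
        return state != State.OPEN;
    }

    /** Records the outcome of a call that was allowed through. */
    void record(boolean failed, long nowMillis) {
        int limit = (state == State.HALF_OPEN) ? trialCalls : windowSize;
        window.addLast(failed);
        if (window.size() > limit) window.removeFirst();
        if (window.size() < limit) return; // not enough calls to judge yet

        long failures = window.stream().filter(f -> f).count();
        boolean tooManyFailures = failures * 100 >= (long) thresholdPercent * limit;
        if (tooManyFailures) {
            state = State.OPEN;      // trip, or re-open after failed probes
            openedAt = nowMillis;
            window.clear();
        } else if (state == State.HALF_OPEN) {
            state = State.CLOSED;    // probes succeeded
            window.clear();
        }
    }

    public static void main(String[] args) {
        SimpleBreaker breaker = new SimpleBreaker(10, 50, 5_000, 3);
        for (int i = 0; i < 10; i++) {
            breaker.record(i % 2 == 0, 0); // every other call fails: 50% failure rate
        }
        System.out.println(breaker.state());           // OPEN
        System.out.println(breaker.tryAcquire(1_000)); // false: still waiting
        System.out.println(breaker.tryAcquire(5_000)); // true: now HALF_OPEN
        for (int i = 0; i < 3; i++) {
            breaker.record(false, 5_000);              // three successful probes
        }
        System.out.println(breaker.state());           // CLOSED
    }
}
```

Setting trialCalls to 1 gives the single-probe variant, where one successful call closes the breaker and one failed call re-opens it.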

A common pitfall is configuring the slidingWindowSize and failureRateThreshold too aggressively or too leniently. If the failureRateThreshold is too low (e.g., 10%), the breaker might trip on transient network blips. If it’s too high (e.g., 90%), it might keep hammering a truly broken service. The waitDurationInOpenState is also critical; too short, and you don’t give the downstream service enough breathing room; too long, and your users experience unavailability for extended periods.
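A quick way to sanity-check a configuration is to work out how many failures in a full window it takes to trip the breaker. The sketch below does that arithmetic in plain Java; ThresholdMath and failuresToTrip are hypothetical names, and it assumes the breaker opens when the failure rate is equal to or greater than the threshold.

```java
/** Hypothetical helper: how many failures in a full window open the breaker. */
class ThresholdMath {
    /** Smallest failure count in a full window whose rate reaches thresholdPercent. */
    static int failuresToTrip(int windowSize, int thresholdPercent) {
        // Integer ceiling of windowSize * thresholdPercent / 100.
        return (windowSize * thresholdPercent + 99) / 100;
    }

    public static void main(String[] args) {
        int[] thresholds = {10, 50, 90};
        for (int t : thresholds) {
            System.out.println(t + "% of 10 calls -> "
                + failuresToTrip(10, t) + " failure(s) to open");
        }
    }
}
```

With a window of 10 calls, a 10% threshold trips on a single failure, 50% needs 5, and 90% needs 9. The same 10% threshold over a window of 100 calls needs 10 failures, which is why a small window makes a low threshold so sensitive to transient blips.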

By observing the resilience4j_circuitbreaker_state alongside the resilience4j_circuitbreaker_calls_total metrics for SUCCESS and FAILURE, you can see the breaker’s lifecycle in response to real-world conditions.

The next logical step is to correlate these circuit breaker events with actual error logs or user-facing latency spikes in your application.
