Circuit breakers are your last line of defense against cascading failures, and when they trip, it’s usually because something else has already gone wrong.

Here’s how you monitor them and, crucially, how you know when they have actually recovered.

Let’s look at a real-world scenario using Netflix’s Hystrix, a popular circuit breaker library. Imagine a service UserService that depends on a ProfileService. If ProfileService starts failing, UserService will use a circuit breaker to stop calling it, preventing UserService from also crashing due to timeouts or errors.

Monitoring the State

The most direct way to see what a circuit breaker is doing is through its exposed metrics. Hystrix publishes these via JMX or can be configured to send them to a monitoring system like Prometheus.

A typical Hystrix circuit breaker reports its state as one of three: CLOSED, OPEN, or HALF-OPEN.

  • CLOSED: The circuit is normal. Requests are flowing through to the dependent service.
  • OPEN: The circuit has tripped. Requests are failing fast (short-circuiting) without even attempting to call the dependent service. This state is temporary.
  • HALF-OPEN: The circuit has been OPEN for a configured duration. It’s now allowing a single request to pass through to see if the dependent service has recovered. If that single request succeeds, the breaker will transition back to CLOSED. If it fails, it immediately goes back to OPEN.
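To make the transitions concrete, here is a minimal Java sketch of the three-state logic described above. It is a teaching model, not Hystrix’s actual implementation: the class name SketchBreaker, the trip() and record() methods, and the injectable clock are all hypothetical names chosen for illustration.

```java
import java.util.function.LongSupplier;

/** Simplified three-state breaker; a teaching model, not Hystrix's implementation. */
class SketchBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final long sleepWindowMillis;  // how long to stay OPEN before a trial request
    private final LongSupplier clock;      // injectable clock (assumption, eases testing)
    private State state = State.CLOSED;
    private long openedAt;

    SketchBreaker(long sleepWindowMillis, LongSupplier clock) {
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    /** Called when the rolling error rate crosses the trip threshold. */
    synchronized void trip() {
        state = State.OPEN;
        openedAt = clock.getAsLong();
    }

    /** Returns true if the caller may attempt the dependency call. */
    synchronized boolean allowRequest() {
        if (state == State.CLOSED) {
            return true;
        }
        if (state == State.OPEN && clock.getAsLong() - openedAt >= sleepWindowMillis) {
            state = State.HALF_OPEN;  // sleep window elapsed: let one trial through
            return true;
        }
        return false;  // still OPEN, or a trial request is already in flight
    }

    /** Report the outcome of a call that was allowed through. */
    synchronized void record(boolean success) {
        if (state != State.HALF_OPEN) {
            return;  // in this sketch, CLOSED-state outcomes are counted elsewhere
        }
        if (success) {
            state = State.CLOSED;
        } else {
            trip();  // trial failed: reopen and restart the sleep window
        }
    }

    synchronized State state() {
        return state;
    }
}
```

With a 5000 ms sleep window, a breaker tripped at t=0 rejects calls until t=5000, lets one trial through, and rejects everything else until that trial’s outcome is recorded.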

Key Metrics to Watch:

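You’ll want to monitor several metrics for each circuit breaker. The names below are descriptive labels; the exact metric names depend on the publisher you use (JMX, Micrometer, the Hystrix event stream), so check what your setup actually exports.

  1. requestVolume: The total number of requests made to the command behind the circuit breaker within the rolling window.
  2. errorPercentage: The percentage of requests that resulted in an error (exceptions, timeouts). This is the primary driver for tripping.
  3. rollingStatisticalWindowBuckets: Hystrix calculates error rates over a rolling window split into buckets (time intervals). This is a configuration value rather than a measurement, but it tells you how many buckets feed that calculation and therefore how quickly old errors age out.
  4. circuitBreakerState: The current state (CLOSED, OPEN, HALF-OPEN).
  5. latencyTotal: End-to-end latency of requests through the command, usually reported as a mean and percentiles.
  6. successes, failures, timeouts, rejections, threadPoolRejectedInvocations: More granular counts of what’s happening.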
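To see how requestVolume and errorPercentage fall out of a bucketed rolling window, here is a rough Java sketch. It is a simplified model of the idea, not Hystrix’s internal code; the class RollingErrorRate and its methods are hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```java
import java.util.Arrays;

/** Rolling error-rate counter over fixed time buckets (illustrative model only). */
class RollingErrorRate {
    private final long bucketMillis;
    private final long windowMillis;
    private final long[] bucketStart;  // start time of the data in each slot, -1 if unused
    private final int[] total;
    private final int[] errors;

    RollingErrorRate(long windowMillis, int numBuckets) {
        this.bucketMillis = windowMillis / numBuckets;
        this.windowMillis = bucketMillis * numBuckets;
        this.bucketStart = new long[numBuckets];
        this.total = new int[numBuckets];
        this.errors = new int[numBuckets];
        Arrays.fill(bucketStart, -1L);
    }

    /** Count one request outcome at the given time. */
    void record(long nowMillis, boolean error) {
        int slot = (int) ((nowMillis / bucketMillis) % total.length);
        long start = nowMillis - (nowMillis % bucketMillis);
        if (bucketStart[slot] != start) {  // slot still holds data from an earlier lap
            bucketStart[slot] = start;
            total[slot] = 0;
            errors[slot] = 0;
        }
        total[slot]++;
        if (error) {
            errors[slot]++;
        }
    }

    /** Requests still inside the window (the "request volume"). */
    int totalRequests(long nowMillis) {
        return sum(total, nowMillis);
    }

    /** Integer percentage of errors among requests in the window; 0 if empty. */
    int errorPercentage(long nowMillis) {
        int all = sum(total, nowMillis);
        return all == 0 ? 0 : (int) (100L * sum(errors, nowMillis) / all);
    }

    private int sum(int[] counts, long nowMillis) {
        int s = 0;
        for (int i = 0; i < counts.length; i++) {
            boolean live = bucketStart[i] >= 0 && nowMillis - bucketStart[i] < windowMillis;
            if (live) {
                s += counts[i];
            }
        }
        return s;
    }
}
```

Constructing it as new RollingErrorRate(10000, 10) matches Hystrix’s documented defaults of a 10-second window split into 10 buckets. Errors recorded in a bucket stop counting once that bucket falls out of the window, which is why the error percentage can drop on its own even before any successful call is made.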

Example Monitoring Setup (Prometheus/Grafana)

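If you’re using Spring Boot with Hystrix, you can expose these metrics through Micrometer. Add the micrometer-registry-prometheus dependency (the /actuator/prometheus endpoint also requires spring-boot-starter-actuator on the classpath).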

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

Then, in your application properties:

management.endpoints.web.exposure.include=prometheus,health,info
management.endpoint.prometheus.enabled=true
hystrix.metrics.enabled=true

Your Prometheus configuration would scrape the /actuator/prometheus endpoint. In Grafana, you’d create panels for:

  • Circuit Breaker State: A Stat panel showing the circuitBreakerState for key breakers.
  • Error Percentage: A Graph panel showing errorPercentage for critical dependencies. You’d alert if this exceeds a threshold (e.g., 50% for more than 30 seconds).
  • Request Volume: A Graph panel showing requestVolume to understand traffic patterns.
  • Circuit Breaker Trips: A Time Series panel showing when circuitBreakerState transitions to OPEN. This is your primary alert for a problem.
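The "50% for more than 30 seconds" style of alert is worth spelling out, because a single bad sample should not page anyone. In practice you would express this in your alerting system’s own rule language; the Java sketch below only illustrates the semantics. The class SustainedThresholdAlert and its observe() method are hypothetical.

```java
/** Fires only when a value stays above a threshold for a minimum duration. */
class SustainedThresholdAlert {
    private final double threshold;
    private final long holdMillis;
    private long breachStart = -1;  // -1 means "not currently above the threshold"

    SustainedThresholdAlert(double threshold, long holdMillis) {
        this.threshold = threshold;
        this.holdMillis = holdMillis;
    }

    /** Feed one sample; returns true while the alert should be firing. */
    boolean observe(long nowMillis, double value) {
        if (value <= threshold) {
            breachStart = -1;  // any healthy sample resets the timer
            return false;
        }
        if (breachStart < 0) {
            breachStart = nowMillis;  // first sample of a new breach
        }
        return nowMillis - breachStart >= holdMillis;
    }
}
```

With new SustainedThresholdAlert(50, 30000), samples of 60%, 70%, and 55% at 0, 15, and 30 seconds fire the alert on the third sample, while a dip to 40% in between would reset the timer.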

Diagnosing a Trip

When you see a circuit breaker in the OPEN state, it’s a symptom, not the root cause. The actual problem lies in the dependency it’s protecting.

Common Causes (Most to Least Common):

  1. Dependency Service Overload/Unresponsiveness: The ProfileService is struggling to keep up with requests.

    • Diagnosis: Check the ProfileService’s own metrics (CPU, memory, network I/O, request latency, error rates). Look for increased latency or error rates before the circuit breaker tripped. Use kubectl top pod <profile-service-pod> or cloud provider monitoring.
    • Fix: Scale out the ProfileService by adding replicas (e.g., kubectl scale deployment profile-service --replicas=5). This increases its capacity to handle load.
    • Why it works: More instances mean more processing power and resources, allowing the service to respond to requests faster and with fewer errors.
  2. Network Issues Between Services: Latency or packet loss between UserService and ProfileService.

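    • Diagnosis: Test connectivity from the UserService pod to a ProfileService pod. ping and traceroute work against pod IPs if the container image includes them, but a Service’s ClusterIP is virtual and often does not answer ICMP, so timing an actual request to the service endpoint (for example with curl) is usually more telling. Check network metrics in your Kubernetes cluster or cloud provider. Look for increased network error rates or latency on the nodes hosting these services.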
    • Fix: Investigate your network infrastructure. This could involve checking Kubernetes CNI, cloud provider VPC configurations, or physical network hardware. Ensure sufficient bandwidth and low latency paths.
    • Why it works: A stable, low-latency network ensures requests reach the ProfileService promptly and responses return quickly, reducing timeouts.
  3. Resource Exhaustion on Dependency Pod: The ProfileService pod is out of CPU or memory.

    • Diagnosis: Check kubectl top pod <profile-service-pod> -n <namespace>. It reports absolute usage (millicores and bytes), so compare the numbers against the pod’s configured limits: look for CPU usage pinned at the limit or memory usage close to it. Examine pod events: kubectl get events -n <namespace> --field-selector involvedObject.name=<profile-service-pod>.
    • Fix: Increase the resource requests/limits in the ProfileService’s Kubernetes deployment manifest. For example, change resources: { requests: { cpu: "200m", memory: "512Mi" }, limits: { cpu: "500m", memory: "1Gi" } } to resources: { requests: { cpu: "400m", memory: "1Gi" }, limits: { cpu: "1000m", memory: "2Gi" } }.
    • Why it works: Providing more CPU or memory allows the application process within the pod to execute its code and manage its memory more effectively, preventing it from becoming unresponsive.
  4. Database/External Dependency Issues: The ProfileService itself relies on another service (e.g., a database) that is slow or failing.

    • Diagnosis: Check the logs and metrics of the ProfileService for errors related to its own dependencies. If ProfileService uses a database, check the database’s health, connection pool usage, and query performance.
    • Fix: Address the underlying issue with the ProfileService’s dependencies. This might mean optimizing database queries, scaling the database, or fixing the failing external service.
    • Why it works: When the ProfileService’s own dependencies are healthy, it can process requests reliably, thereby not causing a backlog or errors that would eventually trip the upstream circuit breaker.
  5. Configuration Errors in Dependency: Incorrect configuration in the ProfileService causing it to behave erratically.

    • Diagnosis: Review recent configuration changes for the ProfileService. Check its logs for startup errors or runtime warnings related to configuration.
    • Fix: Revert or correct the problematic configuration. For example, an incorrect connection string to a database or an invalid thread pool size.
    • Why it works: Correct configuration ensures the service starts and runs as expected, without internal logical errors that lead to failures.
  6. Hystrix Configuration Too Aggressive: The circuit breaker is too sensitive.

    • Diagnosis: Examine the Hystrix configuration for the circuit breaker. Key parameters include:
      • circuitBreaker.requestVolumeThreshold: The minimum number of requests in the rolling window before the breaker will consider tripping at all (default 20).
      • circuitBreaker.errorThresholdPercentage: The percentage of errors to trigger a trip (default 50%).
      • circuitBreaker.sleepWindowInMilliseconds: How long to stay OPEN before transitioning to HALF-OPEN (default 5000ms).
    • Fix: Increase circuitBreaker.requestVolumeThreshold (e.g., to 50) or circuitBreaker.errorThresholdPercentage (e.g., to 75) so the breaker is less likely to trip on transient blips. circuitBreaker.sleepWindowInMilliseconds does not affect whether the breaker trips; it controls how soon a trial request is allowed after a trip. Raising it (e.g., to 15000) gives a struggling dependency more breathing room but delays recovery.
    • Why it works: Relaxing the two thresholds means the circuit breaker requires more evidence of sustained failure before it opens, reducing false positives from temporary network glitches or brief service hiccups.
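To get a feel for how the two trip thresholds interact, here is a small Java sketch of the decision rule as Hystrix documents it: the request volume must reach requestVolumeThreshold first, and only then is the error percentage compared against errorThresholdPercentage. The class TripDecision and the method shouldTrip are hypothetical, not part of Hystrix.

```java
/** Models the documented trip rule; not Hystrix's own code. */
class TripDecision {
    /**
     * Returns true if a breaker with the given thresholds would open,
     * given the request and error counts in the current rolling window.
     */
    static boolean shouldTrip(long totalRequests, long errorCount,
                              int requestVolumeThreshold, int errorThresholdPercentage) {
        if (totalRequests == 0 || totalRequests < requestVolumeThreshold) {
            return false;  // not enough traffic to judge, whatever the error rate
        }
        int errorPercentage = (int) (errorCount * 100 / totalRequests);
        return errorPercentage >= errorThresholdPercentage;  // trips at or above the threshold
    }

    public static void main(String[] args) {
        // Same window (30 requests, 18 errors) under default and relaxed settings.
        System.out.println("defaults (20, 50): " + shouldTrip(30, 18, 20, 50));
        System.out.println("relaxed  (50, 75): " + shouldTrip(30, 18, 50, 75));
    }
}
```

A window with 30 requests and 18 errors (60%) trips the breaker under the defaults but not under the relaxed settings, where 30 requests is below the volume threshold of 50. In a real Hystrix command these values are set through HystrixCommandProperties.Setter or the corresponding hystrix.command.* properties rather than passed around like this.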

Monitoring Recovery

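The circuit breaker entering the HALF-OPEN state is the first opportunity for recovery, not proof of it: it only means the sleep window has elapsed and a trial request is being allowed. If your publisher exposes the state directly, you’ll see the circuitBreakerState metric change; some publishers report only open versus closed, in which case watch for the breaker closing.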

  • Successful Recovery: If the single request allowed through in HALF-OPEN succeeds, the breaker immediately returns to CLOSED. You’ll see circuitBreakerState change from HALF-OPEN back to CLOSED. Crucially, the error rate (errorPercentage) should drop to near zero, and request volume should return to normal levels.
  • Failed Recovery: If the single request fails, the breaker immediately returns to OPEN. You’ll see circuitBreakerState transition from HALF-OPEN back to OPEN. This indicates the underlying problem persists.
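Because a breaker can flap between CLOSED and OPEN while the dependency is still shaky, it can help to treat recovery as "healthy for several consecutive samples" rather than "CLOSED once". The Java sketch below shows one way to express that. The class RecoveryWatcher, its parameters, and the idea of sampling the open flag and error percentage at a fixed interval are assumptions for illustration, not something Hystrix provides.

```java
/** Declares recovery only after several consecutive healthy samples. */
class RecoveryWatcher {
    private final int requiredHealthySamples;
    private final int maxErrorPercentage;
    private int healthyStreak = 0;

    RecoveryWatcher(int requiredHealthySamples, int maxErrorPercentage) {
        this.requiredHealthySamples = requiredHealthySamples;
        this.maxErrorPercentage = maxErrorPercentage;
    }

    /**
     * Feed one periodic sample of the breaker's metrics.
     * Returns true once the breaker has been closed, with a low error
     * percentage, for the required number of consecutive samples.
     */
    boolean observe(boolean circuitOpen, int errorPercentage) {
        if (circuitOpen || errorPercentage > maxErrorPercentage) {
            healthyStreak = 0;  // any unhealthy sample restarts the count
        } else {
            healthyStreak++;
        }
        return healthyStreak >= requiredHealthySamples;
    }
}
```

With new RecoveryWatcher(3, 5), a breaker that closes, reopens one sample later, and then closes again is only reported as recovered after three clean samples following the second close.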

The Next Error You’ll Hit

Once you’ve fixed the root cause and the breakers are consistently staying CLOSED, the next system-level issue you’re likely to hit is resource contention or saturation on the caller (UserService). While the breaker was OPEN, UserService was shielded from slow calls to ProfileService. With traffic flowing again, UserService may face a surge of retried or backlogged requests, which can cause resource problems of its own (thread pool exhaustion, memory pressure from queued requests).
