A circuit breaker doesn’t just trip and stay tripped; it’s designed to eventually let traffic through again, and setting when that happens is often done with a blunt instrument.
Let’s watch a circuit breaker in action. Imagine a service, product-catalog, that’s having trouble. We’ve got a client service, frontend, that calls product-catalog. If frontend has a circuit breaker configured for product-catalog, it’ll look something like this:
{
  "circuitBreakers": [
    {
      "name": "product-catalog-breaker",
      "type": "REACTIVE",
      "slidingWindow": {
        "size": "10s",
        "failureThreshold": 5,
        "waitDurationInOpenState": "30s",
        "permittedNumberOfCallsInHalfOpenState": 10
      },
      "thresholds": {
        "failureRate": 0.5
      }
    }
  ]
}
Here, frontend is configured to be wary of product-catalog. The slidingWindow defines a 10-second period. If product-catalog fails 5 times (failureThreshold) within those 10 seconds, the breaker trips to the OPEN state. While OPEN, frontend won’t even try to call product-catalog for 30 seconds (waitDurationInOpenState). After 30 seconds, it moves to HALF-OPEN. In HALF-OPEN, it allows 10 calls (permittedNumberOfCallsInHalfOpenState). If all 10 of those succeed, the breaker resets to CLOSED. If any of those 10 fail, it immediately goes back to OPEN for another 30 seconds.
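To make the count-based rule concrete, here is a minimal Python sketch of a time-based sliding window that trips after 5 failures within 10 seconds. The class name FailureWindow and its methods are hypothetical, chosen for illustration; a real library will track this differently.

```python
from collections import deque


class FailureWindow:
    """Hypothetical helper: tracks failure timestamps in a sliding time window."""

    def __init__(self, size_seconds: float = 10.0, failure_threshold: int = 5):
        self.size_seconds = size_seconds
        self.failure_threshold = failure_threshold
        self._failures = deque()  # timestamps of recent failures, oldest first

    def record_failure(self, now: float) -> bool:
        """Record a failure at time `now`; return True if the breaker should trip."""
        self._failures.append(now)
        self._evict(now)
        return len(self._failures) >= self.failure_threshold

    def _evict(self, now: float) -> None:
        # Drop failures that have aged out of the window.
        while self._failures and now - self._failures[0] >= self.size_seconds:
            self._failures.popleft()


# Five failures spread over 12 seconds: the first has aged out by the fifth.
spread_window = FailureWindow()
spread = [spread_window.record_failure(t) for t in (0, 3, 6, 9, 12)]

# Five failures within 4 seconds: the fifth one trips the breaker.
burst_window = FailureWindow()
burst = [burst_window.record_failure(t) for t in (0, 1, 2, 3, 4)]
```

Timestamps are passed in explicitly so the behavior is easy to reason about; in production code they would come from a monotonic clock.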
The thresholds.failureRate is where things get interesting. A rate of 0.5 means if 50% of the calls within the sliding window are failures, the breaker trips. This is a more sophisticated way to detect problems than just a raw count, especially if traffic volume fluctuates. A high absolute number of failures won’t trip the breaker as long as they remain a small share of total calls. Conversely, a few failures during a low-traffic period could still trip it.
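A quick worked example shows how the rate behaves under different traffic volumes. This is only a sketch: the function names are hypothetical, and it assumes the breaker trips when the rate is at or above the threshold, which your library may define differently.

```python
def failure_rate(failures: int, total_calls: int) -> float:
    """Fraction of calls in the window that failed (0.0 for an empty window)."""
    return failures / total_calls if total_calls else 0.0


def trips(failures: int, total_calls: int, rate_threshold: float = 0.5) -> bool:
    """Assumption: the breaker trips when the rate reaches the threshold."""
    return failure_rate(failures, total_calls) >= rate_threshold


# Busy period: 40 failures out of 1000 calls is a 4% failure rate.
busy_trips = trips(failures=40, total_calls=1000)

# Quiet period: 3 failures out of 5 calls is a 60% failure rate.
quiet_trips = trips(failures=3, total_calls=5)
```

The busy window has more than ten times as many failures as the quiet one, yet only the quiet window crosses the 0.5 threshold.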
The problem this whole mechanism solves is preventing a cascade of failures. If product-catalog is down, frontend hammering it with requests is just making things worse for product-catalog and also wasting resources on the frontend side. The circuit breaker acts as a shock absorber, giving the failing service time to recover and preventing the client from burning itself out.
Internally, the circuit breaker maintains a counter for successful and failed calls within its sliding window. When a call is made, it checks the state:
- CLOSED: If the failure rate (or count, depending on configuration) is below the threshold, the call proceeds. If it fails, the failure count increments. If it succeeds, the success count increments.
- OPEN: If the breaker is OPEN, the call is immediately rejected without even attempting to reach the downstream service. After the waitDurationInOpenState elapses, the state transitions to HALF-OPEN.
- HALF-OPEN: A limited number of calls are allowed through. If all these calls succeed, the breaker transitions back to CLOSED. If any call fails, it immediately reverts to OPEN.
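The three states above can be sketched as a small state machine. This is a simplified, single-threaded illustration, not any particular library’s implementation: the class, its parameters, and CircuitOpenError are hypothetical names, and the failure count here is a plain counter rather than a sliding window.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF-OPEN"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the downstream service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, wait_open_seconds=30.0,
                 half_open_calls=10, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.wait_open_seconds = wait_open_seconds
        self.half_open_calls = half_open_calls
        self.clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0
        self.half_open_successes = 0
        self.opened_at = 0.0

    def call(self, fn):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.wait_open_seconds:
                raise CircuitOpenError("breaker is OPEN")
            # The wait has elapsed: start letting trial calls through.
            self.state = State.HALF_OPEN
            self.half_open_successes = 0
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state is State.HALF_OPEN:
            self._open()  # any trial failure reopens the breaker
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self.half_open_successes += 1
            if self.half_open_successes >= self.half_open_calls:
                self.state = State.CLOSED
                self.failures = 0

    def _open(self):
        self.state = State.OPEN
        self.opened_at = self.clock()


# Usage with a fake clock so the example runs instantly.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, half_open_calls=2,
                         clock=lambda: now[0])


def failing_call():
    raise RuntimeError("product-catalog unavailable")


for _ in range(2):
    try:
        breaker.call(failing_call)
    except RuntimeError:
        pass
state_after_failures = breaker.state  # breaker has tripped

now[0] = 31.0  # move past the 30-second wait
breaker.call(lambda: "ok")  # first trial call
state_during_trial = breaker.state
breaker.call(lambda: "ok")  # second trial call closes the breaker
state_after_trial = breaker.state
```

Injecting the clock is a deliberate choice: it keeps the timing logic testable without sleeping for 30 seconds.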
The exact levers you control are the size of the window, the failureThreshold (for count-based tripping), the waitDurationInOpenState, the permittedNumberOfCallsInHalfOpenState, and the failureRate threshold. Tuning these is a delicate dance between being responsive to failures and allowing for transient blips.
What most people don’t realize is that some implementations calculate the failure rate over every call started within the window, not just the ones that actually completed or timed out. If you have a very slow downstream service, many calls might still be pending when the window closes. The implementation might count these pending calls as failures for the purpose of the failure rate calculation, tripping the breaker prematurely. This is why it’s crucial to understand the specific behavior of your circuit breaker library.
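To see how much this accounting choice can matter, here is a small sketch comparing the two ways of computing the rate. The function and its count_pending_as_failed flag are hypothetical, used only to contrast the two behaviors; check your library’s documentation for what it actually does.

```python
def window_failure_rate(failed: int, succeeded: int, pending: int,
                        count_pending_as_failed: bool) -> float:
    """Failure rate for one window under two hypothetical accounting rules."""
    completed = failed + succeeded
    if count_pending_as_failed:
        # Pending calls are added to both the failures and the total.
        numerator, denominator = failed + pending, completed + pending
    else:
        # Only completed calls are considered.
        numerator, denominator = failed, completed
    return numerator / denominator if denominator else 0.0


# A slow downstream: 2 failed, 8 succeeded, 10 still in flight at window close.
completed_only = window_failure_rate(2, 8, 10, count_pending_as_failed=False)
with_pending = window_failure_rate(2, 8, 10, count_pending_as_failed=True)
```

With a failureRate threshold of 0.5, the first calculation (2 of 10) stays well below the threshold, while the second (12 of 20) crosses it, even though only two calls actually failed.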
The next thing you’ll likely grapple with is how to monitor the circuit breaker’s state changes and use that information to trigger alerts or automated recovery actions.