The circuit breaker’s "half-open" state isn’t about giving the failing service a second chance; it’s a calculated risk to verify if the underlying issue has actually been resolved.
Imagine a system where service-a calls service-b. If service-b starts failing, service-a’s circuit breaker will trip to the "open" state, preventing further calls to service-b and immediately returning an error. This protects service-b from being overwhelmed and service-a from wasting resources on failing requests.
Now, service-b might recover. But how does service-a know? It can’t just blindly start sending requests again, or it might immediately trip the breaker again. This is where "half-open" comes in.
When the timeout for the "open" state elapses (let’s say 10s), the circuit breaker transitions to "half-open." In this state, it allows one request to service-b to pass through. This single request is the "recovery probe."
Here’s what happens with that probe request:
- Success: If this single request to service-b succeeds, the circuit breaker assumes service-b has recovered. It then closes the circuit, allowing normal traffic flow again.
- Failure: If this single request fails (either a timeout, a specific error code, or a connection error), the circuit breaker immediately trips back to the "open" state. The timeout for the "open" state starts again, and the cycle repeats.
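The open, half-open, closed cycle described above can be sketched as a small state machine. This is a minimal illustration, not any library's implementation: the class name `ProbeBreaker`, the consecutive-failure trip rule, and the injectable nanosecond clock are all assumptions made for the example, and the class is deliberately not thread-safe.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.function.LongSupplier;

/** Hypothetical single-probe breaker; single-threaded, for illustration only. */
class ProbeBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureLimit;    // consecutive failures that trip the breaker (assumed rule)
    private final long openNanos;      // how long to stay open before allowing a probe
    private final LongSupplier clock;  // nanosecond clock, injectable so tests can control time
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    ProbeBreaker(int failureLimit, Duration openTimeout, LongSupplier clock) {
        this.failureLimit = failureLimit;
        this.openNanos = openTimeout.toNanos();
        this.clock = clock;
    }

    State state() {
        return state;
    }

    <T> T call(Callable<T> action) throws Exception {
        if (state == State.OPEN) {
            if (clock.getAsLong() - openedAt < openNanos) {
                // Still inside the open timeout: fail fast without touching the service.
                throw new IllegalStateException("circuit open: call rejected");
            }
            // Timeout elapsed: this call becomes the recovery probe.
            state = State.HALF_OPEN;
        }
        try {
            T result = action.call();
            // Success (normal call or probe): clear the count and close the circuit.
            consecutiveFailures = 0;
            state = State.CLOSED;
            return result;
        } catch (Exception e) {
            consecutiveFailures++;
            // A failed probe reopens immediately; otherwise trip once the limit is hit.
            if (state == State.HALF_OPEN || consecutiveFailures >= failureLimit) {
                state = State.OPEN;
                openedAt = clock.getAsLong(); // restart the open timeout
            }
            throw e;
        }
    }
}
```

Because the sketch is single-threaded, only one call can ever be in flight while the breaker is half-open. A concurrent version would need an explicit guard so that exactly one caller is admitted as the probe.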
This mechanism is crucial for resilience. It prevents a briefly degraded service from being hammered by requests, but also ensures that as soon as the service might be healthy, we test that hypothesis quickly and efficiently without a full-blown traffic surge.
Let’s look at a common implementation, like Resilience4j in Java.
Consider this configuration snippet:
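```java
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;

CircuitBreakerConfig circuitBreakerConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                        // Trip when the failure rate reaches 50%
    .minimumNumberOfCalls(10)                        // Evaluate the rate after 10 calls (default is 100)
    .waitDurationInOpenState(Duration.ofSeconds(5))  // Stay open for 5 seconds
    .permittedNumberOfCallsInHalfOpenState(1)        // Allow only 1 call in half-open
    .build();
```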
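Resilience4j evaluates the failure rate only once at least minimumNumberOfCalls calls have been recorded (100 by default), so this walk-through needs that setting to be 10 or lower. If service-a then makes 10 calls to service-b and 7 of them fail, the failure rate (70%) is at or above the failureRateThreshold (50%), and the circuit breaker trips to "open."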
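The trip decision in this walk-through is plain arithmetic over a window of recent outcomes. The sketch below shows one way to compute it with a count-based window. `FailureRateWindow`, its method names, and the "-1 means not enough data" convention are assumptions for illustration and are not taken from Resilience4j's source.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical count-based sliding window; not thread-safe. */
class FailureRateWindow {
    private final int windowSize;    // how many recent calls are remembered
    private final int minimumCalls;  // calls required before a rate is reported
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failure
    private int failures = 0;

    FailureRateWindow(int windowSize, int minimumCalls) {
        this.windowSize = windowSize;
        this.minimumCalls = minimumCalls;
    }

    void record(boolean failed) {
        // Evict the oldest outcome once the window is full.
        if (outcomes.size() == windowSize && outcomes.removeFirst()) {
            failures--;
        }
        outcomes.addLast(failed);
        if (failed) {
            failures++;
        }
    }

    /** Failure rate in percent, or -1 while too few calls have been recorded. */
    double failureRatePercent() {
        if (outcomes.size() < minimumCalls) {
            return -1.0;
        }
        return 100.0 * failures / outcomes.size();
    }

    /** True when enough calls were seen and the rate is at or above the threshold. */
    boolean shouldTrip(double thresholdPercent) {
        double rate = failureRatePercent();
        return rate >= 0 && rate >= thresholdPercent;
    }
}
```

With a window and minimum of 10 calls, 7 failures out of 10 give a rate of 70%, which trips a 50% threshold. With a minimum of 100 calls, the same 10 outcomes produce no rate at all, so the breaker would stay closed.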
For the next 5 seconds, any call to service-b via this circuit breaker will immediately fail with a CallNotPermittedException.
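Once the 5 seconds have elapsed, the circuit breaker moves to "half-open." By default Resilience4j makes this transition lazily, when the next call arrives, unless automaticTransitionFromOpenToHalfOpenEnabled is turned on. Either way, the next call that goes through this circuit breaker is allowed to proceed to service-b.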
If that single call succeeds, the circuit breaker is happy. It resets its failure count and transitions back to "closed," allowing all subsequent calls.
If that single call fails, the circuit breaker immediately reverts to "open," and the 5-second waitDurationInOpenState timer restarts. The next probe won’t happen for another 5 seconds.
The permittedNumberOfCallsInHalfOpenState is key. Setting it to 1 is the most conservative choice (Resilience4j's default is 10). It means we are performing a single, isolated test of the downstream service’s health. If you were to set it to, say, 5, the breaker would wait for all 5 probes to complete and compare their failure rate against failureRateThreshold, so 5 successes would be a stronger signal of recovery than 1. However, it also means a recovering service could still be overloaded by those 5 concurrent probes if it’s only just starting to stabilize.
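To make the trade-off concrete, here is a sketch of a half-open phase that admits a fixed number of probes and decides only once all of them have completed. `HalfOpenTrial` and its methods are hypothetical names for this example. The decision rule (reopen when the probe failure rate is at or above the threshold) is an assumption that mirrors the configuration discussed above.

```java
/** Hypothetical half-open trial: admits a fixed number of probes, then decides. */
class HalfOpenTrial {
    enum Decision { PENDING, CLOSE, REOPEN }

    private final int permittedCalls;             // probes allowed while half-open
    private final double failureThresholdPercent; // same threshold used when closed
    private int admitted = 0;
    private int completed = 0;
    private int failed = 0;

    HalfOpenTrial(int permittedCalls, double failureThresholdPercent) {
        this.permittedCalls = permittedCalls;
        this.failureThresholdPercent = failureThresholdPercent;
    }

    /** Returns true if the caller may send a probe; extra callers are rejected. */
    boolean tryAcquire() {
        if (admitted >= permittedCalls) {
            return false;
        }
        admitted++;
        return true;
    }

    /** Records one probe outcome and decides once every permitted probe has finished. */
    Decision onResult(boolean success) {
        completed++;
        if (!success) {
            failed++;
        }
        if (completed < permittedCalls) {
            return Decision.PENDING;
        }
        double rate = 100.0 * failed / completed;
        return rate >= failureThresholdPercent ? Decision.REOPEN : Decision.CLOSE;
    }
}
```

With one permitted call, the first result decides everything: a success closes the circuit and a failure reopens it. With five, a single failed probe (20%) still closes the circuit under a 50% threshold, at the cost of sending five requests to a service that may only just be recovering.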
The "half-open" state is a transient, experimental phase. It’s not about resuming full service but about cautiously testing the hypothesis that service-b has recovered. The entire point is to minimize the blast radius of a failure while staying sensitive to recovery.
The next logical step after understanding how the circuit breaker recovers is to consider how to prevent it from tripping in the first place, which often leads into topics like request retries and load shedding.