A sliding window circuit breaker doesn’t actually track the number of failures; it tracks the rate of failures within a rolling time interval.

Let’s watch one in action. Imagine a microservice called user-service that handles user profile lookups. We want to protect it from cascading failures if it starts returning errors.

{
  "service": "user-service",
  "circuitBreaker": {
    "failureRateThreshold": 0.5,  // If 50% of requests fail in the window
    "windowSize": "10s",         // The rolling time window
    "minimumNumberOfCalls": 5,   // Must have at least 5 calls to trigger
    "permittedNumberOfCallsInHalfOpenState": 3,
    "status": "CLOSED"
  }
}

Here’s a sequence of events over 15 seconds:

  • 0-5s: 4 successful calls to user-service. The window holds 4 calls, fewer than minimumNumberOfCalls (5), so the failure rate is not evaluated. status remains CLOSED.
  • 5-10s: 3 successful calls, 2 failed calls. At 10s the window covers 0s-10s and holds 4 + 5 = 9 calls, 2 of them failures. Failure rate: 2/9 ≈ 0.22. status remains CLOSED.
  • 10-15s: 1 successful call, 4 failed calls. The window keeps sliding, so only the last 10 seconds count. At 15s the window covers 5s-15s, and the 4 calls from 0-5s have aged out.
    • Successes from 5s to 15s: 3 (5-10s) + 1 (10-15s) = 4.
    • Failures from 5s to 15s: 2 (5-10s) + 4 (10-15s) = 6.
    • Total calls in the window: 4 successful + 6 failed = 10.
    • Failure rate: 6/10 = 0.6.
    • Since 0.6 > failureRateThreshold (0.5) and minimumNumberOfCalls (5) has been met, the status changes to OPEN. Subsequent calls to user-service immediately return a fallback error without attempting to call the service.
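
To double-check the arithmetic in the timeline, here is a small Python replay. It is only a sketch: the exact timestamps inside each 5-second slice are invented for illustration, and window_stats and should_open are hypothetical helper names, not part of any library.

```python
# Replay of the 15-second timeline. Timestamps within each slice are assumed.
WINDOW_SIZE = 10.0              # seconds, from "windowSize": "10s"
FAILURE_RATE_THRESHOLD = 0.5
MINIMUM_NUMBER_OF_CALLS = 5

# Each entry is (timestamp_seconds, succeeded).
calls = (
    [(t, True) for t in (1.0, 2.0, 3.0, 4.0)]          # 0-5s: 4 successes
    + [(t, True) for t in (5.5, 6.5, 7.5)]             # 5-10s: 3 successes
    + [(t, False) for t in (8.5, 9.5)]                 # 5-10s: 2 failures
    + [(11.0, True)]                                   # 10-15s: 1 success
    + [(t, False) for t in (12.0, 13.0, 14.0, 14.5)]   # 10-15s: 4 failures
)

def window_stats(now):
    """Return (total_calls, failures) for calls in the last WINDOW_SIZE seconds."""
    in_window = [ok for ts, ok in calls if now - WINDOW_SIZE < ts <= now]
    return len(in_window), in_window.count(False)

def should_open(now):
    """Apply the minimum-calls guard, then compare the rate to the threshold."""
    total, failures = window_stats(now)
    if total < MINIMUM_NUMBER_OF_CALLS:
        return False
    return failures / total > FAILURE_RATE_THRESHOLD

for now in (5.0, 10.0, 15.0):
    total, failures = window_stats(now)
    print(f"t={now}: total={total}, failures={failures}, open={should_open(now)}")
```

With these assumed timestamps, the window at 10s holds 9 calls with 2 failures, and the window at 15s holds 10 calls with 6 failures, matching the figures above.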

The core problem this solves is preventing a struggling downstream service from being overwhelmed. When a service starts failing, the circuit breaker "opens," stopping traffic to it. This gives the failing service a chance to recover.

Internally, the breaker maintains a sliding window of time. For each request, it records its success or failure and a timestamp. When calculating the failure rate, it discards any records older than windowSize and then computes failures / total_calls from the remaining records.
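
The bookkeeping described above can be sketched in Python roughly as follows. This is a minimal illustration, not how any particular library implements it: the class name SlidingWindowBreaker and its methods are hypothetical, timestamps are passed in explicitly (and assumed non-decreasing) so the behavior is easy to reason about, and the HALF_OPEN state is left out.

```python
from collections import deque

class SlidingWindowBreaker:
    """Illustrative time-based sliding window breaker (CLOSED/OPEN only)."""

    def __init__(self, failure_rate_threshold=0.5, window_size=10.0,
                 minimum_number_of_calls=5):
        self.failure_rate_threshold = failure_rate_threshold
        self.window_size = window_size                # seconds
        self.minimum_number_of_calls = minimum_number_of_calls
        self._records = deque()                       # (timestamp, succeeded), oldest first
        self.status = "CLOSED"

    def _evict(self, now):
        # Discard records older than window_size.
        cutoff = now - self.window_size
        while self._records and self._records[0][0] <= cutoff:
            self._records.popleft()

    def failure_rate(self, now):
        # failures / total_calls over the records still inside the window.
        self._evict(now)
        if not self._records:
            return 0.0
        failures = sum(1 for _, ok in self._records if not ok)
        return failures / len(self._records)

    def allow_request(self):
        # While OPEN, callers should skip the service and use a fallback.
        return self.status != "OPEN"

    def record(self, succeeded, now):
        # Store the outcome, then re-evaluate the window.
        self._records.append((now, succeeded))
        rate = self.failure_rate(now)
        if (self.status == "CLOSED"
                and len(self._records) >= self.minimum_number_of_calls
                and rate > self.failure_rate_threshold):
            self.status = "OPEN"
        return self.status
```

In real use, the `now` argument would typically come from a monotonic clock such as `time.monotonic()`. A deque fits here because records arrive in time order, so expired entries are always at the left end.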

The failureRateThreshold is the fraction of calls in the window that must fail before the breaker opens (0.5 means 50%). windowSize determines how far back the breaker looks. minimumNumberOfCalls prevents the breaker from opening based on a few unlucky requests at low traffic. permittedNumberOfCallsInHalfOpenState controls how many trial requests are allowed through in the half-open state, when the breaker is "testing the waters" to see whether the service has recovered.

The most surprising thing is how windowSize and minimumNumberOfCalls interact to smooth out transient spikes. With a windowSize of 60s and minimumNumberOfCalls of 100, a brief 5-second burst of 20 failures won’t trip the breaker on a busy service because, within the 60-second window, those 20 failures are a small fraction of the total calls. On a quiet service the same burst won’t trip it either: if the window holds fewer than 100 calls, the failure rate is never evaluated. The breaker opens only when failures persist long enough to exceed the threshold across at least 100 calls in the window.
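
A few lines of Python make this interaction concrete. The traffic numbers below are hypothetical, the 0.5 threshold is assumed to carry over from the earlier config, and would_trip is an illustrative helper rather than a library function.

```python
FAILURE_RATE_THRESHOLD = 0.5     # assumed: same threshold as the earlier config
MINIMUM_NUMBER_OF_CALLS = 100

def would_trip(failures, total_calls):
    """Decide whether a window with these counts opens the breaker."""
    if total_calls < MINIMUM_NUMBER_OF_CALLS:
        return False             # too few calls: the rate is not evaluated
    return failures / total_calls > FAILURE_RATE_THRESHOLD

# (failures, total calls) inside one 60-second window; figures are made up.
scenarios = {
    "busy service, short burst": (20, 600),
    "quiet service, short burst": (20, 40),
    "sustained failures": (60, 100),
}

for name, (failures, total) in scenarios.items():
    print(f"{name}: trips={would_trip(failures, total)}")
```

Only the sustained case opens the breaker: the burst is diluted on the busy service and ignored on the quiet one because the minimum call count is not reached.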

The next thing you’ll grapple with is how to implement effective fallback strategies when the circuit breaker is OPEN.
