Circuit breakers are a critical resilience pattern, but their settings are often treated as magic numbers rather than tunable parameters.

Here’s what happens when a service is overloaded, using a simplified HTTP client circuit breaker as an example:

{
  "request": {
    "method": "GET",
    "url": "http://backend-service:8080/data",
    "headers": {
      "User-Agent": "my-frontend-app/1.0",
      "X-Request-ID": "abc123xyz"
    }
  },
  "response": {
    "status": 503,
    "body": "Service Unavailable",
    "headers": {
      "Retry-After": "30"
    }
  },
  "timestamp": "2023-10-27T10:00:01Z",
  "duration_ms": 5100
}

The backend-service took 5.1 seconds to respond and returned a 503 Service Unavailable status. If this happens repeatedly, the circuit breaker, configured to trip after a certain number of failures or a high failure rate, will open. When open, it immediately rejects requests to backend-service without even attempting to send them. This prevents the overloaded service from receiving more traffic it can’t handle, and it allows the calling service to fail fast instead of waiting for a timeout. After a configured resetTimeout, the breaker will enter a half-open state, allowing a limited number of requests through to test if the backend has recovered.
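The CLOSED, OPEN and HALF-OPEN cycle described above can be sketched in a few lines. This is a minimal illustration, not the implementation used by any particular library: the class name CircuitBreaker, the CircuitOpenError exception and the injectable clock argument are assumptions made for the example, it trips on consecutive failures only, it allows a single trial call when half-open, and it is not thread-safe.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Hypothetical consecutive-failure breaker, for illustration only."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], object]) -> object:
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: the backend is never contacted.
                raise CircuitOpenError("circuit is open")
            # resetTimeout has elapsed: let one trial call through.
            self.state = "HALF_OPEN"
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self) -> None:
        self.failures += 1
        # A failed trial call reopens the circuit immediately.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()

    def _on_success(self) -> None:
        self.failures = 0
        self.state = "CLOSED"
```

A caller would wrap each outbound request, for example breaker.call(lambda: fetch_data()), where fetch_data stands in for whatever function issues the HTTP request and raises on a 503 or a timeout.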

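Let’s consider the key configuration parameters for a typical circuit breaker implementation (like Resilience4j or Hystrix). The names below are generic and each library spells them differently; Resilience4j, for instance, calls the resetTimeout setting waitDurationInOpenState.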

  • failureThreshold: The number of consecutive failures that will cause the circuit breaker to trip from CLOSED to OPEN.
    • Why it matters: Too low, and transient network blips will cause unnecessary outages. Too high, and you’ll overwhelm the backend before the breaker trips.
    • Tuning: Observe your system’s error rates under normal and peak load. Because this setting counts consecutive failures, a single success resets the count. If you see occasional 5xx errors during normal operation that resolve quickly, set this higher than the longest run of back-to-back transient errors you expect. For example, if transient blips rarely produce more than 5 failures in a row, a failureThreshold of 10 might be reasonable.
  • slidingWindowSize: The duration (or number of calls) over which failure rate is calculated.
    • Why it matters: Determines how quickly the circuit breaker reacts to a sustained period of errors versus short bursts.
    • Tuning: This should align with your expected recovery time and the latency of your dependencies. If a backend service typically recovers from an overload within 30 seconds, a slidingWindowSize of 60 seconds makes sense. If it’s a very fast-recovering service, a smaller window (e.g., 10 seconds) might be better. If the window is too small, it might be overly sensitive to brief spikes. If it’s too large, it might take too long to detect a persistent problem.
  • failureRateThreshold: The percentage of failures within the slidingWindowSize that will cause the circuit breaker to trip.
    • Why it matters: This is often more robust than a fixed failureThreshold because it adapts to varying traffic volumes. The same percentage corresponds to a handful of failures at low volume and to hundreds at peak load, so the breaker reacts to the proportion of failing requests rather than to a raw count.
    • Tuning: Set this based on your acceptable error budget. If you can tolerate a 5% error rate for a dependency over a minute, set it to 5. If you need near-perfect reliability, you might set it to 1 or 2, but a threshold that low trips on very few failures unless the window holds a large number of calls. Monitor your error logs to understand what constitutes a "problematic" error rate.
  • slowCallRateThreshold: The percentage of slow calls within the slidingWindowSize that will cause the circuit breaker to trip.
    • Why it matters: High latency can be as damaging as outright errors, leading to thread exhaustion and cascading failures.
    • Tuning: Define "slow" based on your SLOs (Service Level Objectives). If your backend should respond in under 1 second 99% of the time, set your slowCallDurationThreshold to something like 1000ms. Then, set slowCallRateThreshold to a low percentage, like 5% or 10%. This will trip the breaker when a significant portion of requests exceed your latency targets.
  • slowCallDurationThreshold: The duration above which a call is considered "slow."
    • Why it matters: This directly defines what "slow" means for your system.
    • Tuning: Set this slightly above your expected P95 or P99 latency for the dependency under normal load. If your SLO is 500ms P99, setting this to 750ms or 1000ms is a good starting point.
  • permittedNumberOfCallsInHalfOpenState: The number of requests allowed through when the circuit breaker is in the HALF-OPEN state.
    • Why it matters: Controls how aggressively you test for recovery without overwhelming the potentially still-recovering backend.
    • Tuning: This should be small enough to not cause a relapse if the backend is still struggling. A value of 1 or 5 is common. If you have very low traffic, you might need a slightly higher value to get a representative sample.
  • resetTimeout: The duration the circuit breaker stays OPEN before transitioning to HALF-OPEN.
    • Why it matters: Determines how long clients are blocked before attempting to check if the backend is back online.
    • Tuning: This should be long enough for the backend to reasonably be expected to recover from an overload or restart. If your backend takes 5 minutes to redeploy or recover from a database issue, set this to 300s. If it’s a quick-restarting microservice, 30s or 60s might suffice.
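To make the parameters above concrete, here is a sketch that groups them into one validated settings object. The BreakerSettings class and its snake_case field names are hypothetical and do not match the API of Resilience4j, Hystrix or any other library. The default values are simply the example figures quoted in the list, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerSettings:
    """Hypothetical settings holder; field names mirror the parameters above."""
    failure_rate_threshold: float = 50.0         # percent of failed calls
    slow_call_rate_threshold: float = 10.0       # percent of slow calls
    slow_call_duration_threshold_ms: int = 1000  # what counts as "slow"
    sliding_window_size_s: int = 60              # time-based window, in seconds
    permitted_calls_in_half_open: int = 5        # trial calls while HALF-OPEN
    reset_timeout_s: int = 30                    # time spent OPEN, in seconds

    def __post_init__(self) -> None:
        # Percentages must fall in (0, 100].
        for name in ("failure_rate_threshold", "slow_call_rate_threshold"):
            value = getattr(self, name)
            if not 0 < value <= 100:
                raise ValueError(f"{name} must be in (0, 100], got {value}")
        # Durations and counts must be strictly positive.
        for name in ("slow_call_duration_threshold_ms", "sliding_window_size_s",
                     "permitted_calls_in_half_open", "reset_timeout_s"):
            value = getattr(self, name)
            if value <= 0:
                raise ValueError(f"{name} must be positive, got {value}")


# Two starting points based on the examples in the list above.
quick_restart_service = BreakerSettings(reset_timeout_s=30, sliding_window_size_s=10)
slow_recovering_service = BreakerSettings(reset_timeout_s=300, sliding_window_size_s=60)
```

Validating at construction time means a typo such as a 500% threshold fails at startup rather than silently producing a breaker that never trips.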

Consider your failureRateThreshold and slidingWindowSize together. If your slidingWindowSize is 60 seconds and your failureRateThreshold is 50%, the breaker trips once at least half of the requests within that minute have failed. If your traffic volume is low, say 10 requests per minute, 5 failures will trip it. If your traffic volume is high, say 1000 requests per minute, it takes 500 failures. This rate-based approach is generally more adaptable than a fixed failureThreshold. At low volume, however, a handful of failures decides the outcome, which is why many implementations also require a minimum number of calls in the window before they evaluate the rate.
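The arithmetic in that paragraph can be checked with a small time-based sliding window. This is a sketch under stated assumptions: the SlidingWindowFailureRate class is hypothetical, timestamps are plain seconds supplied by the caller in non-decreasing order, and the minimum_calls guard is an illustrative addition rather than one of the parameters listed earlier.

```python
from collections import deque
from typing import Deque, Tuple


class SlidingWindowFailureRate:
    """Hypothetical time-based window tracking the failure percentage."""

    def __init__(self, window_s: float, threshold_pct: float, minimum_calls: int = 1):
        self.window_s = window_s
        self.threshold_pct = threshold_pct
        self.minimum_calls = minimum_calls  # illustrative low-traffic guard
        self._calls: Deque[Tuple[float, bool]] = deque()

    def record(self, timestamp: float, failed: bool) -> None:
        # Assumes timestamps arrive in non-decreasing order.
        self._calls.append((timestamp, failed))

    def _evict(self, now: float) -> None:
        # Drop calls that have aged out of the window.
        while self._calls and self._calls[0][0] <= now - self.window_s:
            self._calls.popleft()

    def failure_rate(self, now: float) -> float:
        self._evict(now)
        if not self._calls:
            return 0.0
        failures = sum(1 for _, failed in self._calls if failed)
        return 100.0 * failures / len(self._calls)

    def should_trip(self, now: float) -> bool:
        self._evict(now)
        if len(self._calls) < self.minimum_calls:
            return False  # too few samples to judge
        return self.failure_rate(now) >= self.threshold_pct


# Low traffic: 10 calls in the window, 5 of them failed.
low = SlidingWindowFailureRate(window_s=60, threshold_pct=50)
for i in range(10):
    low.record(float(i), failed=i < 5)

# High traffic: 1000 calls in the window, 500 of them failed.
high = SlidingWindowFailureRate(window_s=60, threshold_pct=50)
for i in range(1000):
    high.record(i * 0.05, failed=i < 500)
```

Both windows report a 50% failure rate and would trip. Constructing the low-traffic window with minimum_calls=20 would keep it closed until more calls had been observed.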

The most critical insight is that these settings should not be static. They should be derived from your observed production traffic patterns, error budgets, and SLOs. Regularly review metrics like requests per second, error rates, and latency percentiles for your dependencies to inform these adjustments.
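One way to turn observed latency into a setting is to compute a percentile from recent samples and add headroom. The sketch below assumes you can export per-request latencies in milliseconds for a dependency. The function names, the nearest-rank percentile method and the 1.5x headroom factor (which maps a 500ms P99 to 750ms, matching the earlier example) are illustrative choices, not a prescribed method.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample list."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def suggest_slow_call_threshold_ms(latencies_ms: Sequence[float],
                                   pct: float = 99.0,
                                   headroom: float = 1.5) -> int:
    """Suggest a slowCallDurationThreshold slightly above the observed percentile."""
    return math.ceil(percentile(latencies_ms, pct) * headroom)


# Illustrative samples: 98 fast calls, one at 500 ms, one outlier at 2000 ms.
observed_ms = [100.0] * 98 + [500.0, 2000.0]
suggested_ms = suggest_slow_call_threshold_ms(observed_ms)
```

Rerunning a calculation like this against fresh metrics on a regular schedule is one way to keep the threshold from going stale as traffic patterns change.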

Once your circuit breakers are tuned, the next challenge is managing the behavior of downstream services that consume your circuit-broken responses.
