Circuit breakers, when implemented incorrectly, can actually destabilize your system rather than protect it.
Let’s look at a common scenario: a service, user-service, depends on auth-service. When auth-service starts failing, user-service is supposed to trip its circuit breaker and stop calling auth-service for a while, preventing cascading failures.
// Example of a basic circuit breaker
CircuitBreaker authCircuitBreaker = CircuitBreaker.ofDefaults("authService");
try {
    String authResult = authCircuitBreaker.executeCallable(() -> {
        // This is the call to auth-service
        return restTemplate.getForObject("http://auth-service/authenticate", String.class);
    });
    // Process authResult
} catch (Exception e) {
    // Handle circuit breaker tripped or auth-service error
    log.error("Authentication failed", e);
}
This seems straightforward, but several common mistakes turn this safety net into a tripwire.
The "Too Aggressive" Timeout Antipattern
What it is: Setting the circuit breaker’s timeout for the downstream call to be shorter than the actual network or application timeout of the downstream service.
Why it’s bad: The circuit breaker will trip even when the downstream service could have eventually responded. It’s like slamming the brakes on a car that was just about to clear the intersection.
Diagnosis:
- Check the timeout applied to the downstream call. In Resilience4j, this is timeoutDuration, which is configured on the TimeLimiter wrapping the call rather than on the circuit breaker itself.
- Examine the downstream service’s own configuration for its request timeout (e.g., spring.mvc.async.request-timeout in Spring Boot, or connection timeouts in the HTTP client library used by the downstream service).
- Use network monitoring tools (like Wireshark or tcpdump) to observe actual request/response times under load.
Fix: Increase the timeoutDuration to be at least equal to, and preferably slightly longer than, the downstream service’s timeout. For example, if auth-service has a 5-second timeout, set the caller’s timeout to 6 seconds. In Resilience4j this setting belongs to the TimeLimiter that wraps the call; CircuitBreakerConfig has no timeoutDuration.
// Resilience4j configuration example
TimeLimiterConfig timeLimiterConfig = TimeLimiterConfig.custom()
    .timeoutDuration(Duration.ofSeconds(6)) // Increased from default 1s to 6s
    .build();
TimeLimiter authTimeLimiter = TimeLimiter.of("authService", timeLimiterConfig);
Why it works: This allows the downstream call enough time to complete or fail with its own configured timeout, giving the circuit breaker a more accurate picture of the downstream service’s actual availability.
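To make the mismatch concrete, here is a minimal plain-Java sketch that does not use Resilience4j at all. The class and method names (TimeoutMismatchSketch, slowCall, callerSeesFailure) are made up for illustration, and the latencies are shortened to milliseconds so it runs quickly. It shows the same healthy-but-slow call being counted as a failure or a success depending only on the caller's timeout.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutMismatchSketch {

    /** Simulates a downstream call that answers successfully after the given latency. */
    static CompletableFuture<String> slowCall(Duration latency) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(latency.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            return "ok";
        });
    }

    /** Returns true if the caller gives up before the (successful) response arrives. */
    static boolean callerSeesFailure(Duration latency, Duration callerTimeout) {
        try {
            slowCall(latency).get(callerTimeout.toMillis(), TimeUnit.MILLISECONDS);
            return false; // response arrived in time
        } catch (TimeoutException e) {
            return true;  // a breaker wrapping this call would record a failure
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Duration latency = Duration.ofMillis(300); // downstream is slow but healthy
        System.out.println("timeout 100ms  -> failure? "
                + callerSeesFailure(latency, Duration.ofMillis(100)));
        System.out.println("timeout 2000ms -> failure? "
                + callerSeesFailure(latency, Duration.ofMillis(2000)));
    }
}
```

With the short timeout, every one of these calls would feed the breaker's failure rate even though the downstream service answered each of them.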
The "Too Quick to Re-open" Antipattern
What it is: Configuring the circuit breaker’s waitDuration (waitDurationInOpenState in Resilience4j: the time it stays open before attempting a half-open state) to be too short.
Why it’s bad: The circuit breaker moves to half-open and starts sending trial traffic to a struggling service before that service has had sufficient time to recover. The trial calls fail, the circuit opens again, and the cycle repeats, producing a thrashing effect.
Diagnosis:
- Inspect the circuit breaker’s waitDuration setting (waitDurationInOpenState in Resilience4j).
- Observe in logs how often the circuit breaker moves from CLOSED to OPEN, then to HALF_OPEN, and potentially back to OPEN again.
Fix: Increase the waitDuration. A common starting point is 30 seconds to 1 minute, but this should be tuned based on the expected recovery time of the downstream service. If the downstream service takes minutes to restart or stabilize, the waitDuration should reflect that.
// Resilience4j configuration example
CircuitBreakerConfig circuitBreakerConfig = CircuitBreakerConfig.custom()
    .waitDurationInOpenState(Duration.ofSeconds(30)) // e.g. raised from a too-short 5s override (Resilience4j's own default is 60s)
    .build();
CircuitBreaker authCircuitBreaker = CircuitBreaker.of("authService", circuitBreakerConfig);
Why it works: It gives the downstream service a longer period to recover and stabilize before the circuit breaker allows even a single request to test its health.
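The thrashing effect can be estimated with simple arithmetic. The following is a deterministic sketch in plain Java, not Resilience4j code. It assumes a simplified breaker that opens at t = 0, sends exactly one trial call each time the wait duration elapses, and closes on the first successful trial. The name failedProbes and the 60-second recovery time are illustrative assumptions.

```java
final class WaitDurationSketch {

    /**
     * Counts how many half-open trial calls fail (each one an
     * OPEN -> HALF_OPEN -> OPEN cycle) before the breaker can close.
     *
     * @param waitSeconds     how long the breaker stays OPEN before each trial
     * @param recoverySeconds how long the downstream service needs to recover
     */
    static int failedProbes(long waitSeconds, long recoverySeconds) {
        if (waitSeconds <= 0) {
            throw new IllegalArgumentException("waitSeconds must be positive");
        }
        int failed = 0;
        long now = 0; // the breaker opens at t = 0
        while (true) {
            now += waitSeconds;                        // stay OPEN for the wait duration
            boolean probeSucceeds = now >= recoverySeconds;
            if (probeSucceeds) {
                return failed;                         // breaker closes again
            }
            failed++;                                  // trial failed: back to OPEN
        }
    }

    public static void main(String[] args) {
        System.out.println(failedProbes(5, 60));  // prints 11
        System.out.println(failedProbes(30, 60)); // prints 1
        System.out.println(failedProbes(60, 60)); // prints 0
    }
}
```

Under these assumptions, a 5-second wait hits a service that needs a minute to recover with eleven failed trial calls, while a 30-second wait produces only one.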
The "Ignoring Failure Rate Threshold" Antipattern
What it is: Setting the failure rate threshold too high (e.g., 90% or 100%) or not understanding how it’s calculated.
Why it’s bad: The circuit breaker will only trip after an overwhelming number of requests have already failed. This means significant traffic has already hit the struggling service, increasing the likelihood of cascading failures and user impact.
Diagnosis:
- Check the circuit breaker’s failureRateThreshold configuration.
- Understand that the failure rate is often calculated over a sliding window of recent calls (e.g., the last 100 calls). If this window is too large, the threshold might not be reached quickly.
Fix: Lower the failureRateThreshold. A common starting point is 50%. Ensure the sliding window size (slidingWindowSize) is appropriate for your traffic patterns.
// Resilience4j configuration example
CircuitBreakerConfig circuitBreakerConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(50.0f) // Trips at a 50% failure rate (the Resilience4j default, but often misconfigured higher)
    .slidingWindowSize(10) // Smaller window to react faster to bursts
    .minimumNumberOfCalls(10) // Evaluate the failure rate once 10 calls have been recorded
    .build();
CircuitBreaker authCircuitBreaker = CircuitBreaker.of("authService", circuitBreakerConfig);
Why it works: A lower threshold and a smaller sliding window allow the circuit breaker to detect problems and open the circuit much earlier, protecting the downstream service and preventing widespread user impact.
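How much traffic gets through before the breaker trips follows directly from the window size and the threshold. Below is a minimal count-based sliding window in plain Java. It is a sketch of the general technique, not Resilience4j's implementation. It assumes the rate is only evaluated once the window is full and that the breaker trips when the rate is greater than or equal to the threshold. The names SlidingWindowSketch and failuresUntilTrip are hypothetical.

```java
final class SlidingWindowSketch {
    private final boolean[] failed;       // ring buffer of recent outcomes, true = failed call
    private final float thresholdPercent;
    private int recorded;                 // total calls recorded so far
    private int failureCount;             // failures currently inside the window

    SlidingWindowSketch(int windowSize, float thresholdPercent) {
        if (windowSize <= 0 || thresholdPercent <= 0 || thresholdPercent > 100) {
            throw new IllegalArgumentException("need windowSize > 0 and 0 < threshold <= 100");
        }
        this.failed = new boolean[windowSize];
        this.thresholdPercent = thresholdPercent;
    }

    /** Records one outcome; returns true once the failure rate reaches the threshold. */
    boolean record(boolean callFailed) {
        int slot = recorded % failed.length;
        if (recorded >= failed.length && failed[slot]) {
            failureCount--;               // the oldest outcome leaves the window
        }
        failed[slot] = callFailed;
        if (callFailed) {
            failureCount++;
        }
        recorded++;
        // Assumption: the rate is only evaluated once the window is full.
        return recorded >= failed.length
                && 100.0f * failureCount / failed.length >= thresholdPercent;
    }

    /** Fills the window with successes, then counts the failed calls needed to trip. */
    static int failuresUntilTrip(int windowSize, float thresholdPercent) {
        SlidingWindowSketch window = new SlidingWindowSketch(windowSize, thresholdPercent);
        for (int i = 0; i < windowSize; i++) {
            window.record(false);
        }
        int failures = 1;
        while (!window.record(true)) {
            failures++;
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println(failuresUntilTrip(100, 90.0f)); // prints 90
        System.out.println(failuresUntilTrip(100, 50.0f)); // prints 50
        System.out.println(failuresUntilTrip(10, 50.0f));  // prints 5
    }
}
```

In this model, a previously healthy service that starts failing every call absorbs 90 failed requests before a 90% threshold over 100 calls trips, but only 5 with a 50% threshold over 10 calls. The trade-off is that a small window is also more sensitive to short, harmless error bursts.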
The "Not Recording Enough Exceptions" Antipattern
What it is: The circuit breaker is configured to only consider specific exceptions (e.g., IOException) as failures, ignoring others that also indicate downstream unavailability (e.g., TimeoutException, HttpServerErrorException).
Why it’s bad: The circuit breaker won’t trip when the downstream service returns HTTP 5xx errors or specific client-side errors that still represent a failure.
Diagnosis:
- Examine the circuit breaker’s configuration for recordExceptions or ignoreExceptions.
- Check logs for downstream service errors that are not causing the circuit breaker to trip.
Fix: Ensure the circuit breaker is configured to record all exceptions that indicate a service failure. This often means including common HTTP error exceptions and network-related exceptions.
// Resilience4j configuration example
CircuitBreakerConfig circuitBreakerConfig = CircuitBreakerConfig.custom()
    .recordExceptions(
        IOException.class,
        TimeoutException.class,          // java.util.concurrent.TimeoutException
        ResourceAccessException.class,   // RestTemplate wraps I/O errors such as connect and read timeouts in this
        HttpStatusCodeException.class    // Spring's base class for HTTP 4xx and 5xx errors; use HttpServerErrorException to count only 5xx
    )
    .build();
CircuitBreaker authCircuitBreaker = CircuitBreaker.of("authService", circuitBreakerConfig);
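Why it works: This ensures that every failure mode you have listed, not just one exception type, is counted towards the failure rate, leading to a more accurate circuit breaker state. Keep in mind that once recordExceptions is set, only the listed exceptions (and their subclasses) count as failures; with no list configured, Resilience4j records every exception as a failure.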
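A quick way to reason about which errors slip through is to model the matching rule yourself. The plain-Java sketch below assumes the breaker checks only the class of the thrown exception against the recorded list and does not look at the cause chain. The helper name countsAsFailure is hypothetical and is not a Resilience4j API.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.List;
import java.util.concurrent.TimeoutException;

final class RecordedExceptionsSketch {

    /**
     * Returns true if the thrown exception is an instance of any recorded class.
     * Assumption: only the thrown exception is checked, never its cause.
     */
    static boolean countsAsFailure(Throwable thrown, List<Class<? extends Throwable>> recorded) {
        return recorded.stream().anyMatch(type -> type.isInstance(thrown));
    }

    public static void main(String[] args) {
        List<Class<? extends Throwable>> onlyIo = List.of(IOException.class);

        // A subclass of IOException is matched.
        System.out.println(countsAsFailure(new SocketTimeoutException("read timed out"), onlyIo)); // prints true

        // An unrelated timeout type is not counted.
        System.out.println(countsAsFailure(new TimeoutException("no response"), onlyIo)); // prints false

        // A wrapper hides the IOException it carries as its cause.
        System.out.println(countsAsFailure(
                new RuntimeException(new IOException("connection reset")), onlyIo)); // prints false
    }
}
```

The third case is the one that tends to surprise people: HTTP clients often wrap low-level I/O errors in their own runtime exceptions, so the wrapper type has to be on the list as well.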
The "Hardcoded Fallback" Antipattern
What it is: Implementing a fallback mechanism that always returns static, stale, or incomplete data when the circuit breaker trips.
Why it’s bad: While intended to provide a degraded experience, this can sometimes be worse than returning an error, as users might act on outdated information or be confused by irrelevant data.
Diagnosis:
- Observe the behavior of your application when the circuit breaker is open. What data is presented to the user?
- Review the fallback logic in your application code.
Fix: Design fallbacks that provide a clear "degraded service" indication, offer alternative actions, or return the last known good state with a timestamp and a warning. For critical data, it might be better to return a specific "service temporarily unavailable" message.
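// Example of a fallback (executeCallable takes no fallback argument, so catch the failure instead)
String authInfo;
try {
    authInfo = authCircuitBreaker.executeCallable(() -> {
        // ... call auth-service ...
        return restTemplate.getForObject("http://auth-service/authenticate", String.class);
    });
} catch (Exception e) {
    // Fallback logic: reached when the call fails or the open circuit rejects it (CallNotPermittedException)
    log.warn("Auth service unavailable, returning degraded response.");
    authInfo = "DegradedAuthInfo"; // Or a more sophisticated fallback
}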
Why it works: It manages user expectations and provides a more honest representation of the system’s current capabilities, preventing potential data integrity issues.
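One way to return the last known good state with a timestamp and a warning is sketched below in plain Java (16 or later, for records). Everything here is an illustrative assumption: the names DegradedFallbackSketch and CachedValue, the staleness limit, and the choice to return an empty Optional so the caller can show a "service temporarily unavailable" message. The sketch is not thread-safe.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

final class DegradedFallbackSketch {

    /** A value that states explicitly when it was fetched and whether it is a stale copy. */
    record CachedValue(String value, Instant fetchedAt, boolean degraded) {}

    private final Duration maxStaleness;
    private CachedValue lastKnownGood; // null until the first successful call

    DegradedFallbackSketch(Duration maxStaleness) {
        this.maxStaleness = maxStaleness;
    }

    /** Call after every successful downstream response. */
    void recordSuccess(String value, Instant now) {
        lastKnownGood = new CachedValue(value, now, false);
    }

    /**
     * Call when the breaker is open or the call failed. Returns the last good
     * value flagged as degraded, or empty if there is none or it is too old.
     */
    Optional<CachedValue> fallback(Instant now) {
        if (lastKnownGood == null) {
            return Optional.empty();
        }
        Duration age = Duration.between(lastKnownGood.fetchedAt(), now);
        if (age.compareTo(maxStaleness) > 0) {
            return Optional.empty(); // too old to show: report "unavailable" instead
        }
        return Optional.of(
                new CachedValue(lastKnownGood.value(), lastKnownGood.fetchedAt(), true));
    }

    public static void main(String[] args) {
        DegradedFallbackSketch cache = new DegradedFallbackSketch(Duration.ofMinutes(5));
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        cache.recordSuccess("profile-v1", t0);
        System.out.println(cache.fallback(t0.plusSeconds(60)));  // stale copy, degraded = true
        System.out.println(cache.fallback(t0.plusSeconds(600))); // prints Optional.empty
    }
}
```

The degraded flag and fetchedAt timestamp let the UI tell the user how old the data is instead of silently presenting it as current.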
The "No Monitoring of Circuit Breaker State" Antipattern
What it is: Deploying circuit breakers without adequate monitoring and alerting on their state transitions.
Why it’s bad: You won’t know when a circuit breaker has tripped, how long it’s staying open, or if it’s constantly opening and closing (thrashing). This leaves you blind to systemic issues.
Diagnosis:
- Check your observability stack (e.g., Prometheus, Grafana, Datadog). Are there metrics for circuit breaker state (CLOSED, OPEN, HALF_OPEN)?
- Are there alerts configured for transitions to OPEN or for prolonged periods in OPEN?
Fix: Instrument your circuit breaker usage to expose metrics (e.g., using Micrometer with Resilience4j). Set up alerts for:
- Transitions from CLOSED to OPEN.
- Periods longer than a defined threshold (e.g., 5 minutes) in the OPEN state.
- Frequent transitions between OPEN and HALF_OPEN.
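# Prometheus / Grafana example query. The state metric is a gauge (1 while the breaker is in that state), so compare it directly instead of using rate().
# Metric and label names below are those typically exported via Micrometer; verify them against what your application actually exposes.
resilience4j_circuitbreaker_state{name="authService", state="open"} == 1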
Why it works: Proactive monitoring allows you to investigate downstream service issues immediately, understand the impact of circuit breakers, and tune their configurations before they cause significant user-facing problems.
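Detecting thrashing comes down to counting transitions to OPEN inside a time window. The plain-Java sketch below shows that logic on its own. The class name ThrashDetectorSketch, the onOpened hook, and the limits are hypothetical. In a real setup you would feed it from your breaker library's state-transition events or, more commonly, express the same rule as an alert in your metrics system.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

final class ThrashDetectorSketch {
    private final Deque<Instant> openTransitions = new ArrayDeque<>();
    private final Duration window;
    private final int maxOpensInWindow;

    ThrashDetectorSketch(Duration window, int maxOpensInWindow) {
        this.window = window;
        this.maxOpensInWindow = maxOpensInWindow;
    }

    /**
     * Call each time the breaker transitions to OPEN.
     * Returns true if the number of opens inside the window exceeds the limit.
     */
    boolean onOpened(Instant now) {
        openTransitions.addLast(now);
        Instant cutoff = now.minus(window);
        // Drop transitions that have fallen out of the window.
        while (!openTransitions.isEmpty() && openTransitions.peekFirst().isBefore(cutoff)) {
            openTransitions.removeFirst();
        }
        return openTransitions.size() > maxOpensInWindow;
    }

    public static void main(String[] args) {
        ThrashDetectorSketch detector = new ThrashDetectorSketch(Duration.ofMinutes(5), 3);
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        System.out.println(detector.onOpened(t0));                  // prints false
        System.out.println(detector.onOpened(t0.plusSeconds(60)));  // prints false
        System.out.println(detector.onOpened(t0.plusSeconds(120))); // prints false
        System.out.println(detector.onOpened(t0.plusSeconds(180))); // prints true
    }
}
```

Four opens within five minutes trip the alert here. A single open followed by a long quiet period does not, which keeps the alert focused on thrashing rather than on ordinary, isolated trips.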
The next problem you’ll face is understanding how rate limiters interact with circuit breakers, and how misconfigurations in both can lead to unexpected throttling.