A circuit breaker doesn’t just prevent a single service from being overloaded; it actively prevents the entire system from collapsing when one component starts to choke.
Let’s look at what happens when a service, say OrderService, depends on another, PaymentService.
// Example of a typical request flow
POST /orders
{
  "items": [...],
  "payment_method": "credit_card",
  "payment_details": {
    "card_number": "...",
    "expiry": "...",
    "cvv": "..."
  }
}
When OrderService receives this request, it needs to call PaymentService to authorize the payment.
// Simplified OrderService code
public Order createOrder(OrderRequest request) {
    // ... validate order ...
    PaymentResult paymentResult = paymentServiceClient.authorize(request.getPaymentDetails()); // This is the critical call
    if (paymentResult.isSuccess()) {
        // ... save order, confirm payment ...
        return new Order(...);
    } else {
        // ... handle payment failure ...
        throw new PaymentFailedException("Payment authorization failed.");
    }
}
Now, imagine PaymentService is experiencing a surge in traffic or a bug that makes it slow to respond.
If OrderService keeps calling a struggling PaymentService, the damage spreads in stages:
- Resource Exhaustion in OrderService: Each blocked call to PaymentService ties up a thread, a connection, or memory in OrderService. If PaymentService is slow, OrderService starts filling up with these waiting requests.
- Timeout and Retries: OrderService might have a timeout set. If PaymentService doesn’t respond within, say, 500ms, the call fails. OrderService might then retry, possibly with exponential backoff.
- The Avalanche: If PaymentService is consistently slow or unavailable, OrderService will be swamped with failed or timed-out requests. It can no longer process new legitimate orders, even for things that don’t require PaymentService (e.g., creating an order with a "pay later" option).
- Cascading Failure: Now, other services that depend on OrderService (like NotificationService or InventoryService) start failing because OrderService is unresponsive. The failure spreads like wildfire.
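The first and third bullets can be reproduced in a few lines. What follows is a minimal sketch rather than real OrderService code: it assumes OrderService handles requests on a small fixed thread pool, and it simulates a hung PaymentService with a latch. The class and method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class PoolExhaustionDemo {

    /** Returns true if a request that never calls PaymentService still could not finish in time. */
    static boolean payLaterOrderStarved(int poolSize, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch paymentServiceResponds = new CountDownLatch(1);
        try {
            // Occupy every worker thread with a call that waits on the (simulated) PaymentService.
            for (int i = 0; i < poolSize; i++) {
                pool.submit(() -> {
                    paymentServiceResponds.await(); // blocks like a hung remote call
                    return "authorized";
                });
            }
            // This order needs no payment call, yet it has to queue behind the blocked ones.
            Future<String> payLater = pool.submit(() -> "order created (pay later)");
            try {
                payLater.get(waitMillis, TimeUnit.MILLISECONDS);
                return false;
            } catch (TimeoutException starved) {
                return true;
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        } finally {
            paymentServiceResponds.countDown(); // release the simulated calls
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("pay-later order starved: " + payLaterOrderStarved(2, 200));
    }
}
```

The pay-later request is starved because no worker thread is free to pick it up, which is the situation a circuit breaker is meant to cut short.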
This is where the circuit breaker pattern comes in. It’s like an electrical circuit breaker: when there’s too much current (too many failures), it "trips" and stops the flow of electricity (requests) to prevent damage.
How it Works: The Three States
A circuit breaker lives within the calling service (OrderService in our example) and wraps the calls to the problematic dependency (PaymentService). It has three states:
- Closed: This is the normal state. Requests are allowed to flow to the downstream service. The breaker monitors for failures.
- Open: If the number of failures (or a failure rate) exceeds a certain threshold within a time window, the breaker "trips" and enters the Open state. In this state, all new requests to the downstream service are immediately rejected without even attempting to call it. This prevents the calling service from wasting resources on a failing dependency.
- Half-Open: After a configured wait period in the Open state, the breaker enters the Half-Open state. It lets a small, limited number of trial requests (often just one) through to the downstream service.
  - If the trial requests succeed, the breaker assumes the downstream service has recovered and transitions back to Closed.
  - If they fail, the breaker immediately returns to the Open state and the wait period starts over.
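The three states can be sketched as a small state machine. This is a deliberately simplified illustration, not how any particular library implements it: it trips on consecutive failures rather than a failure rate, takes the current time as a parameter so the transitions are easy to follow, and does not cap the number of concurrent trial calls in Half-Open. All names here are hypothetical.

```java
import java.util.function.Supplier;

class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    /** Thrown instead of calling the downstream service while the breaker is Open. */
    static class CircuitOpenException extends RuntimeException {
        CircuitOpenException() { super("circuit is open; call rejected"); }
    }

    private final int failureThreshold; // consecutive failures that trip the breaker
    private final long openMillis;      // how long to stay Open before allowing a trial call
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** Current state; moves Open -> Half-Open once the wait period has elapsed. */
    synchronized State state(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    <T> T call(Supplier<T> downstream, long nowMillis) {
        if (state(nowMillis) == State.OPEN) {
            throw new CircuitOpenException(); // fail fast, downstream is not touched
        }
        try {
            T result = downstream.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure(nowMillis);
            throw e;
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED; // a successful trial call in Half-Open closes the breaker
    }

    private synchronized void onFailure(long nowMillis) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;   // trip, or re-trip after a failed trial call
            openedAt = nowMillis; // restart the wait period
        }
    }
}
```

With `new SimpleCircuitBreaker(2, 1000)`, two failing calls at time 0 open the breaker, a call at time 500 is rejected without reaching the downstream service, and a successful call at time 1000 closes it again.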
Implementing Circuit Breakers
You’ll typically use a library for this. Popular choices include Resilience4j (Java), Polly (.NET), and Hystrix (Java, though largely superseded by Resilience4j). Let’s illustrate with Resilience4j concepts.
Imagine PaymentService is exposed via a REST client.
// Using Resilience4j's CircuitBreaker
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import java.time.Duration;
// --- Configuration ---
CircuitBreakerConfig circuitBreakerConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(50) // Trip if 50% of calls fail
    .waitDurationInOpenState(Duration.ofSeconds(30)) // Stay open for 30 seconds
    .permittedNumberOfCallsInHalfOpenState(2) // Allow 2 calls in Half-Open state
.build();
CircuitBreakerRegistry circuitBreakerRegistry = CircuitBreakerRegistry.of(circuitBreakerConfig);
CircuitBreaker paymentServiceCircuitBreaker = circuitBreakerRegistry.circuitBreaker("paymentService");
// --- Usage in OrderService ---
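public PaymentResult authorizePayment(PaymentDetails details) {
    // Wrap the call to the actual payment service client.
    // executeSupplier is used because executeCallable declares a checked Exception,
    // which this method signature does not allow.
    // While the breaker is Open, the call is skipped and CallNotPermittedException is thrown.
    return paymentServiceCircuitBreaker.executeSupplier(() -> {
        // This is the actual call to PaymentService.
        // If it throws an exception, the circuit breaker counts it as a failure.
        return actualPaymentServiceClient.authorize(details);
    });
}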
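To make the failureRateThreshold setting more concrete, here is a rough, library-free sketch of a count-based failure-rate window. It is not Resilience4j's implementation; the class, its methods, and the window sizes are hypothetical and only illustrate the idea of computing a failure percentage over the most recent calls.

```java
class FailureRateWindow {
    private final boolean[] outcomes; // ring buffer of recent calls: true = failed
    private final int minimumCalls;   // do not judge the service on fewer calls than this
    private int recorded = 0;         // number of calls currently in the window
    private int next = 0;             // slot the next outcome is written to
    private int failures = 0;         // failed calls currently in the window

    FailureRateWindow(int windowSize, int minimumCalls) {
        this.outcomes = new boolean[windowSize];
        this.minimumCalls = minimumCalls;
    }

    void record(boolean failed) {
        if (recorded == outcomes.length) {
            if (outcomes[next]) {
                failures--; // the oldest outcome is about to be overwritten
            }
        } else {
            recorded++;
        }
        outcomes[next] = failed;
        if (failed) {
            failures++;
        }
        next = (next + 1) % outcomes.length;
    }

    double failureRatePercent() {
        return recorded == 0 ? 0.0 : 100.0 * failures / recorded;
    }

    /** True once enough calls were seen and the failure rate reached the threshold. */
    boolean shouldTrip(double thresholdPercent) {
        return recorded >= minimumCalls && failureRatePercent() >= thresholdPercent;
    }
}
```

Resilience4j exposes comparable knobs as slidingWindowSize and minimumNumberOfCalls on CircuitBreakerConfig. The minimum-calls guard matters: without it, a single failed call right after startup would already read as a 100% failure rate.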
Common Causes of Circuit Breaker Tripping (and How to Fix Them):
- Downstream Service Overload/Slowdown:
  - Diagnosis: Monitor the metrics of PaymentService. Look for increased latency, high CPU/memory usage, or slow database queries. Check OrderService’s circuit breaker metrics: is it frequently transitioning to Open?
  - Fix: Scale up PaymentService (more instances, more powerful hardware), optimize its code/database queries, or implement rate limiting on PaymentService itself to prevent overload.
  - Why it Works: Reduces the load on PaymentService, allowing it to respond faster and keeping the failure rate below the breaker’s threshold.
- Network Issues Between Services:
  - Diagnosis: Check network connectivity, firewall rules, and latency between OrderService and PaymentService. Look for packet loss.
  - Fix: Resolve network configuration errors, improve network infrastructure, or ensure services are deployed in the same low-latency network zone.
  - Why it Works: Ensures requests reliably reach PaymentService and responses return promptly, preventing timeouts.
- Database Bottlenecks in PaymentService:
  - Diagnosis: Analyze PaymentService’s database performance. Look for slow queries, high I/O wait, or contention.
  - Fix: Optimize database indexes, tune query performance, upgrade database hardware, or implement read replicas for PaymentService.
  - Why it Works: Faster database operations mean PaymentService can process requests more quickly, reducing latency and failure rates.
- Incorrect Timeout Configuration:
  - Diagnosis: If PaymentService is actually healthy but takes slightly longer than OrderService’s configured client timeout (e.g., 200ms), each call times out, is counted as a failure, and the breaker eventually trips. Check OrderService’s client timeout settings and the waitDurationInOpenState of the circuit breaker.
  - Fix: Increase the client timeout in OrderService so it is longer than PaymentService’s typical response time, with some headroom for tail latency. Adjust waitDurationInOpenState to a reasonable value (e.g., 30-60 seconds); it does not decide whether the breaker trips, but it does decide how long calls are rejected afterward.
  - Why it Works: Allows legitimate, albeit slightly slower, requests to complete successfully, and prevents premature tripping.
- Bugs in PaymentService Causing Exceptions:
  - Diagnosis: Examine PaymentService logs for recurring exceptions, especially those not handled gracefully. Ensure the circuit breaker is configured to count the right types of exceptions as failures.
  - Fix: Fix the bugs in PaymentService. For temporary issues or specific error types that shouldn’t trip the breaker, configure the circuit breaker to ignore certain exceptions.
  - Why it Works: Resolves the root cause of failures, allowing PaymentService to become stable.
- Misconfigured Circuit Breaker Thresholds:
  - Diagnosis: The failureRateThreshold might be too low (e.g., 10%), causing the breaker to trip on transient blips. The permittedNumberOfCallsInHalfOpenState might be too low, so a single unlucky trial call sends the breaker back to Open and delays recovery.
  - Fix: Increase failureRateThreshold (e.g., to 50%). Increase permittedNumberOfCallsInHalfOpenState (e.g., to 5 or 10) to give the service more chances to prove it’s recovered in Half-Open state.
  - Why it Works: Makes the breaker more tolerant of minor, temporary issues and more likely to close the circuit again once the downstream service is stable.
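For the timeout item above, the following sketch shows one way to set explicit client timeouts using the JDK's built-in java.net.http client (Java 11+). The URL, the latency figure, and the headroom are made-up assumptions for illustration; real values should come from PaymentService's measured latency, and OrderService may well use a different HTTP client.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

class PaymentClientTimeouts {
    // Hypothetical numbers: tail latency observed for PaymentService plus some headroom.
    static final Duration OBSERVED_TAIL_LATENCY = Duration.ofMillis(350);
    static final Duration REQUEST_TIMEOUT = OBSERVED_TAIL_LATENCY.plusMillis(150);
    static final Duration CONNECT_TIMEOUT = Duration.ofMillis(200);

    /** Client with a bounded connection-establishment time. */
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(CONNECT_TIMEOUT)
                .build();
    }

    /** Request that gives up if no response arrives within REQUEST_TIMEOUT. */
    static HttpRequest authorizeRequest(String jsonBody) {
        return HttpRequest.newBuilder()
                .uri(URI.create("http://payment-service.internal/authorize")) // hypothetical URL
                .timeout(REQUEST_TIMEOUT)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }
}
```

A request that exceeds the timeout fails with an exception, which a wrapping circuit breaker would then count as a failure. That is why a timeout set below the service's normal latency can trip the breaker even though nothing is actually broken.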
After fixing the root cause of the PaymentService issues and ensuring your circuit breaker is configured appropriately, the next error you might encounter is a TooManyRequestsException if you haven’t also implemented rate limiting on your own services.