A circuit breaker doesn’t just prevent a single service from being overloaded; it actively prevents the entire system from collapsing when one component starts to choke.

Let’s look at what happens when a service, say OrderService, depends on another, PaymentService.

// Example of a typical request flow
POST /orders
{
  "items": [...],
  "payment_method": "credit_card",
  "payment_details": {
    "card_number": "...",
    "expiry": "...",
    "cvv": "..."
  }
}

When OrderService receives this request, it needs to call PaymentService to authorize the payment.

// Simplified OrderService code
public Order createOrder(OrderRequest request) {
    // ... validate order ...

    PaymentResult paymentResult = paymentServiceClient.authorize(request.getPaymentDetails()); // This is the critical call

    if (paymentResult.isSuccess()) {
        // ... save order, confirm payment ...
        return new Order(...);
    } else {
        // ... handle payment failure ...
        throw new PaymentFailedException("Payment authorization failed.");
    }
}

Now, imagine PaymentService is experiencing a surge in traffic or a bug that makes it slow to respond.

If OrderService keeps making calls to a struggling PaymentService, something bad happens:

  1. Resource Exhaustion in OrderService: Each blocked call to PaymentService ties up a thread, a connection, or memory in OrderService. If PaymentService is slow, OrderService starts filling up with these waiting requests.
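  2. Timeout and Retries: OrderService might have a timeout set. If PaymentService doesn’t respond within, say, 500ms, the call fails. OrderService might then retry, possibly with exponential backoff. Every retry is an extra request sent to a service that is already struggling, so retries can make the slowdown worse.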
  3. The Avalanche: If PaymentService is consistently slow or unavailable, OrderService will be swamped with failed or timed-out requests. It can no longer process new legitimate orders, even for things that don’t require PaymentService (e.g., creating an order with a "pay later" option).
  4. Cascading Failure: Now, other services that depend on OrderService (like NotificationService or InventoryService) start failing because OrderService is unresponsive. The failure spreads like wildfire.
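
The resource exhaustion in step 1 can be estimated with Little's law (requests in flight ≈ arrival rate × latency). The sketch below is only a back-of-the-envelope illustration: the pool size, request rate, and latencies are made-up figures, not measurements from any real OrderService.

```java
// Rough estimate of how many OrderService threads are tied up waiting on
// PaymentService. All numbers are hypothetical and purely illustrative.
class ThreadPressure {
    // Little's law: average requests in flight = arrival rate * average latency.
    static double inFlight(double requestsPerSecond, double latencySeconds) {
        return requestsPerSecond * latencySeconds;
    }

    public static void main(String[] args) {
        int poolSize = 200;   // assumed worker-thread pool size
        double rate = 100.0;  // assumed incoming orders per second

        double healthy = inFlight(rate, 0.05); // PaymentService answers in 50 ms
        double degraded = inFlight(rate, 3.0); // PaymentService answers in 3 s

        System.out.printf("healthy:  %.0f threads busy, pool has %d%n", healthy, poolSize);
        System.out.printf("degraded: %.0f threads needed, pool has %d%n", degraded, poolSize);
    }
}
```

Under these assumptions, a jump from 50 ms to 3 s pushes the demand from a handful of threads to 300, which is more than the 200 available, so new requests start queuing even though the incoming traffic has not changed.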

This is where the circuit breaker pattern comes in. It’s like an electrical circuit breaker: when there’s too much current (too many failures), it "trips" and stops the flow of electricity (requests) to prevent damage.

How it Works: The Three States

A circuit breaker lives within the calling service (OrderService in our example) and wraps the calls to the problematic dependency (PaymentService). It has three states:

  1. Closed: This is the normal state. Requests are allowed to flow to the downstream service. The breaker monitors for failures.
  2. Open: If the number of failures (or a failure rate) exceeds a certain threshold within a time window, the breaker "trips" and enters the Open state. In this state, all new requests to the downstream service are immediately rejected without even attempting to call it. This prevents the calling service from wasting resources on a failing dependency.
  3. Half-Open: After a configured wait period in the Open state, the breaker enters the Half-Open state. It allows a limited number of trial requests (a single one in the simplest implementations) to pass through to the downstream service.
    • If the trial requests succeed, the breaker assumes the downstream service has recovered and transitions back to Closed.
    • If they fail, the breaker immediately returns to the Open state and the wait period starts again.
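
To make the three states concrete, here is a minimal sketch of the state machine in plain Java. It is not how any particular library implements it: it trips on consecutive failures rather than a failure rate, allows a single trial call in Half-Open, is not thread-safe, and the class name, constructor parameters, and injectable clock are all assumptions made for illustration.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Minimal single-threaded sketch of Closed / Open / Half-Open.
// Real libraries add sliding windows, thread safety, and metrics.
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures before tripping
    private final long openMillis;      // how long to stay Open
    private final LongSupplier clock;   // injectable time source, in milliseconds

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    State state() {
        // Open -> Half-Open once the wait period has elapsed.
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    <T> T call(Supplier<T> action) {
        if (state() == State.OPEN) {
            // Fail fast: the downstream service is never contacted.
            throw new IllegalStateException("circuit open");
        }
        try {
            T result = action.get();
            // A success resets the failure count and closes the breaker.
            consecutiveFailures = 0;
            state = State.CLOSED;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            // A failed trial call, or too many failures while Closed, opens the breaker.
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAt = clock.getAsLong();
            }
            throw e;
        }
    }
}
```

Passing the clock in as a `LongSupplier` (for example `System::currentTimeMillis` in real use) keeps the Open-to-Half-Open transition easy to exercise in tests without sleeping.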

Implementing Circuit Breakers

You’ll typically use a library for this. Popular choices include Resilience4j (Java), Polly (.NET), and Hystrix (Java, now in maintenance mode and largely superseded by Resilience4j). Let’s illustrate with Resilience4j concepts.

Imagine PaymentService is exposed via a REST client.

// Using Resilience4j's CircuitBreaker
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import java.time.Duration;

// --- Configuration ---
CircuitBreakerConfig circuitBreakerConfig = CircuitBreakerConfig.custom()
    .failureRateThreshold(50) // Trip if 50% of calls fail
    .waitDurationInOpenState(Duration.ofSeconds(30)) // Stay open for 30 seconds
    .permittedNumberOfCallsInHalfOpenState(2) // Allow 2 calls in Half-Open state
    .build();

CircuitBreakerRegistry circuitBreakerRegistry = CircuitBreakerRegistry.of(circuitBreakerConfig);
CircuitBreaker paymentServiceCircuitBreaker = circuitBreakerRegistry.circuitBreaker("paymentService");

// --- Usage in OrderService ---
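public PaymentResult authorizePayment(PaymentDetails details) {
    // Wrap the call to the actual payment service client.
    // executeSupplier fits here because the client call throws no checked exceptions
    // (executeCallable would require this method to declare "throws Exception").
    // While the breaker is Open, the call is rejected with CallNotPermittedException.
    return paymentServiceCircuitBreaker.executeSupplier(() -> {
        // This is the actual call to PaymentService
        // If this throws an exception, the circuit breaker counts it as a failure
        return actualPaymentServiceClient.authorize(details);
    });
}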
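
When the breaker is Open, the wrapped call is rejected instead of executed, so the caller needs to decide what to do with that rejection. The sketch below shows one possible fallback in plain Java. `BreakerOpenException` is a hypothetical stand-in for the rejection exception a library throws (Resilience4j uses `CallNotPermittedException`), and the `Outcome` values and the "defer the payment" policy are assumptions for illustration, loosely based on the "pay later" idea mentioned earlier.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the exception a breaker throws when it rejects a call.
class BreakerOpenException extends RuntimeException {
    BreakerOpenException(String message) { super(message); }
}

class PaymentFallback {
    // Hypothetical result values; a real system would use its own domain types.
    enum Outcome { AUTHORIZED, DECLINED, DEFERRED }

    // Runs the breaker-protected call and degrades gracefully if it is rejected.
    // Any other exception (a genuine failure) still propagates to the caller.
    static Outcome authorizeOrDefer(Supplier<Outcome> protectedCall) {
        try {
            return protectedCall.get();
        } catch (BreakerOpenException e) {
            // Breaker is Open: PaymentService was not contacted at all.
            // Accept the order and authorize the payment later.
            return Outcome.DEFERRED;
        }
    }

    public static void main(String[] args) {
        Supplier<Outcome> healthy = () -> Outcome.AUTHORIZED;
        Supplier<Outcome> rejected = () -> {
            throw new BreakerOpenException("paymentService breaker is open");
        };
        System.out.println(authorizeOrDefer(healthy));
        System.out.println(authorizeOrDefer(rejected));
    }
}
```

Whether deferring is acceptable is a business decision; the point is only that a rejected call is cheap and immediate, which gives OrderService room to do something more useful than wait.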

Common Causes of Circuit Breaker Tripping (and How to Fix Them):

  1. Downstream Service Overload/Slowdown:

    • Diagnosis: Monitor the metrics of PaymentService. Look for increased latency, high CPU/memory usage, or slow database queries. Check OrderService’s circuit breaker metrics – is it frequently transitioning to Open?
    • Fix: Scale up PaymentService (more instances, more powerful hardware), optimize its code/database queries, or implement rate limiting on PaymentService itself to prevent overload.
    • Why it Works: Reduces the load on PaymentService, so it responds faster and its failure rate stays below the breaker’s threshold.
  2. Network Issues Between Services:

    • Diagnosis: Check network connectivity, firewall rules, and latency between OrderService and PaymentService. Look for packet loss.
    • Fix: Resolve network configuration errors, improve network infrastructure, or ensure services are deployed in the same low-latency network zone.
    • Why it Works: Ensures requests reliably reach PaymentService and responses return promptly, preventing timeouts.
  3. Database Bottlenecks in PaymentService:

    • Diagnosis: Analyze PaymentService’s database performance. Look for slow queries, high I/O wait, or contention.
    • Fix: Optimize database indexes, tune query performance, upgrade database hardware, or implement read replicas for PaymentService.
    • Why it Works: Faster database operations mean PaymentService can process requests more quickly, reducing latency and failure rates.
  4. Incorrect Timeout Configuration:

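    • Diagnosis: If PaymentService is actually healthy but routinely takes slightly longer than OrderService’s configured client timeout (e.g., 200ms), those calls time out, are counted as failures, and the breaker trips. Compare OrderService’s client timeout with PaymentService’s observed latency, especially the slow tail. Also check the waitDurationInOpenState of the circuit breaker: it does not cause trips, but if it is too long the breaker stays Open well after PaymentService has recovered.
    • Fix: Increase the client timeout in OrderService so that it comfortably covers PaymentService’s normal response times, including the slower requests, not just the average. Adjust waitDurationInOpenState to a reasonable value (e.g., 30-60 seconds).
    • Why it Works: Allows legitimate, albeit slightly slower, requests to complete successfully, which prevents premature tripping, while a sensible wait duration lets the breaker recover promptly.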
  5. Bugs in PaymentService Causing Exceptions:

    • Diagnosis: Examine PaymentService logs for recurring exceptions, especially those not handled gracefully. Ensure the circuit breaker is configured to count the right types of exceptions as failures.
    • Fix: Fix the bugs in PaymentService. For temporary issues or specific error types that shouldn’t trip the breaker, configure the circuit breaker to ignore certain exceptions.
    • Why it Works: Resolves the root cause of failures, allowing PaymentService to become stable.
  6. Misconfigured Circuit Breaker Thresholds:

    • Diagnosis: The failureRateThreshold might be too low (e.g., 10%), causing the breaker to trip on transient blips. The permittedNumberOfCallsInHalfOpenState might be too low, preventing recovery.
    • Fix: Increase failureRateThreshold (e.g., to 50%). Increase permittedNumberOfCallsInHalfOpenState (e.g., to 5 or 10) to give the service more chances to prove it’s recovered in Half-Open state.
    • Why it Works: Makes the breaker more tolerant of minor, temporary issues and more likely to close the circuit again once the downstream service is stable.
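
To see why thresholds matter, it helps to look at how a failure rate is evaluated over recent calls. The following is a simplified sketch of a count-based sliding window, loosely modeled on the window-size and minimum-call settings that libraries such as Resilience4j expose. The class and its parameters are hypothetical and are not the library's implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Count-based sliding window: tracks the outcome of the last N calls.
class FailureRateWindow {
    private final int windowSize;          // number of most recent calls to keep
    private final int minimumCalls;        // do not evaluate the rate before this many calls
    private final double thresholdPercent; // trip at or above this failure rate
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failure
    private int failures = 0;

    FailureRateWindow(int windowSize, int minimumCalls, double thresholdPercent) {
        this.windowSize = windowSize;
        this.minimumCalls = minimumCalls;
        this.thresholdPercent = thresholdPercent;
    }

    void record(boolean failed) {
        outcomes.addLast(failed);
        if (failed) failures++;
        // Evict the oldest outcome once the window is full.
        if (outcomes.size() > windowSize && outcomes.removeFirst()) failures--;
    }

    double failureRatePercent() {
        return outcomes.isEmpty() ? 0.0 : 100.0 * failures / outcomes.size();
    }

    boolean shouldTrip() {
        // Too few samples: a couple of early failures should not open the breaker.
        return outcomes.size() >= minimumCalls && failureRatePercent() >= thresholdPercent;
    }

    public static void main(String[] args) {
        FailureRateWindow window = new FailureRateWindow(10, 5, 50.0);
        boolean[] calls = {true, true, false, false, false, true}; // true = failed call
        for (boolean failed : calls) {
            window.record(failed);
            System.out.printf("rate=%.1f%% trip=%b%n", window.failureRatePercent(), window.shouldTrip());
        }
    }
}
```

With a small window or a low minimum-call count, a short burst of errors dominates the rate and trips the breaker; a larger window smooths transient blips at the cost of reacting more slowly to a real outage.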

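After fixing the root cause of the PaymentService issues and ensuring your circuit breaker is configured appropriately, the next problem you are likely to run into is overload from your own callers. A circuit breaker protects you from a failing dependency but does not limit incoming traffic, so pair it with rate limiting on your own services, which typically rejects excess requests with an HTTP 429 (Too Many Requests) response.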
