Chaos engineering isn’t about breaking things randomly; it’s about deliberately testing your system’s resilience by introducing failures and observing how it responds, especially when critical components like circuit breakers are involved.

Imagine you have a microservice, user-service, that depends on payment-service. If payment-service becomes slow or unresponsive, user-service could get bogged down, leading to cascading failures. Circuit breakers are designed to prevent this. When user-service notices payment-service is failing too often, it "opens" the circuit, stopping further calls to payment-service for a while and immediately returning an error or a fallback response. After a timeout, it enters a "half-open" state, allowing a few test calls. If those succeed, the circuit "closes" again.
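
The state transitions described above can be sketched in plain Java. This is only an illustration, not how Hystrix is implemented: the class name SimpleCircuitBreaker, the consecutive-failure trip rule, and the injected clock are all assumptions made to keep the example small.

```java
import java.util.function.LongSupplier;

/** Minimal circuit breaker sketch: CLOSED -> OPEN -> HALF_OPEN -> CLOSED or OPEN. */
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures that trip the breaker (assumed policy)
    private final long openMillis;      // how long to stay OPEN before allowing a trial call
    private final LongSupplier clock;   // injectable time source in milliseconds, handy for tests

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Returns true if the caller may attempt the remote call right now. */
    synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN; // the wait is over: let exactly one trial call through
            return true;
        }
        return state == State.CLOSED; // OPEN and HALF_OPEN both reject further calls
    }

    /** A successful call (including a successful trial call) closes the circuit. */
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    /** A failed trial call, or too many consecutive failures, opens the circuit. */
    synchronized void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }

    synchronized State state() {
        return state;
    }
}

class SimpleCircuitBreakerDemo {
    public static void main(String[] args) {
        long[] now = {0L}; // fake clock so the demo does not need to sleep
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(3, 5000, () -> now[0]);

        for (int i = 0; i < 3; i++) {
            breaker.recordFailure();
        }
        System.out.println(breaker.state());        // OPEN
        System.out.println(breaker.allowRequest()); // false

        now[0] = 5000; // five seconds later
        System.out.println(breaker.allowRequest()); // true
        System.out.println(breaker.state());        // HALF_OPEN

        breaker.recordSuccess();
        System.out.println(breaker.state());        // CLOSED
    }
}
```

Injecting the clock is a deliberate choice: it lets a test walk through all three states without waiting for real time to pass.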

Let’s simulate a payment-service failure and see how a circuit breaker in user-service reacts. The example uses Hystrix, a widely used Java library that Netflix has since placed in maintenance mode; the same experiment applies to its successors.

Here’s a simplified user-service dependency on payment-service:

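// User Service
import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class UserService {

    private final RestTemplate restTemplate;
    private final PaymentServiceFallback paymentServiceFallback; // A fallback implementation

    // Constructor injection so the final fields are initialized
    public UserService(RestTemplate restTemplate, PaymentServiceFallback paymentServiceFallback) {
        this.restTemplate = restTemplate;
        this.paymentServiceFallback = paymentServiceFallback;
    }

    // Annotate the method that calls payment-service with HystrixCommand
    @HystrixCommand(
        fallbackMethod = "reliablePayment", // Called when the call fails, times out, is rejected, or the circuit is open
        commandProperties = {
            @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold", value = "10"), // Min requests in the rolling window before the breaker can trip
            @HystrixProperty(name = "circuitBreaker.sleepWindowInMilliseconds", value = "5000"), // How long the circuit stays open before a trial request is allowed
            @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "50") // Error rate at which the circuit opens
        }
    )
    public PaymentDetails processPayment(String userId, PaymentRequest request) {
        // Actual call to payment-service
        return restTemplate.postForObject("http://payment-service/process", request, PaymentDetails.class);
    }

    // Fallback method: must have the same signature as processPayment
    public PaymentDetails reliablePayment(String userId, PaymentRequest request) {
        System.out.println("Payment service is down. Using fallback.");
        return paymentServiceFallback.processOffline(userId, request);
    }
}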

Now, let’s introduce a failure. We can achieve this in a few ways:

  1. Network Level Blockage: Use iptables on the payment-service host to drop all incoming traffic on port 8080 (assuming payment-service runs on this port).

    # On the payment-service host
    sudo iptables -A INPUT -p tcp --dport 8080 -j DROP
    

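    Because DROP silently discards packets instead of refusing the connection, calls from user-service hang until a timeout fires rather than failing immediately. Note that -A appends the rule, so an earlier ACCEPT rule for the same port would take precedence.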

  2. Application Level Slowdown/Error: If you have control over the payment-service deployment, you can:

    • Introduce Artificial Latency: Modify the payment-service to Thread.sleep(10000) before processing requests.
    • Return Errors: Configure payment-service to randomly return HTTP 500 errors.
  3. Service Discovery Failure: If you use a service discovery mechanism like Eureka or Consul, you can temporarily "deregister" the payment-service instance. This makes it unavailable to user-service.

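    • Eureka Example: The stock Eureka dashboard is read-only, so deregister the instance through Eureka’s REST API instead (on a Spring Cloud Eureka server, DELETE /eureka/apps/PAYMENT-SERVICE/<instance-id>). A running instance re-registers on its next heartbeat, so for a longer outage set its status to OUT_OF_SERVICE or stop the instance.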
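
If you cannot modify the real payment-service, the application-level faults above can be approximated with a stand-in. The following is a minimal sketch using the JDK’s built-in HTTP server; the class name FaultyPaymentStub, the fixed delay, and the "fail every Nth request" rule are assumptions for illustration (deterministic failures are used instead of random ones so runs are repeatable).

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for payment-service that injects latency and HTTP 500 errors. */
class FaultyPaymentStub {

    /** Decides the status code: every failEveryN-th request fails; 0 disables failures. */
    static int statusFor(long requestNumber, int failEveryN) {
        return (failEveryN > 0 && requestNumber % failEveryN == 0) ? 500 : 200;
    }

    /** Starts the stub on the given port and returns the running server. */
    static HttpServer start(int port, long delayMillis, int failEveryN) throws IOException {
        AtomicLong counter = new AtomicLong();
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/process", exchange -> {
            exchange.getRequestBody().readAllBytes(); // drain the request body
            try {
                Thread.sleep(delayMillis); // artificial latency
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            int status = statusFor(counter.incrementAndGet(), failEveryN);
            String json = status == 200 ? "{\"status\":\"ok\"}" : "{\"status\":\"error\"}";
            byte[] body = json.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(status, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        // Assumed settings: port 8080, 2 s delay, every second request returns 500.
        start(8080, 2000, 2);
        System.out.println("Faulty payment stub listening on port 8080");
    }
}
```

Pointing user-service at this stub gives a 50% error rate with two-second responses, which is enough to exercise both the timeout path and the error-threshold path.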

Once user-service starts making calls to payment-service and they begin failing (due to network block, slowdown, or service discovery issue), Hystrix will start tracking these failures.

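  • Error Tracking: Hystrix monitors the success/failure rate of calls within a rolling window (10 seconds by default). The circuitBreaker.requestVolumeThreshold (set to 10 here) means at least 10 requests must occur within that window before Hystrix even considers opening the circuit.
  • Circuit Opening: Once that volume is met, if the error percentage in the window reaches 50% (circuitBreaker.errorThresholdPercentage), the circuit breaker trips and moves to the "OPEN" state.
  • Immediate Fallback: From this point on, any call to processPayment will not go to payment-service. Instead, Hystrix will immediately execute the reliablePayment method. You’ll see output like: Payment service is down. Using fallback. The same message also appears for individual failed or timed-out calls while the circuit is still closed, so the message alone does not prove the breaker has opened; the difference is that the fallback now returns without waiting on payment-service.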
  • Half-Open State: After circuitBreaker.sleepWindowInMilliseconds (5000 ms, or 5 seconds) has passed, the circuit breaker transitions to "HALF-OPEN." It will allow a single request to payment-service.
    • If this request succeeds, the circuit breaker resets and moves back to "CLOSED."
    • If this request fails, the circuit breaker immediately returns to "OPEN," and the sleepWindowInMilliseconds timer restarts.
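
The trip decision can be worked through with concrete numbers. This sketch is not Hystrix code; it assumes the rule is "enough requests in the window, and an error percentage at or above the threshold", and the class and method names are hypothetical.

```java
/** Worked example of the trip rule: volume threshold first, then error percentage. */
class TripDecision {

    /**
     * @param total  requests seen in the current rolling window
     * @param failed how many of those requests failed
     * @return true if the breaker should open under the assumed rule
     */
    static boolean shouldTrip(int total, int failed,
                              int requestVolumeThreshold, int errorThresholdPercentage) {
        if (total < requestVolumeThreshold) {
            return false; // not enough traffic to judge, even if every call failed
        }
        int errorPercentage = (int) ((double) failed / total * 100);
        return errorPercentage >= errorThresholdPercentage;
    }

    public static void main(String[] args) {
        // Thresholds from the example configuration: 10 requests, 50% errors.
        System.out.println(shouldTrip(9, 9, 10, 50));   // false: below the volume threshold
        System.out.println(shouldTrip(10, 4, 10, 50));  // false: 40% errors
        System.out.println(shouldTrip(10, 5, 10, 50));  // true: 50% errors
        System.out.println(shouldTrip(20, 11, 10, 50)); // true: 55% errors
    }
}
```

The first case is the one that surprises people during a chaos experiment: with low traffic, nine failures out of nine never open the circuit, so make sure your test generates enough requests inside the window.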

To verify this, you can expose Hystrix’s monitoring endpoints (e.g., /hystrix.stream or use the Hystrix Dashboard). You’ll see the state changes: CLOSED -> OPEN -> HALF-OPEN -> CLOSED (or OPEN again).

To clean up the iptables rule:

# On the payment-service host
sudo iptables -D INPUT -p tcp --dport 8080 -j DROP

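The most counterintuitive aspect of circuit breaker testing is that you’re not just testing the breaker itself, but the entire failure path. This includes network resilience, the timeouts configured at each layer, and the efficacy of your fallback mechanisms. Hystrix enforces its own execution timeout (execution.isolation.thread.timeoutInMilliseconds, 1000 ms by default), so the caller receives the fallback quickly, but the worker thread running the RestTemplate call can stay blocked until the HTTP client itself gives up. If that client is configured to wait 60 seconds, or has no timeout at all, a burst of slow calls can exhaust the Hystrix thread pool and cause unrelated calls to be rejected. Set connect and read timeouts on the RestTemplate that are consistent with the Hystrix timeout.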
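
To see why the timeout on the underlying call matters, here is a small JDK-only sketch of a caller-side deadline with a fallback. It is not Hystrix and does not use RestTemplate; DeadlineDemo and callWithDeadline are hypothetical names, and the "remote call" is simulated with Thread.sleep.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Bounds a slow call with a deadline and returns a fallback when it is missed. */
class DeadlineDemo {

    static <T> T callWithDeadline(ExecutorService pool, Callable<T> remoteCall,
                                  long timeoutMillis, T fallback) {
        Future<T> future = pool.submit(remoteCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker. A sleeping thread reacts to this, but a thread
            // blocked on socket I/O may not, which is why the client needs its own timeout.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the call itself failed
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            // Simulated slow dependency: takes 2 s, but the caller only waits 100 ms.
            String slow = callWithDeadline(pool, () -> {
                Thread.sleep(2000);
                return "paid";
            }, 100, "fallback");
            System.out.println(slow); // fallback

            String fast = callWithDeadline(pool, () -> "paid", 1000, "fallback");
            System.out.println(fast); // paid
        } finally {
            pool.shutdownNow();
        }
    }
}
```

With a pool of one thread, the second call only succeeds because the first worker could be interrupted. If the worker had been stuck in a blocking read with a 60-second client timeout, the pool would have stayed full for that long.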

The next logical step after verifying your circuit breakers is to test scenarios where the fallback mechanism itself fails.
