Chaos engineering isn’t about breaking things randomly; it’s about deliberately testing your system’s resilience by introducing failures and observing how it responds, especially when critical components like circuit breakers are involved.
Imagine you have a microservice, user-service, that depends on payment-service. If payment-service becomes slow or unresponsive, user-service could get bogged down, leading to cascading failures. Circuit breakers are designed to prevent this. When user-service notices payment-service is failing too often, it "opens" the circuit, stopping further calls to payment-service for a while and immediately returning an error or a fallback response. After a timeout, it enters a "half-open" state, allowing a few test calls. If those succeed, the circuit "closes" again.
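The state transitions described above can be sketched in plain Java. This is a deliberately simplified, single-threaded illustration and not how Hystrix is implemented: the class name SimpleCircuitBreaker, the consecutive-failure rule, and the single trial call in the half-open state are all assumptions made for the example, and the current time is passed in explicitly so the behavior is easy to follow.

```java
import java.util.function.Supplier;

/** Hypothetical, single-threaded circuit breaker sketch; not Hystrix's implementation. */
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures that trip the breaker (assumed rule)
    private final long openMillis;      // how long to stay OPEN before allowing a trial call
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    State state() {
        return state;
    }

    /** Runs remoteCall unless the circuit is open; 'now' is a caller-supplied clock in milliseconds. */
    <T> T call(Supplier<T> remoteCall, Supplier<T> fallback, long now) {
        if (state == State.OPEN) {
            if (now - openedAt < openMillis) {
                return fallback.get();      // short-circuit: the remote call is skipped
            }
            state = State.HALF_OPEN;        // the wait is over: let one trial call through
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0;
            state = State.CLOSED;           // a success closes (or keeps closed) the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;         // trip, or re-open after a failed trial call
                openedAt = now;
            }
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(3, 5000);
        Supplier<String> failing = () -> { throw new RuntimeException("payment-service down"); };
        Supplier<String> healthy = () -> "paid";
        Supplier<String> fallback = () -> "fallback";

        for (int i = 0; i < 3; i++) {
            breaker.call(failing, fallback, 0L);
        }
        System.out.println(breaker.state());                        // OPEN
        System.out.println(breaker.call(healthy, fallback, 1000L)); // fallback (still open)
        System.out.println(breaker.call(healthy, fallback, 5000L)); // paid (trial call succeeded)
        System.out.println(breaker.state());                        // CLOSED
    }
}
```

Real libraries usually trip on an error rate over a rolling window rather than on consecutive failures, and they have to be thread-safe; the sketch leaves both out to keep the three states visible.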
Let’s simulate a payment-service failure and see how a circuit breaker in user-service (using Hystrix, a popular Java library, as an example) reacts.
Here’s a simplified user-service dependency on payment-service:
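// User Service
import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class UserService {
    private final RestTemplate restTemplate;
    private final PaymentServiceFallback paymentServiceFallback; // A fallback implementation

    // Constructor injection so the final fields are initialized
    public UserService(RestTemplate restTemplate, PaymentServiceFallback paymentServiceFallback) {
        this.restTemplate = restTemplate;
        this.paymentServiceFallback = paymentServiceFallback;
    }

    // Annotate the method that calls payment-service with HystrixCommand
    @HystrixCommand(
        fallbackMethod = "reliablePayment", // Called when the call fails, times out, is rejected, or the circuit is open
        commandProperties = {
            @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold", value = "10"), // Min requests in the rolling window before the circuit can trip
            @HystrixProperty(name = "circuitBreaker.sleepWindowInMilliseconds", value = "5000"), // Time the circuit stays open before a trial request is allowed
            @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "50") // Error rate at which the circuit opens
        }
    )
    public PaymentDetails processPayment(String userId, PaymentRequest request) {
        // Actual call to payment-service
        return restTemplate.postForObject("http://payment-service/process", request, PaymentDetails.class);
    }

    // Fallback method: must have a signature compatible with processPayment
    public PaymentDetails reliablePayment(String userId, PaymentRequest request) {
        System.out.println("Payment service is down. Using fallback.");
        return paymentServiceFallback.processOffline(userId, request);
    }
}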
Now, let’s introduce a failure. We can achieve this in a few ways:
- Network Level Blockage: Use iptables on the payment-service host to drop all incoming traffic on port 8080 (assuming payment-service runs on this port):
  # On the payment-service host
  sudo iptables -A INPUT -p tcp --dport 8080 -j DROP
  This will cause RestTemplate calls from user-service to time out.
- Application Level Slowdown/Error: If you have control over the payment-service deployment, you can:
  - Introduce Artificial Latency: Modify payment-service to call Thread.sleep(10000) before processing requests.
  - Return Errors: Configure payment-service to randomly return HTTP 500 errors.
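- Service Discovery Failure: If you use a service discovery mechanism like Eureka or Consul, you can temporarily "deregister" the payment-service instance. This makes it unavailable to user-service once clients refresh their cached copy of the registry.
  - Eureka Example: The stock Eureka dashboard is read-only, so use Eureka's REST API instead. A DELETE on the instance resource (apps/PAYMENT-SERVICE/<instanceId> under your Eureka base path) deregisters it, and a PUT to the same resource with /status?value=OUT_OF_SERVICE takes it out of rotation. A running instance re-registers on its next heartbeat after a DELETE, so the status override is usually the more durable option.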
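The application-level option can be sketched as a small wrapper placed around a request handler in a test build of payment-service. Everything here is hypothetical: the FaultInjector name, its parameters, and the use of an exception as a stand-in for an HTTP 500 response are assumptions for illustration, not part of any framework.

```java
import java.util.Random;
import java.util.function.Supplier;

/** Hypothetical fault injector for a test build of payment-service; all names are illustrative. */
class FaultInjector {
    private final double errorRate;   // fraction of calls that should fail, from 0.0 to 1.0
    private final long latencyMillis; // artificial delay added before every call
    private final Random random;

    FaultInjector(double errorRate, long latencyMillis, Random random) {
        this.errorRate = errorRate;
        this.latencyMillis = latencyMillis;
        this.random = random;
    }

    /** Delays the handler, then either throws (standing in for an HTTP 500) or runs it. */
    <T> T apply(Supplier<T> handler) {
        try {
            Thread.sleep(latencyMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            throw new IllegalStateException("interrupted during injected delay", e);
        }
        if (random.nextDouble() < errorRate) {
            throw new IllegalStateException("injected failure (stand-in for HTTP 500)");
        }
        return handler.get();
    }

    public static void main(String[] args) {
        // Assumed experiment: 30% of requests fail and every request is delayed by 100 ms.
        FaultInjector injector = new FaultInjector(0.3, 100L, new Random());
        int failures = 0;
        for (int i = 0; i < 20; i++) {
            try {
                injector.apply(() -> "payment processed");
            } catch (IllegalStateException e) {
                failures++;
            }
        }
        System.out.println("Injected failures: " + failures + " of 20");
    }
}
```

Keeping the error rate and delay as parameters makes it easy to start with a mild fault and increase it gradually, and to switch the injection off entirely by passing 0.0 and 0.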
Once user-service starts making calls to payment-service and they begin failing (due to network block, slowdown, or service discovery issue), Hystrix will start tracking these failures.
- Error Tracking: Hystrix monitors the success/failure rate of calls within a rolling window (10 seconds by default). circuitBreaker.requestVolumeThreshold (set to 10 here) means at least 10 requests must occur within that window before Hystrix even considers opening the circuit.
- Circuit Opening: If 50% or more (circuitBreaker.errorThresholdPercentage) of the requests in the window fail, the circuit breaker trips and moves to the "OPEN" state.
- Immediate Fallback: From this point on, any call to processPayment will not go to payment-service. Instead, Hystrix will immediately execute the reliablePayment method. You’ll see output like: Payment service is down. Using fallback. (The same message also appears for each individual failed call before the circuit opens, because the fallback handles those too.)
- Half-Open State: After circuitBreaker.sleepWindowInMilliseconds (5000 ms, or 5 seconds) has passed, the circuit breaker transitions to "HALF-OPEN." It will allow a single request to payment-service.
  - If this request succeeds, the circuit breaker resets and moves back to "CLOSED."
  - If this request fails, the circuit breaker immediately returns to "OPEN," and the sleepWindowInMilliseconds timer restarts.
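The trip decision is simple arithmetic, and a small worked sketch makes the two thresholds concrete. The TripDecision class and shouldTrip method are hypothetical names; the sketch assumes the circuit trips when the error percentage is at or above the configured threshold, which is how the Hystrix configuration documentation describes errorThresholdPercentage.

```java
/** Hypothetical helper that reproduces the trip rule as arithmetic; not Hystrix code. */
class TripDecision {
    static boolean shouldTrip(int totalRequests, int failedRequests,
                              int requestVolumeThreshold, int errorThresholdPercentage) {
        // Too little traffic in the rolling window: never trip, whatever the error rate.
        if (totalRequests == 0 || totalRequests < requestVolumeThreshold) {
            return false;
        }
        int errorPercentage = (int) ((double) failedRequests / totalRequests * 100);
        return errorPercentage >= errorThresholdPercentage;
    }

    public static void main(String[] args) {
        // Thresholds from the example configuration: 10 requests, 50% errors.
        System.out.println(shouldTrip(9, 9, 10, 50));  // false: only 9 requests in the window
        System.out.println(shouldTrip(10, 5, 10, 50)); // true: 50% is at the threshold
        System.out.println(shouldTrip(10, 4, 10, 50)); // false: 40% is below the threshold
    }
}
```

The first case is the one that surprises people during experiments: a service that fails every single call will not trip the breaker if fewer than 10 calls arrive within the window.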
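To verify this, you can expose Hystrix’s metrics stream (e.g., /hystrix.stream) and read it directly or view it in the Hystrix Dashboard. As you inject and then remove the failure, you should see the circuit move through CLOSED -> OPEN -> HALF-OPEN -> CLOSED (or back to OPEN if the trial request fails).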
To clean up the iptables rule:
# On the payment-service host
sudo iptables -D INPUT -p tcp --dport 8080 -j DROP
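The most counterintuitive aspect of circuit breaker testing is that you’re not just testing the breaker itself, but the entire failure path. This includes network resilience, the timeouts configured in both Hystrix and your RestTemplate, and the efficacy of your fallback mechanisms. Hystrix enforces its own command timeout (execution.isolation.thread.timeoutInMilliseconds, 1000 ms by default) and counts timeouts as failures, but the underlying HTTP call may keep running after Hystrix has given up on it. If your HTTP client is configured to wait 60 seconds before timing out, each stalled call can hold a worker thread for that long, so the command's thread pool can be exhausted even though callers have already received the fallback.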
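A minimal sketch of explicit client-side timeouts follows. It uses the JDK's java.net.http.HttpClient as a stand-in for RestTemplate so that it runs without Spring; the TimeoutConfig class, the 500 ms connect timeout, and the 1 second response timeout are assumptions chosen for illustration, and the same idea applies to whichever HTTP client backs your RestTemplate.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Hypothetical helper showing explicit, short HTTP timeouts; the values are assumptions. */
class TimeoutConfig {
    /** Builds a client that gives up on establishing a connection after connectTimeout. */
    static HttpClient newClient(Duration connectTimeout) {
        return HttpClient.newBuilder()
                .connectTimeout(connectTimeout)
                .build();
    }

    /** Builds a POST to payment-service that fails if no response arrives within responseTimeout. */
    static HttpRequest paymentRequest(String jsonBody, Duration responseTimeout) {
        return HttpRequest.newBuilder(URI.create("http://payment-service/process"))
                .timeout(responseTimeout)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        HttpClient client = newClient(Duration.ofMillis(500));
        HttpRequest request = paymentRequest("{\"amount\": 10}", Duration.ofSeconds(1));
        // Nothing is sent here; the sketch only shows where the timeouts are configured.
        System.out.println("connect timeout: " + client.connectTimeout());
        System.out.println("response timeout: " + request.timeout());
    }
}
```

Whatever values you pick, keeping them at or below the circuit breaker's own command timeout means a stalled call releases its thread at roughly the same moment the caller receives the fallback.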
The next logical step after verifying your circuit breakers is to test scenarios where the fallback mechanism itself fails.