Hystrix circuit breakers don’t prevent errors from happening; they prevent cascading failures by failing fast: once a dependency looks unhealthy, calls to it are short-circuited and answered with a fallback instead of being left to wait.
Let’s look at a simple Hystrix command. Imagine you have a service that fetches user data, and it calls an external dependency to get that data.
    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;

    public class UserCommand extends HystrixCommand<User> {
        private final String userId;
        private final UserDependency userDependency;

        public UserCommand(String userId, UserDependency userDependency) {
            // Configure the command group (the fallback is defined by overriding getFallback() below)
            super(HystrixCommandGroupKey.Factory.asKey("UserGroup"));
            this.userId = userId;
            this.userDependency = userDependency;
        }

        @Override
        protected User run() throws Exception {
            // This is the actual call to the dependency
            return userDependency.getUser(userId);
        }

        @Override
        protected User getFallback() {
            // This is what happens when the command fails, times out, or is short-circuited
            System.out.println("Falling back for user: " + userId);
            return new User("fallback_user_" + userId); // Provide a default or cached value
        }
    }
When you execute this command:
    UserDependency dependency = // ... get your dependency instance
    UserCommand command = new UserCommand("user123", dependency);
    User user = command.execute(); // This triggers the circuit breaker logic
The run() method is where your actual logic lives. If userDependency.getUser(userId) throws an exception or exceeds the configured timeout, Hystrix kicks in. First, it records the failure in its metrics and calls getFallback() for that request. If enough requests fail within a rolling window, the circuit breaker "opens." This means subsequent calls to UserCommand skip the run() method entirely and go straight to getFallback(). After a cool-down period (the sleep window), Hystrix lets a single "test" request through to run(). If it succeeds, the circuit breaker "closes" and normal operation resumes; if it fails, the circuit stays open for another sleep window.
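The open/close cycle described above can be sketched as a small state machine. This is a simplified illustration, not Hystrix's implementation: it assumes a hypothetical SimpleBreaker class that trips on consecutive failures rather than on an error percentage over a rolling window, and it takes the current time as a parameter so the behavior is deterministic.

```java
import java.util.function.Supplier;

// Illustrative only: a count-based breaker, NOT Hystrix's rolling-window logic.
// SimpleBreaker and all of its members are hypothetical names.
class SimpleBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures before opening (assumption)
    private final long sleepWindowMs;   // how long to stay open before a trial call
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAtMs = 0;

    SimpleBreaker(int failureThreshold, long sleepWindowMs) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMs = sleepWindowMs;
    }

    State state() {
        return state;
    }

    // nowMs is passed in explicitly so the sketch is easy to reason about and test.
    <T> T call(Supplier<T> run, Supplier<T> fallback, long nowMs) {
        if (state == State.OPEN) {
            if (nowMs - openedAtMs < sleepWindowMs) {
                return fallback.get();   // short-circuit: run is never invoked
            }
            state = State.HALF_OPEN;     // sleep window elapsed: allow one trial call
        }
        try {
            T result = run.get();
            consecutiveFailures = 0;
            state = State.CLOSED;        // any success, including a trial, closes the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;      // trip, or re-open after a failed trial
                openedAtMs = nowMs;
            }
            return fallback.get();       // every failed call still gets the fallback
        }
    }
}
```

Note that the fallback is returned in two different situations: when run fails while the circuit is closed, and when the circuit is open and run is skipped altogether.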
The core problem Hystrix solves is preventing a single slow or failing dependency from bringing down your entire application. Without it, a cascade of requests waiting for a hung service could exhaust thread pools, leading to unavailability for unrelated parts of your system. Hystrix isolates failures.
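Here’s how you configure the circuit breaker behavior. These properties are typically set per command via HystrixCommandProperties, or globally through the hystrix.command.default.* configuration properties (a custom properties strategy can also be registered through HystrixPlugins).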
- execution.isolation.thread.timeoutInMilliseconds: The maximum time a run() method is allowed to execute before being considered a failure. A common value is 1000 milliseconds.
- circuitBreaker.requestVolumeThreshold: The minimum number of requests in a rolling window that Hystrix needs to see before it will even consider opening the circuit. A typical value might be 20.
- circuitBreaker.sleepWindowInMilliseconds: How long the circuit breaker stays open before allowing a single test request. 5000 ms is a common starting point.
- circuitBreaker.errorThresholdPercentage: The percentage of failed requests in the rolling window at or above which the circuit opens, once the request volume threshold has been met. 50 percent is often used.
- fallback.enabled: Whether to execute the getFallback() method on failure. This is usually true.
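Several of these thresholds are evaluated over a rolling window of recent requests (10 seconds by default in Hystrix). To make the idea concrete, here is a rough sketch of rolling-window counting. It assumes a hypothetical RollingWindow class that stores one timestamped entry per request; Hystrix itself aggregates counts into time buckets, so treat this as an illustration of the concept rather than of the library's internals.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: per-request timestamps instead of Hystrix's bucketed counters.
// RollingWindow and its methods are hypothetical names.
class RollingWindow {
    private static final class Event {
        final long atMs;
        final boolean failed;

        Event(long atMs, boolean failed) {
            this.atMs = atMs;
            this.failed = failed;
        }
    }

    private final long windowMs;
    private final Deque<Event> events = new ArrayDeque<>();

    RollingWindow(long windowMs) {
        this.windowMs = windowMs;
    }

    // Record the outcome of one request at time nowMs.
    void record(long nowMs, boolean failed) {
        evict(nowMs);
        events.addLast(new Event(nowMs, failed));
    }

    // Number of requests still inside the window at time nowMs.
    int total(long nowMs) {
        evict(nowMs);
        return events.size();
    }

    // Number of failed requests still inside the window at time nowMs.
    int failures(long nowMs) {
        evict(nowMs);
        int count = 0;
        for (Event e : events) {
            if (e.failed) {
                count++;
            }
        }
        return count;
    }

    // Drop entries that are windowMs old or older.
    private void evict(long nowMs) {
        while (!events.isEmpty() && nowMs - events.peekFirst().atMs >= windowMs) {
            events.removeFirst();
        }
    }
}
```

Old failures age out of the window on their own, which is why a dependency that recovered a while ago no longer counts against the error percentage.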
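When you set circuitBreaker.requestVolumeThreshold to 20 and circuitBreaker.errorThresholdPercentage to 50, the circuit opens once at least 20 requests have been seen in the rolling window and at least 50 percent of them failed (either by exception or timeout). For example, 10 failures out of 20 requests trips the breaker, while 19 failures out of 19 requests does not, because the volume threshold has not been reached yet.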
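The interaction of the two thresholds is easy to get wrong, so here is the arithmetic as a small sketch. TripCheck and shouldOpen are hypothetical names, and the function only mirrors the rule as described here (enough traffic, and an error percentage at or above the threshold); it is not Hystrix code.

```java
// Illustrative only: the trip decision as plain arithmetic.
// TripCheck.shouldOpen is a hypothetical helper, not part of Hystrix.
class TripCheck {
    static boolean shouldOpen(int totalRequests, int failedRequests,
                              int requestVolumeThreshold, int errorThresholdPercentage) {
        // Not enough traffic in the window to make a statistically useful decision.
        if (totalRequests == 0 || totalRequests < requestVolumeThreshold) {
            return false;
        }
        // Integer error percentage, truncated toward zero.
        int errorPercentage = (int) ((double) failedRequests / totalRequests * 100);
        return errorPercentage >= errorThresholdPercentage;
    }
}
```

With a volume threshold of 20 and an error threshold of 50, shouldOpen(20, 10, 20, 50) is true, shouldOpen(20, 9, 20, 50) is false (45 percent), and shouldOpen(19, 19, 20, 50) is false because only 19 requests were seen.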
The surprising thing is how aggressively Hystrix prioritizes immediate fallback over eventual success when a dependency is unhealthy. It’s not about trying to make the failing call eventually work; it’s about accepting that it won’t work right now and gracefully degrading the user experience instead.
Consider this scenario: a user service depends on a profile service. If the profile service starts returning 500 errors, Hystrix on the user service will detect this. It waits until it has seen at least circuitBreaker.requestVolumeThreshold requests in the rolling window. If circuitBreaker.errorThresholdPercentage or more of those failed, the circuit opens. Now, any request to the user service that needs profile data will immediately hit the getFallback() method. The user might see a simplified profile or a "profile unavailable" message, but the user service itself remains responsive for other operations.
This immediate fallback is critical. If Hystrix didn’t open the circuit, the user service’s thread pool would eventually fill up with threads blocked waiting for the failing profile service. Those blocked threads would prevent any new requests from being processed by the user service, leading to a total outage.
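The execution.isolation.strategy setting is also key. By default, it’s THREAD. This means run() executes on a separate thread from a Hystrix-managed pool, so the call can be timed out and the number of threads a single dependency can tie up is bounded by that pool's size. If you set it to SEMAPHORE, run() executes on the calling thread, and a semaphore limits how many calls can run concurrently. THREAD isolation is generally preferred for network calls, because the caller can give up on a slow call when the timeout fires instead of being held for as long as the dependency takes.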
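The difference between the two strategies can be sketched with plain java.util.concurrent primitives. This is a rough analogy under stated assumptions, not how Hystrix is implemented: IsolationSketch, runOnThread, and runWithSemaphore are hypothetical names, and the pool and semaphore are supplied by the caller.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Illustrative only: an analogy for THREAD vs SEMAPHORE isolation, not Hystrix code.
class IsolationSketch {

    // THREAD-style: the work runs on a separate pool; the caller waits at most timeoutMs.
    static <T> T runOnThread(ExecutorService pool, Callable<T> run,
                             Supplier<T> fallback, long timeoutMs) {
        Future<T> future;
        try {
            future = pool.submit(run);
        } catch (RejectedExecutionException e) {
            return fallback.get();               // pool saturated or shut down: reject
        }
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);                 // interrupt the worker; the caller moves on
            return fallback.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the caller's interrupt status
            return fallback.get();
        } catch (ExecutionException e) {
            return fallback.get();               // run threw an exception
        }
    }

    // SEMAPHORE-style: the work runs on the caller's thread; permits cap concurrency.
    static <T> T runWithSemaphore(Semaphore permits, Callable<T> run, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();               // at the concurrency limit: reject immediately
        }
        try {
            return run.call();                   // a slow call holds the caller for its full duration
        } catch (Exception e) {
            return fallback.get();
        } finally {
            permits.release();
        }
    }
}
```

In the thread-based variant the caller is released when the timeout fires, even if the dependency is still hanging. In the semaphore variant nothing can pull the caller away from a slow call, which is why it tends to suit fast, in-memory work better than network calls.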
Once you’ve got your circuit breakers configured and your fallback logic in place, the next thing you’ll likely want to tackle is monitoring and alerting on the state of those breakers.