The most surprising thing about resilience patterns such as Circuit Breaker and Bulkhead is that they appear to do the same thing but actually solve entirely different problems, which is why you need both.
Let’s look at a common scenario: a microservice architecture where ServiceA calls ServiceB.
ServiceA (the client) makes a request to ServiceB (the server).
ServiceA --> ServiceB
If ServiceB is struggling, it might start returning errors, or taking a long time to respond. This is where things get interesting.
The Circuit Breaker Pattern
Imagine ServiceA is calling ServiceB repeatedly. If ServiceB starts failing, ServiceA keeps hammering it with requests. This is like trying to get a response from someone who’s clearly overwhelmed – you’re just making their situation worse, and wasting your own resources.
The Circuit Breaker pattern acts like an electrical circuit breaker. When ServiceB starts failing too often, the circuit breaker in ServiceA "trips." It stops sending requests to ServiceB for a period, instead immediately returning an error to its own callers. This gives ServiceB a chance to recover without being bombarded, and ServiceA avoids wasting time and resources on requests that are guaranteed to fail.
How it works:
- Closed State: Normal operation. Requests flow to ServiceB. A counter tracks failures.
- Open State: If the failure rate exceeds a threshold (e.g., 50% of requests fail in a 1-minute window), the breaker trips. All new requests to ServiceB are immediately rejected with an error.
- Half-Open State: After a timeout (e.g., 30 seconds), the breaker enters a half-open state. It allows a small number of test requests to ServiceB. If these succeed, the breaker resets to Closed. If they fail, it returns to Open.
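The three states above can be sketched as a small state machine. This is a minimal illustration, not how any particular library implements it: the class name SimpleCircuitBreaker, its constructor parameters, and the choice to count consecutive failures (instead of a failure rate over a time window) are assumptions made to keep the example short. The sketch is also not thread-safe, and the current time is passed in explicitly so the behavior is easy to follow.

```java
import java.util.function.Supplier;

/** Minimal, single-threaded circuit breaker sketch (hypothetical class, not production code). */
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures before tripping (simplified policy)
    private final long openTimeoutMillis; // how long to stay Open before allowing a trial call
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAtMillis = 0;

    SimpleCircuitBreaker(int failureThreshold, long openTimeoutMillis) {
        this.failureThreshold = failureThreshold;
        this.openTimeoutMillis = openTimeoutMillis;
    }

    /** Runs remoteCall through the breaker; nowMillis is supplied by the caller. */
    <T> T call(Supplier<T> remoteCall, long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAtMillis < openTimeoutMillis) {
                // Still Open: fail fast without touching ServiceB.
                throw new IllegalStateException("circuit open: failing fast");
            }
            state = State.HALF_OPEN; // timeout elapsed: let a trial request through
        }
        try {
            T result = remoteCall.get();
            // A success in Closed or Half-Open resets the breaker.
            consecutiveFailures = 0;
            state = State.CLOSED;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            // A failed trial, or too many failures while Closed, opens the breaker.
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAtMillis = nowMillis;
            }
            throw e;
        }
    }

    State state() {
        return state;
    }
}
```

With a threshold of 2 and a 1000 ms timeout, two failed calls open the breaker, a call 500 ms later is rejected without reaching ServiceB, and a successful call after the timeout closes it again.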
Example Configuration (Hystrix, a widely used Java library from Netflix that is now in maintenance mode):
HystrixCommandProperties.Setter()
    .withCircuitBreakerEnabled(true)
    .withCircuitBreakerRequestVolumeThreshold(10) // Minimum requests in the rolling window before the breaker can trip
    .withCircuitBreakerSleepWindowInMilliseconds(10000) // Stay Open for 10 seconds before allowing a trial request
    .withCircuitBreakerErrorThresholdPercentage(50); // Trip at a 50% failure rate
Why it works: It prevents cascading failures. By failing fast and giving the downstream service breathing room, it protects the overall system from a complete meltdown.
The Bulkhead Pattern
Now, imagine ServiceA is responsible for multiple distinct operations, say OperationX and OperationY, and both call ServiceB. What if OperationX starts experiencing a surge in load, and its requests to ServiceB are consuming all available resources in ServiceA (like threads or connection pools)? This can starve OperationY, even if ServiceB is perfectly healthy for OperationY’s requests.
The Bulkhead pattern is named after the watertight compartments in a ship. If one compartment floods, it doesn’t sink the whole ship. In software, it means isolating resources used by different parts of your application.
How it works:
You partition ServiceA’s resources based on the calls they make. For example, you might allocate a separate thread pool or connection pool for requests from OperationX to ServiceB, and another for OperationY to ServiceB.
Example Configuration (using separate thread pools):
If ServiceA uses a thread pool executor to manage its outgoing calls to ServiceB:
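import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Thread pool for OperationX calls to ServiceB
ThreadPoolExecutor operationXExecutor = new ThreadPoolExecutor(
    10, // corePoolSize
    10, // maximumPoolSize
    60L, TimeUnit.SECONDS, // keepAliveTime (only affects threads above corePoolSize)
    new LinkedBlockingQueue<Runnable>(100) // workQueue: bounded, so excess work is rejected once it fills
);
// Thread pool for OperationY calls to ServiceB
ThreadPoolExecutor operationYExecutor = new ThreadPoolExecutor(
    10, // corePoolSize
    10, // maximumPoolSize
    60L, TimeUnit.SECONDS, // keepAliveTime (only affects threads above corePoolSize)
    new LinkedBlockingQueue<Runnable>(100) // workQueue: bounded, so excess work is rejected once it fills
);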
Why it works: It prevents a failure or overload in one part of your application (or one type of call) from impacting unrelated parts. If OperationX’s requests to ServiceB become slow and exhaust operationXExecutor’s threads, operationYExecutor’s threads remain available, allowing OperationY to continue processing requests to ServiceB (or other services) without interruption.
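To make the isolation concrete, here is a small sketch that saturates OperationX's pool and then checks whether OperationY is still served. The class name BulkheadDemo, the helper newBulkhead, and the tiny pool sizes (2 threads, 2 queue slots) are assumptions chosen for illustration. A CountDownLatch stands in for a slow ServiceB call.

```java
import java.util.concurrent.*;

/** Hypothetical demo: a saturated OperationX pool does not block OperationY. */
class BulkheadDemo {
    /** Builds a bounded pool shaped like the ones above (sizes are illustrative). */
    static ThreadPoolExecutor newBulkhead(int threads, int queueSize) {
        return new ThreadPoolExecutor(threads, threads, 60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>(queueSize));
    }

    static String run() {
        ThreadPoolExecutor operationXExecutor = newBulkhead(2, 2);
        ThreadPoolExecutor operationYExecutor = newBulkhead(2, 2);
        CountDownLatch release = new CountDownLatch(1); // simulates a slow ServiceB response
        try {
            // Fill X's 2 threads and 2 queue slots with calls that block.
            for (int i = 0; i < 4; i++) {
                operationXExecutor.submit(() -> { release.await(); return "x"; });
            }
            boolean xRejected = false;
            try {
                operationXExecutor.submit(() -> "x-overflow");
            } catch (RejectedExecutionException e) {
                xRejected = true; // X's bulkhead is full, so it sheds the extra call
            }
            // Y has its own pool, so its call completes while X is saturated.
            String y = operationYExecutor.submit(() -> "y-ok").get(2, TimeUnit.SECONDS);
            return "xRejected=" + xRejected + ", y=" + y;
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            throw new IllegalStateException("OperationY was not served", e);
        } finally {
            release.countDown(); // unblock X's tasks so both pools can shut down
            operationXExecutor.shutdown();
            operationYExecutor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "xRejected=true, y=y-ok"
    }
}
```

If both operations shared a single pool of the same size, the call for OperationY would be the one rejected.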
Why You Need Both
A Circuit Breaker protects both the caller and its dependency when the dependency is unhealthy. It stops the client from constantly hammering a failing server and from tying up its own resources on calls that are likely to fail.
A Bulkhead protects different parts of your application from each other when one part is experiencing overload or its dependencies are failing. It ensures that a problem in one functional path doesn’t bring down others.
Consider this:
- If ServiceB is unhealthy, the Circuit Breaker in ServiceA will trip for all calls to ServiceB, regardless of which operation initiated them. This is good.
- However, if ServiceB is perfectly healthy, but OperationX in ServiceA experiences a massive, legitimate surge in traffic, it might exhaust the shared thread pool in ServiceA used for all calls to ServiceB. This would make OperationY also unable to reach ServiceB, even though ServiceB is fine. This is where Bulkhead shines.
The interplay: You might have a circuit breaker on calls to ServiceB. If ServiceB becomes unhealthy, the circuit breaker trips. If you also have bulkheads and configure one breaker per bulkhead, the breaker guarding OperationX’s thread pool can trip on its own, without cutting off OperationY. And if OperationX is just experiencing high load while ServiceB is healthy, no breaker needs to trip at all: the bulkhead (separate thread pool) prevents OperationX’s load from impacting OperationY’s ability to call ServiceB.
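One way to picture the two patterns working together is a per-operation guard that checks a breaker first and a bulkhead second. The sketch below is an assumption-heavy illustration: GuardedCall is a hypothetical class, the bulkhead is a Semaphore instead of a thread pool, the breaker counts consecutive failures, and Half-Open recovery is omitted for brevity. In this sketch a "bulkhead full" rejection is not counted as a ServiceB failure; real libraries make their own choice on that point.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical guard for one operation's calls: breaker check first, then bulkhead. */
class GuardedCall {
    private final Semaphore permits;    // bulkhead: caps concurrent calls for this operation
    private final int failureThreshold; // breaker: consecutive failures before opening
    private int consecutiveFailures = 0;

    GuardedCall(int maxConcurrent, int failureThreshold) {
        this.permits = new Semaphore(maxConcurrent);
        this.failureThreshold = failureThreshold;
    }

    synchronized boolean breakerOpen() {
        return consecutiveFailures >= failureThreshold;
    }

    private synchronized void record(boolean success) {
        consecutiveFailures = success ? 0 : consecutiveFailures + 1;
    }

    <T> T call(Supplier<T> remoteCall) {
        if (breakerOpen()) {
            // The dependency looks unhealthy: fail fast.
            throw new IllegalStateException("breaker open");
        }
        if (!permits.tryAcquire()) {
            // Local overload for this operation; not recorded as a dependency failure here.
            throw new IllegalStateException("bulkhead full");
        }
        try {
            T result = remoteCall.get();
            record(true);
            return result;
        } catch (RuntimeException e) {
            record(false);
            throw e;
        } finally {
            permits.release();
        }
    }
}
```

Giving OperationX and OperationY each their own GuardedCall instance means that neither a full bulkhead nor an open breaker on one path blocks the other.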
This combination is crucial. The Circuit Breaker is the first line of defense against a failing dependency. The Bulkhead is the second line of defense, ensuring that even when a dependency is healthy, the internal resource management within your service is robust enough to handle varied loads and prevent internal contention.
If you fix all your circuit breaker issues and still see intermittent unresponsiveness, resource contention is a likely cause, and that is the problem the Bulkhead pattern is designed to address.