Circuit breakers are your distributed system’s emergency stop button, preventing cascading failures by isolating services that are misbehaving.
Let’s see one in action. Imagine we have a userService that calls an orderService. If orderService starts failing, we don’t want userService to keep hammering it, potentially taking down both services and anything else that depends on them.
```go
// Example using the sony/gobreaker circuit breaker library for Go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"

	"github.com/sony/gobreaker"
)

var breaker *gobreaker.CircuitBreaker

func init() {
	// Configure the circuit breaker
	settings := gobreaker.Settings{
		Name: "orderServiceBreaker",
		// Log every state transition
		OnStateChange: func(name string, from, to gobreaker.State) {
			fmt.Printf("Circuit Breaker '%s' changed from %s to %s\n", name, from, to)
		},
		// When to open the circuit: trip once at least 10 requests
		// have been made and 50% of them failed
		ReadyToTrip: func(counts gobreaker.Counts) bool {
			return counts.Requests >= 10 &&
				float64(counts.TotalFailures)/float64(counts.Requests) >= 0.5
		},
		// How long the circuit stays open before moving to the half-open state
		Timeout: 30 * time.Second,
		// Maximum number of requests allowed through in the half-open state;
		// that many consecutive successes close the circuit
		MaxRequests: 3,
	}
	breaker = gobreaker.NewCircuitBreaker(settings)
}

func callOrderService() (string, error) {
	// Attempt to execute the protected function
	result, err := breaker.Execute(func() (interface{}, error) {
		// This is the actual call to the orderService
		resp, err := http.Get("http://localhost:8081/order")
		if err != nil {
			return nil, fmt.Errorf("failed to connect to order service: %w", err)
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return nil, fmt.Errorf("order service returned status: %d", resp.StatusCode)
		}
		// In a real scenario, you'd read the body
		return "Order processed successfully", nil
	})
	if err != nil {
		// ErrOpenState: the circuit is open.
		// ErrTooManyRequests: the circuit is half-open and already at its MaxRequests limit.
		if errors.Is(err, gobreaker.ErrOpenState) || errors.Is(err, gobreaker.ErrTooManyRequests) {
			return "", fmt.Errorf("request to order service blocked by circuit breaker: %w", err)
		}
		// Otherwise, it's an error from the actual function execution
		return "", fmt.Errorf("order service call failed: %w", err)
	}
	return result.(string), nil
}

func main() {
	// Simulate some calls
	for i := 0; i < 15; i++ {
		_, err := callOrderService()
		if err != nil {
			fmt.Printf("Call %d: %v\n", i, err)
		} else {
			fmt.Printf("Call %d: Success\n", i)
		}
		time.Sleep(1 * time.Second) // Small delay between calls
	}
}
```
The callOrderService function wraps the actual HTTP request to orderService, and the gobreaker library, configured through Settings, monitors every call. When ReadyToTrip returns true (here, once at least 10 requests have been recorded and at least half of them failed), the breaker "opens." Because this example sets no Interval, gobreaker does not periodically reset those counts while the breaker is closed. Once open, calls to callOrderService immediately return an error (gobreaker.ErrOpenState) without attempting to contact orderService. After the Timeout period, the breaker enters a "half-open" state and lets a limited number of test requests through (MaxRequests); requests beyond that limit are rejected with gobreaker.ErrTooManyRequests. If the test requests succeed, the breaker closes; if any of them fails, it opens again.
The core problem circuit breakers solve is failure propagation. In a microservices architecture, a single slow or failing service can starve its callers, which in turn starve their callers, leading to a system-wide outage. Circuit breakers act as a defense mechanism, allowing the failing service to recover and preventing clients from wasting resources on requests that are destined to fail. They give the downstream service breathing room.
Internally, a circuit breaker typically maintains counts of successful and failed requests. It operates in three states:
- Closed: Requests are allowed through to the service. If failures exceed a threshold, the breaker trips to Open.
- Open: Requests are immediately rejected with an error. After a timeout, it transitions to Half-Open.
- Half-Open: A limited number of test requests are allowed through. If successful, the breaker closes; if they fail, it re-opens.
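To make the three states concrete, here is a minimal, standard-library-only sketch of the state machine. It is not how gobreaker or any other library is implemented; the names miniBreaker, errOpen, maxFailures, maxProbes, and the injectable now clock are all hypothetical, and the sketch trips on consecutive failures only and is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// State is one of the three breaker states described above.
type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// errOpen is a hypothetical sentinel for calls rejected while Open.
var errOpen = errors.New("breaker is open")

// miniBreaker is an illustrative breaker, not any library's implementation.
// It is not safe for concurrent use.
type miniBreaker struct {
	state       State
	failures    int              // consecutive failures while Closed
	successes   int              // consecutive successes while HalfOpen
	maxFailures int              // failures in a row that trip the breaker
	maxProbes   int              // successful probes needed to close again
	timeout     time.Duration    // how long to stay Open before probing
	openedAt    time.Time        // when the breaker last tripped
	now         func() time.Time // injectable clock (time.Now in production)
}

// Call runs fn unless the breaker is Open and the timeout has not elapsed.
func (b *miniBreaker) Call(fn func() error) error {
	if b.state == Open {
		if b.now().Sub(b.openedAt) < b.timeout {
			return errOpen // rejected without invoking fn
		}
		b.state, b.successes = HalfOpen, 0 // timeout elapsed: start probing
	}

	err := fn()
	switch {
	case err != nil && b.state == HalfOpen:
		b.trip() // a failed probe re-opens immediately
	case err != nil:
		b.failures++
		if b.failures >= b.maxFailures {
			b.trip()
		}
	case b.state == HalfOpen:
		b.successes++
		if b.successes >= b.maxProbes {
			b.state, b.failures = Closed, 0
		}
	default:
		b.failures = 0 // a success while Closed resets the streak
	}
	return err
}

// trip moves the breaker to Open and records when it happened.
func (b *miniBreaker) trip() {
	b.state, b.failures, b.openedAt = Open, 0, b.now()
}

// demo walks Closed -> Open -> HalfOpen -> Closed using a fake clock.
func demo() (rejected error, final State) {
	clock := time.Unix(0, 0)
	b := &miniBreaker{maxFailures: 2, maxProbes: 1, timeout: 30 * time.Second,
		now: func() time.Time { return clock }}
	fail := func() error { return errors.New("boom") }
	ok := func() error { return nil }

	b.Call(fail)
	b.Call(fail)                        // second failure in a row trips the breaker
	rejected = b.Call(ok)               // still Open: ok is never invoked
	clock = clock.Add(31 * time.Second) // let the timeout elapse
	b.Call(ok)                          // probe succeeds, breaker closes
	return rejected, b.state
}

func main() {
	rejected, final := demo()
	fmt.Println(rejected, final == Closed)
}
```

The fake clock is only there so the walk-through does not have to sleep for 30 seconds; a production breaker would also need locking and a cap on concurrent probes in the half-open state.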
The key levers you control are:
- Failure Threshold (ReadyToTrip): How many failures (or what percentage of failures) trigger the open state. Setting this too low can cause unnecessary outages; too high lets failures pile up before the breaker reacts.
- Timeout (Timeout): How long the circuit stays open before attempting to test recovery. Too short, and you might not give the downstream service enough time to stabilize; too long, and users experience unavailability for longer than necessary.
- Max Requests in Half-Open (MaxRequests): The number of test requests allowed in the half-open state. Too few, and a single fluke success might mask underlying issues; too many, and you send more traffic at a service that may still be struggling, raising the odds of re-opening the circuit.
- Success Threshold in Half-Open: Some implementations also require a certain number of consecutive successes in the half-open state to close the circuit. gobreaker, for instance, closes the circuit after MaxRequests consecutive successes.
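The failure-threshold lever usually comes down to one of two policies: trip after a streak of failures, or trip once the failure ratio crosses a limit. The sketch below expresses both as plain predicates over a hypothetical counts struct; its field names are chosen to resemble gobreaker.Counts, but the struct and the two constructor functions are illustrative assumptions rather than library API.

```go
package main

import "fmt"

// counts is a hypothetical stand-in for the statistics a breaker tracks.
type counts struct {
	Requests            uint32
	TotalFailures       uint32
	ConsecutiveFailures uint32
}

// tripAfterConsecutive trips once n failures occur in a row.
func tripAfterConsecutive(n uint32) func(counts) bool {
	return func(c counts) bool { return c.ConsecutiveFailures >= n }
}

// tripOnFailureRatio trips once at least minRequests have been seen and the
// failure ratio reaches ratio. The minimum guards against tripping on a
// tiny sample, such as 1 failure out of 1 request.
func tripOnFailureRatio(minRequests uint32, ratio float64) func(counts) bool {
	return func(c counts) bool {
		if c.Requests < minRequests {
			return false
		}
		return float64(c.TotalFailures)/float64(c.Requests) >= ratio
	}
}

func main() {
	byRatio := tripOnFailureRatio(10, 0.5)
	byStreak := tripAfterConsecutive(5)

	early := counts{Requests: 4, TotalFailures: 4, ConsecutiveFailures: 4}
	later := counts{Requests: 10, TotalFailures: 5, ConsecutiveFailures: 1}

	// early: too few requests for the ratio rule, streak still too short.
	fmt.Println(byRatio(early), byStreak(early))
	// later: the ratio rule fires, the streak rule does not.
	fmt.Println(byRatio(later), byStreak(later))
}
```

A streak rule reacts quickly to a hard outage, while a ratio rule with a minimum sample size copes better with a dependency that fails intermittently.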
The OnStateChange callback is incredibly useful for observability. Seeing when a circuit opens or closes provides immediate insight into system health and potential issues. It’s not just about preventing damage; it’s about signaling when damage is occurring.
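One way to go beyond printing transitions is to count them, so a dashboard or alert can pick up a flapping circuit. The sketch below is a standard-library-only illustration: the state type is a hypothetical stand-in for a library's state type (gobreaker's callback receives gobreaker.State, so a thin adapter would be needed), and transitionLog and its methods are invented names.

```go
package main

import (
	"fmt"
	"sync"
)

// state is a hypothetical stand-in for a breaker library's state type.
type state string

// transitionLog counts state changes so they can be exported as metrics.
type transitionLog struct {
	mu     sync.Mutex
	counts map[string]int // keyed by "name: from -> to"
}

// key builds the map key for one transition of one named breaker.
func key(name string, from, to state) string {
	return fmt.Sprintf("%s: %s -> %s", name, from, to)
}

// onStateChange has the same shape as the OnStateChange callback above.
func (t *transitionLog) onStateChange(name string, from, to state) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.counts == nil {
		t.counts = make(map[string]int)
	}
	t.counts[key(name, from, to)]++
}

// seen reports how often a given transition has been recorded.
func (t *transitionLog) seen(name string, from, to state) int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.counts[key(name, from, to)]
}

func main() {
	var tl transitionLog
	tl.onStateChange("orderServiceBreaker", "closed", "open")
	tl.onStateChange("orderServiceBreaker", "open", "half-open")
	tl.onStateChange("orderServiceBreaker", "half-open", "open")
	// Repeated half-open -> open transitions hint at a flapping circuit.
	fmt.Println(tl.seen("orderServiceBreaker", "half-open", "open"))
}
```

The mutex is there because state-change callbacks can fire from whichever goroutine happened to make the call that caused the transition.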
The real magic of circuit breakers is that they shift the burden of failure detection and handling from the client’s application logic to a dedicated, configurable component. This makes your overall system more resilient by default, as you can apply this pattern consistently across many inter-service communications without deeply embedding failure handling in every single API call.
When a circuit breaker is in the Open state, it doesn’t just return an error; it actively prevents the underlying operation from being invoked. This is crucial because repeatedly calling a failing service can exacerbate the problem, consuming network resources, CPU, and memory on both the client and server side, potentially leading to a complete system collapse. The breaker’s immediate rejection saves these resources.
A common pitfall is misconfiguring the Timeout. If the timeout is too short, the breaker might transition to Half-Open and allow a few requests through before the downstream service has actually recovered, leading to immediate re-opening and a flapping circuit. Conversely, a very long timeout means users will face unavailability for an extended period even if the downstream service is quick to recover. The ideal timeout is a balance informed by how long the dependency typically takes to recover from transient issues.
The next thing you’ll need to think about is how to handle the errors returned by the circuit breaker itself.
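As a starting point, one possible approach is to separate "the breaker rejected the call" from "the dependency failed" and degrade gracefully in the first case. The sketch below uses local sentinel errors as stand-ins for gobreaker.ErrOpenState and gobreaker.ErrTooManyRequests; orderStatus, rejectedByBreaker, errDownstream, and the fallback message are hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for a breaker library's sentinel errors.
var (
	errOpenState       = errors.New("circuit breaker is open")
	errTooManyRequests = errors.New("too many requests")
	errDownstream      = errors.New("connection refused") // a real failure
)

// rejectedByBreaker reports whether err means the call never reached the
// dependency. errors.Is also matches errors wrapped with %w.
func rejectedByBreaker(err error) bool {
	return errors.Is(err, errOpenState) || errors.Is(err, errTooManyRequests)
}

// orderStatus calls fetch (assumed to be a breaker-wrapped call) and falls
// back to a degraded answer when the breaker blocks the request.
func orderStatus(fetch func() (string, error)) (string, error) {
	status, err := fetch()
	switch {
	case err == nil:
		return status, nil
	case rejectedByBreaker(err):
		// Degrade instead of failing: the dependency was never contacted.
		return "order status temporarily unavailable", nil
	default:
		return "", fmt.Errorf("order service call failed: %w", err)
	}
}

func main() {
	blocked := func() (string, error) {
		return "", fmt.Errorf("request blocked: %w", errOpenState)
	}
	failing := func() (string, error) { return "", errDownstream }

	fmt.Println(orderStatus(blocked)) // fallback text, nil error
	fmt.Println(orderStatus(failing)) // empty status, wrapped error
}
```

Whether a fallback is appropriate depends on the call: a cached or placeholder answer suits read paths, while writes usually need to surface the error to the caller.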