A circuit breaker doesn’t just trip; it decides when to trip by counting specific events, such as failures, over a recent stretch of time.

Let’s watch a circuit breaker in action. Imagine you have a service that calls out to an external API. This external API is a bit flaky. Sometimes it’s slow, sometimes it returns errors, and occasionally it just doesn’t respond at all. We want to prevent our service from hammering this flaky API when it’s clearly having issues, as that would just make our own service slow or unresponsive.

Here’s a simplified Go program using a hypothetical circuit breaker library:

package main

import (
	"errors"
	"fmt"
	"log"
	"math/rand"
	"time"

	// Hypothetical circuit breaker library; replace with the real import path of the one you use.
	"example.com/circuitbreaker"
)

// callExternalAPI simulates a flaky external API.
func callExternalAPI() error {
	// Roughly a 70% chance of success, 20% chance of timeout, 10% chance of error.
	r := rand.Float64()
	switch {
	case r < 0.7:
		fmt.Println("API call successful!")
		return nil // success
	case r < 0.9:
		fmt.Println("API call timed out!")
		return errors.New("timeout") // simulated timeout
	default:
		fmt.Println("API call returned error!")
		return errors.New("internal server error") // simulated server error
	}
}

func main() {
	// Initialize a circuit breaker that trips after 5 failures in a 1-minute window
	// and stays open for 30 seconds.
	cb := circuitbreaker.New(
		circuitbreaker.WithFailureThreshold(5),
		circuitbreaker.WithWindowDuration(1*time.Minute),
		circuitbreaker.WithRethinkWaitDuration(30*time.Second),
	)

	for i := 0; i < 20; i++ {
		// Wrap the API call with the circuit breaker. While the breaker is Open,
		// Execute rejects the call with an error and callExternalAPI is not invoked.
		err := cb.Execute(func() error {
			return callExternalAPI()
		})

		if err != nil {
			log.Printf("Operation failed: %v", err)
		} else {
			log.Println("Operation succeeded.")
		}
		time.Sleep(5 * time.Second) // simulate some time between calls
	}
}

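In this example, cb.Execute does the real work. Every call to callExternalAPI goes through it, so the breaker sees each outcome. When callExternalAPI returns an error (a simulated timeout or server error), the breaker records it as a failure. It keeps a sliding window over recent operations, tracking specifically how many of them failed, and once it has seen too many failures it starts rejecting calls without invoking callExternalAPI at all.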

Let’s break down the mental model. The circuit breaker has three main states:

  1. Closed: This is the normal state. Requests are allowed to pass through to the underlying operation, and the breaker monitors their outcomes. If the number of failures within the configured time window reaches the threshold (some libraries use a failure rate instead of a count), the breaker transitions to the Open state.

  2. Open: In this state, all requests are immediately rejected without executing the underlying operation. This stops the system from continuing to call a failing dependency, protecting both the caller and the dependency. The breaker stays in this state for the configured wait duration (RethinkWaitDuration in the example).

  3. Half-Open: After the wait duration in the Open state has elapsed, the breaker transitions to Half-Open. In this state, it allows a single request through to the operation. If that request succeeds, the breaker transitions back to Closed. If it fails, the breaker immediately returns to Open and the wait period starts over. This lets the system cautiously test whether the underlying dependency has recovered.
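The three states and the transitions between them fit in a small state machine. The sketch below encodes only the transitions described in the list; the `State` and `Event` names are hypothetical, and deciding *when* each event fires (counting failures, timing the wait) is left out here.

```go
package main

import "fmt"

// State is one of the three circuit breaker states.
type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

func (s State) String() string {
	switch s {
	case Closed:
		return "Closed"
	case Open:
		return "Open"
	case HalfOpen:
		return "Half-Open"
	}
	return "unknown"
}

// Event is something that can move the breaker between states.
type Event int

const (
	ThresholdReached Event = iota // Closed: too many failures in the window
	WaitElapsed                   // Open: the wait duration has passed
	ProbeSucceeded                // Half-Open: the trial request succeeded
	ProbeFailed                   // Half-Open: the trial request failed
)

// next returns the state that follows s after event e.
// Events that do not apply to the current state leave it unchanged.
func next(s State, e Event) State {
	switch {
	case s == Closed && e == ThresholdReached:
		return Open
	case s == Open && e == WaitElapsed:
		return HalfOpen
	case s == HalfOpen && e == ProbeSucceeded:
		return Closed
	case s == HalfOpen && e == ProbeFailed:
		return Open // the wait period starts over
	}
	return s
}

func main() {
	s := Closed
	for _, e := range []Event{ThresholdReached, WaitElapsed, ProbeFailed, WaitElapsed, ProbeSucceeded} {
		s = next(s, e)
		fmt.Println(s) // Open, Half-Open, Open, Half-Open, Closed (one per line)
	}
}
```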

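The sliding window is the mechanism the breaker uses to decide when to move from Closed to Open. What matters is not the total number of failures since startup but the number of failures within a recent period. In this configuration the window is bounded by time and the threshold is a failure count; some libraries instead offer a count-based window that looks only at the last N calls.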

Consider the configuration: circuitbreaker.WithFailureThreshold(5) and circuitbreaker.WithWindowDuration(1*time.Minute). This means the breaker trips (goes to Open) as soon as it observes 5 failures within the most recent minute. Suppose failures occur at 0:00, 0:10, 0:20, and 0:30. That is 4 failures, one short of the threshold. If a fifth failure arrives at 0:40, all five fall within a single minute and the breaker opens. If instead the next failure arrives at 1:05, the window now covers 0:05 to 1:05, so the failure at 0:00 has aged out and no longer counts. The count is still 4 (0:10, 0:20, 0:30, 1:05), and the breaker stays Closed. The breaker only ever looks at the most recent events.

The circuitbreaker.WithRethinkWaitDuration(30*time.Second) means that once the breaker trips, it will remain Open for at least 30 seconds before it even considers letting a single request through to test for recovery.

The one thing most people don’t realize is that the window isn’t a fixed calendar segment (like 1:00 PM to 2:00 PM). It’s a rolling window. With a one-hour window, a check at 1:05 PM covers 12:05 PM to 1:05 PM; a check at 1:06 PM covers 12:06 PM to 1:06 PM. This ensures the breaker always evaluates the most recent history of operations.

Once your circuit breaker is successfully allowing traffic through again after a period of being open, you’ll likely want to consider how to handle potential latency spikes during the transition from Half-Open to Closed.
