Circuit breakers and retries both deal with failing calls to a remote service, but they address different failure modes and have fundamentally different impacts on system stability.

Let’s see how this plays out in a real-world scenario. Imagine a microservice, UserService, that needs to fetch user details from another service, ProfileService.

Here’s a simplified Go function for UserService making a request to ProfileService:

package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time" // used by the retry wrapper shown later
)

type UserProfile struct {
	UserID    string `json:"user_id"`
	FirstName string `json:"first_name"`
	LastName  string `json:"last_name"`
}

func getUserProfile(userID string) (*UserProfile, error) {
	url := fmt.Sprintf("http://localhost:8081/users/%s/profile", userID)
	resp, err := http.Get(url)
	if err != nil {
		return nil, fmt.Errorf("failed to connect to ProfileService: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		// io.ReadAll replaces the deprecated ioutil.ReadAll (Go 1.16+).
		bodyBytes, _ := io.ReadAll(resp.Body)
		return nil, fmt.Errorf("ProfileService returned status %d: %s", resp.StatusCode, string(bodyBytes))
	}

	var profile UserProfile
	if err := json.NewDecoder(resp.Body).Decode(&profile); err != nil {
		return nil, fmt.Errorf("failed to decode profile response: %w", err)
	}
	return &profile, nil
}

func main() {
	// Simulate a request
	profile, err := getUserProfile("user123")
	if err != nil {
		fmt.Printf("Error fetching profile: %v\n", err)
	} else {
		fmt.Printf("User Profile: %+v\n", profile)
	}
}

If ProfileService is temporarily unavailable (e.g., a network blip, or it’s restarting), getUserProfile will fail.

Retry is your first instinct. You might wrap the getUserProfile call like this:

func getUserProfileWithRetry(userID string, maxAttempts int, delay time.Duration) (*UserProfile, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		profile, err := getUserProfile(userID)
		if err == nil {
			return profile, nil // Success!
		}
		lastErr = err
		if attempt < maxAttempts {
			// Only wait when another attempt will follow.
			fmt.Printf("Attempt %d failed: %v. Retrying in %s...\n", attempt, err, delay)
			time.Sleep(delay)
		}
	}
	// Wrap the last error so callers can still inspect the root cause.
	return nil, fmt.Errorf("failed after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	profile, err := getUserProfileWithRetry("user123", 3, 2*time.Second) // Up to 3 attempts, 2s apart
	if err != nil {
		fmt.Printf("Error fetching profile: %v\n", err)
	} else {
		fmt.Printf("User Profile: %+v\n", profile)
	}
}

This is great for transient issues that resolve quickly. If ProfileService is down for 3 seconds and you try every 2 seconds, the third attempt, roughly 4 seconds in, will likely succeed. The ProfileService gets a brief moment to recover between your attempts.

However, what if ProfileService is permanently down, or severely overloaded? Your getUserProfileWithRetry function will hammer it with requests. Each retry is a new request, consuming resources on both UserService and ProfileService. If ProfileService is struggling, your retries make it worse, potentially causing a cascading failure where UserService also becomes unresponsive due to the load.

This is where a circuit breaker shines. A circuit breaker does not retry at all; it monitors the outcome of calls to a remote service and stops sending requests once failures cross a threshold.

Imagine UserService now uses a circuit breaker library (like sony/gobreaker in Go).

package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"time"

	"github.com/sony/gobreaker"
)

type UserProfile struct {
	UserID    string `json:"user_id"`
	FirstName string `json:"first_name"`
	LastName  string `json:"last_name"`
}

// One breaker for the ProfileService dependency, shared by every request.
var breaker *gobreaker.CircuitBreaker

// Shared HTTP client with a timeout so a hung request cannot block forever.
var httpClient = &http.Client{Timeout: 5 * time.Second}

func init() {
	// Configure the circuit breaker
	settings := gobreaker.Settings{
		Name: "ProfileService",
		// While the breaker is closed, the counts are cleared every 60 seconds.
		Interval: 60 * time.Second,
		// After tripping, stay open for 30 seconds before moving to half-open.
		Timeout: 30 * time.Second,
		// In the half-open state, let a single probe request through.
		// If it succeeds, the breaker closes. If it fails, the breaker opens again.
		MaxRequests: 1,
		// Trip once at least 3 requests were made in the current interval
		// and 50% or more of them failed.
		ReadyToTrip: func(counts gobreaker.Counts) bool {
			if counts.Requests < 3 {
				return false
			}
			failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
			return failureRatio >= 0.5
		},
		OnStateChange: func(name string, from, to gobreaker.State) {
			fmt.Printf("Circuit breaker %q changed from %s to %s\n", name, from, to)
		},
	}
	breaker = gobreaker.NewCircuitBreaker(settings)
}

func getUserProfileWithCircuitBreaker(userID string) (*UserProfile, error) {
	// Wrap the actual service call with the circuit breaker.
	// Execute returns whatever the wrapped function returns.
	result, err := breaker.Execute(func() (interface{}, error) {
		url := fmt.Sprintf("http://localhost:8081/users/%s/profile", userID)
		resp, err := httpClient.Get(url)
		if err != nil {
			return nil, fmt.Errorf("failed to connect to ProfileService: %w", err)
		}
		defer resp.Body.Close()

		if resp.StatusCode != http.StatusOK {
			bodyBytes, _ := io.ReadAll(resp.Body)
			return nil, fmt.Errorf("ProfileService returned status %d: %s", resp.StatusCode, string(bodyBytes))
		}

		var profile UserProfile
		if err := json.NewDecoder(resp.Body).Decode(&profile); err != nil {
			return nil, fmt.Errorf("failed to decode profile response: %w", err)
		}
		return &profile, nil
	})

	if err != nil {
		// The breaker itself rejected the call: it is open, or half-open and already probing.
		if errors.Is(err, gobreaker.ErrOpenState) || errors.Is(err, gobreaker.ErrTooManyRequests) {
			return nil, fmt.Errorf("circuit breaker rejected the call: %w", err)
		}
		// Otherwise, it's an error from the wrapped function (e.g., ProfileService error)
		return nil, fmt.Errorf("operation failed: %w", err)
	}

	// The wrapped function returned *UserProfile, so the type assertion is safe here.
	return result.(*UserProfile), nil
}

func main() {
	// Simulate a request
	profile, err := getUserProfileWithCircuitBreaker("user123")
	if err != nil {
		fmt.Printf("Error fetching profile: %v\n", err)
	} else {
		fmt.Printf("User Profile: %+v\n", profile)
	}
}

With a circuit breaker, when ProfileService starts failing repeatedly, the breaker will eventually "trip" (open). Once open, it immediately returns an error (like gobreaker.ErrOpenState) without even attempting to call ProfileService. This protects ProfileService from further load and prevents UserService from wasting resources on futile requests. After the configured open-state period (the Timeout setting in gobreaker, e.g., 30 seconds), it enters a "half-open" state, allowing a single request. If that request succeeds, the breaker closes; if it fails, it opens again.

When to use which:

  • Retry: Use when failures are expected to be short-lived and intermittent, and you want to give the downstream service a brief window to recover. Think of network glitches or very brief service restarts. The key is that the downstream service is not overloaded.
  • Circuit Breaker: Use when a service might be experiencing prolonged outages, severe performance degradation, or when you want to prevent cascading failures. It’s a safety net that stops you from overwhelming a struggling dependency. It also provides immediate feedback (via the ErrOpenState) that the dependency is down, rather than waiting for all retries to exhaust.

Crucially, you can often combine them. A common pattern is to use a circuit breaker for the outer layer of protection, and if the circuit breaker is closed (allowing requests), you might then apply retries within the circuit breaker’s allowed calls for truly transient, quick-fixable issues. The circuit breaker prevents the retry storm, and the retry handles minor hiccups when the service is otherwise healthy.

The most surprising thing about circuit breakers is that their primary purpose isn’t to fix the downstream service, but to protect the caller and the broader system from the impact of the downstream service’s failure.

What this buys you is graceful degradation. Instead of a failing ProfileService bringing down UserService and potentially other services that depend on UserService, the circuit breaker allows UserService to respond quickly (even if with an error) and release resources.

Internally, a circuit breaker maintains a state machine: Closed, Open, and Half-Open. In the Closed state, requests flow through. In the Open state, requests are immediately rejected. In the Half-Open state, a limited number of requests are allowed through to test if the downstream service has recovered. The transitions between these states are governed by rules based on success/failure counts and time windows.

In gobreaker, the exact levers you control are the ReadyToTrip function (defining when to open the circuit based on failure rates and request counts), the Timeout setting (how long to stay open before trying again), the Interval setting (how often the counts are cleared while the breaker is closed), and the timeout on the HTTP client itself (preventing individual requests from hanging indefinitely).

What most people don’t realize is that each gobreaker.CircuitBreaker in the sony/gobreaker library tracks its own counts and state, and how many breakers you create is up to you. Create one breaker per downstream service (or per type of operation) and share it across requests, for example in a package-level variable or a map keyed by service name. Keying breakers by something like a user ID defeats the purpose, because each breaker then sees too few calls to ever trip. One breaker per dependency gives you fine-grained control and isolation of failures.

The next concept you’ll likely grapple with is how to handle the errors returned by a circuit breaker more intelligently, perhaps by falling back to cached data or a default response.
