A surprising number of systems that appear to be failing are actually just overloaded, and the most effective way to handle this is to deliberately slow down.

Let’s see this in action with a simple scenario: a service that processes incoming requests, but sometimes fails if it gets too many too fast.

Imagine we have a request_processor service and an api_gateway. The api_gateway receives requests and forwards them to request_processor.

// api_gateway.go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

func main() {
	http.HandleFunc("/process", handler)
	fmt.Println("Starting API Gateway on :8080")
	http.ListenAndServe(":8080", nil)
}

func handler(w http.ResponseWriter, r *http.Request) {
	// Simulate network latency and potential failure
	delay := time.Duration(rand.Intn(500)) * time.Millisecond
	time.Sleep(delay)

	// Simulate a fixed failure rate (a real overloaded backend would fail more often as load rises)
	if rand.Intn(100) < 20 { // 20% chance of failure
		http.Error(w, "Internal Server Error", http.StatusInternalServerError)
		return
	}

	fmt.Fprintf(w, "Request processed successfully!")
}

Now, if we hit /process repeatedly, we’ll see some successes and some 500 Internal Server Error responses. In this simulation the failures are random, but they stand in for what happens when the request_processor gets overwhelmed.

The problem is what usually happens next. A caller that sees a failure, whether it is the api_gateway or a client behind it, often tries again immediately. If the request_processor is failing because it’s overloaded, hammering it with retries only makes the overload worse. This is commonly called a retry storm, and it is closely related to the thundering herd problem.

This is where retries with exponential backoff and circuit breakers come in.

Retries with Exponential Backoff

Instead of retrying immediately, we wait. The first retry happens after a short delay (e.g., 100ms). If that fails, the next retry waits longer (e.g., 200ms), then 400ms, 800ms, and so on. This "exponential" increase in delay gives the overloaded service time to recover.

Here’s how we can modify the api_gateway to implement this. We’ll use a simple retry mechanism.

// api_gateway_with_retry.go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

const maxRetries = 3
const initialBackoff = 100 * time.Millisecond
const backoffMultiplier = 2

func main() {
	http.HandleFunc("/process", handler)
	fmt.Println("Starting API Gateway with Retry on :8080")
	http.ListenAndServe(":8080", nil)
}

func handler(w http.ResponseWriter, r *http.Request) {
	var err error
	backoff := initialBackoff

	for i := 0; i <= maxRetries; i++ {
		// Simulate network latency and potential failure
		delay := time.Duration(rand.Intn(500)) * time.Millisecond
		time.Sleep(delay)

		// Simulate a fixed failure rate (a real overloaded backend would fail more often as load rises)
		if rand.Intn(100) < 20 { // 20% chance of failure
			err = fmt.Errorf("simulated internal server error")
		} else {
			fmt.Fprintf(w, "Request processed successfully!")
			return // Success!
		}

		// If we're here, it failed. Back off before the next attempt, if any remain.
		if i < maxRetries {
			fmt.Printf("Attempt %d failed: %v. Retrying in %s...\n", i+1, err, backoff)
			time.Sleep(backoff)
			backoff *= backoffMultiplier
		} else {
			fmt.Printf("Attempt %d failed: %v. No retries left.\n", i+1, err)
		}
	}

	// If all retries failed
	http.Error(w, fmt.Sprintf("Failed after %d retries: %v", maxRetries, err), http.StatusInternalServerError)
}

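Now, when the request_processor is struggling, the api_gateway will pause between attempts, giving the backend a chance to catch up. In this simulation each attempt fails independently with probability 0.2, so a request only fails outright when all four attempts fail, which happens 0.2^4 = 0.16% of the time. Real overload failures are not independent like this, which is exactly why the growing pauses matter: they avoid sustaining the overload.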

Circuit Breakers

Retries are great, but what if the service is permanently down, or will be for a long time? Retrying indefinitely, even with backoff, is still a waste of resources and can delay other, potentially successful, requests. This is where circuit breakers shine.

A circuit breaker acts like an electrical circuit breaker. It monitors calls to a remote service. If too many calls fail within a given period, it "opens" the circuit, and subsequent calls are immediately rejected without even attempting to contact the service. This prevents the client from wasting time and resources on a failing service and gives the failing service a break. After a timeout period, the breaker "half-opens" and allows a few test calls. If those succeed, it "closes" the circuit again; if they fail, it re-opens.

Let’s add a circuit breaker. We’ll use a popular Go library, sony/gobreaker, for this.

First, install it: go get github.com/sony/gobreaker

// api_gateway_with_circuitbreaker.go
package main

import (
	"fmt"
	"log"
	"math/rand"
	"net/http"
	"time"

	"github.com/sony/gobreaker"
)

const maxRetries = 3
const initialBackoff = 100 * time.Millisecond
const backoffMultiplier = 2

// Configure the circuit breaker
var breaker *gobreaker.CircuitBreaker

func init() {
	settings := gobreaker.Settings{
		Name:        "request_processor_breaker",
		MaxRequests: 3,                // Max requests allowed through while half-open
		Interval:    5 * time.Second,  // While closed, the counts are cleared every Interval
		Timeout:     10 * time.Second, // How long the breaker stays open before going half-open
		ReadyToTrip: func(counts gobreaker.Counts) bool {
			// Trip once at least 3 requests were seen in the current window and 50% or more failed
			if counts.Requests < 3 {
				return false
			}
			failureRate := float64(counts.TotalFailures) / float64(counts.Requests)
			return failureRate >= 0.5
		},
		OnStateChange: func(name string, from, to gobreaker.State) {
			log.Printf("Circuit Breaker '%s' changed from %s to %s\n", name, from, to)
		},
	}
	breaker = gobreaker.NewCircuitBreaker(settings)
}

func main() {
	http.HandleFunc("/process", handler)
	fmt.Println("Starting API Gateway with Retry and Circuit Breaker on :8080")
	http.ListenAndServe(":8080", nil)
}

func handler(w http.ResponseWriter, r *http.Request) {
	// Use the circuit breaker
	_, err := breaker.Execute(func() (interface{}, error) {
		// This function is what the circuit breaker wraps.
		// It contains the actual call to the "remote" service.

		var lastErr error
		backoff := initialBackoff
		for i := 0; i <= maxRetries; i++ {
			// Simulate network latency
			delay := time.Duration(rand.Intn(500)) * time.Millisecond
			time.Sleep(delay)

			// Simulate a fixed 20% failure rate
			if rand.Intn(100) >= 20 {
				// Success! Return a success message and a nil error.
				return "Request processed successfully!", nil
			}

			// This attempt failed: back off and loop again if retries remain.
			lastErr = fmt.Errorf("simulated internal server error")
			if i < maxRetries {
				fmt.Printf("Attempt %d failed: %v. Retrying in %s...\n", i+1, lastErr, backoff)
				time.Sleep(backoff)
				backoff *= backoffMultiplier
			}
		}
		// Every attempt failed: report a single failure to the circuit breaker.
		return nil, fmt.Errorf("all %d attempts failed: %w", maxRetries+1, lastErr)
	})

	if err != nil {
		// Check if the error is from the circuit breaker itself: it is open,
		// or it is half-open and already letting MaxRequests probes through.
		if err == gobreaker.ErrOpenState || err == gobreaker.ErrTooManyRequests {
			http.Error(w, fmt.Sprintf("Service unavailable: %v", err), http.StatusServiceUnavailable)
		} else {
			// This is an error returned from the wrapped function after retries
			http.Error(w, fmt.Sprintf("Processing failed: %v", err), http.StatusInternalServerError)
		}
		return
	}

	// If we reach here, the breaker.Execute call succeeded.
	fmt.Fprint(w, "Request processed successfully!")
}

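The breaker.Execute function runs the provided anonymous function and records whether it returned an error. Because the retry loop runs inside Execute, the breaker counts one outcome per incoming request, not one per attempt. Once ReadyToTrip reports that too many calls have failed, the breaker opens, and breaker.Execute immediately returns gobreaker.ErrOpenState without running the anonymous function at all. This prevents us from even attempting to call a potentially dead service. In the half-open state, calls beyond MaxRequests are rejected with gobreaker.ErrTooManyRequests.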

The combination of retries with exponential backoff and circuit breakers provides a robust strategy for handling transient and persistent failures in distributed systems. They allow your services to be resilient to temporary network glitches or overloaded backends, while also protecting against repeatedly calling services that are truly down.

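The ReadyToTrip function in the circuit breaker settings is crucial; it defines when the breaker decides to open. A common mistake is to set its failure-rate threshold or its minimum request count too low, causing the breaker to trip on minor, temporary spikes. MaxRequests and Interval do not decide when the breaker trips: MaxRequests limits probe traffic while half-open, and Interval controls how often the counts are cleared while closed.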
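To see how the threshold and the minimum request count interact, here is a small sketch of the kind of check a ReadyToTrip callback performs. shouldTrip is a hypothetical helper that takes plain integers instead of the library's counts, so it runs without any dependency.

```go
package main

import "fmt"

// shouldTrip reports whether the failure rate has reached threshold, but only
// once at least minRequests requests have been observed. It is an illustrative
// stand-in for a ReadyToTrip callback, not part of any library.
func shouldTrip(requests, failures, minRequests uint32, threshold float64) bool {
	if requests == 0 || requests < minRequests {
		return false
	}
	return float64(failures)/float64(requests) >= threshold
}

func main() {
	fmt.Println(shouldTrip(2, 2, 3, 0.5))    // false: too few requests to judge
	fmt.Println(shouldTrip(4, 2, 3, 0.5))    // true: 50% of 4 requests failed
	fmt.Println(shouldTrip(100, 10, 3, 0.5)) // false: only 10% failed
	fmt.Println(shouldTrip(2, 2, 1, 0.5))    // true: a minimum of 1 trips on a tiny sample
}
```

The last case shows the aggressive-tripping failure mode: with a minimum of one request, two unlucky calls are enough to open the circuit.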

The next challenge is managing these patterns across many services and ensuring consistent configuration.
