gRPC does not ship with a circuit breaker. If calling code keeps retrying while a server is down, one outage can turn into cascading failures.

Let’s see this in action. Imagine a user_service that calls a profile_service. If profile_service becomes unavailable, the user_service client will hammer it with requests, eventually exhausting its own resources or overwhelming the network.

Here’s a typical user_service gRPC client setup in Go, without any circuit breaker:

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/status"

	pb "your_module/profile_service" // Assuming this is your profile service proto
)

func main() {
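	// Note: grpc.WithInsecure is deprecated in recent grpc-go releases in favor of
	// grpc.WithTransportCredentials(insecure.NewCredentials()).
	conn, err := grpc.Dial("localhost:50051", grpc.WithInsecure())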
	if err != nil {
		log.Fatalf("did not connect: %v", err)
	}
	defer conn.Close()
	c := pb.NewProfileServiceClient(conn)

	for {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		_, err := c.GetProfile(ctx, &pb.ProfileRequest{UserId: "user123"})
		if err != nil {
			st, ok := status.FromError(err)
			if ok {
				// Log the error, but the loop continues
				log.Printf("Error getting profile: %v (Code: %s)", err, st.Code())
			} else {
				log.Printf("Non-gRPC error: %v", err)
			}
		} else {
			log.Println("Successfully got profile")
		}
		cancel()
		time.Sleep(1 * time.Second) // Wait a bit before retrying
	}
}

When profile_service at localhost:50051 is down, this loop logs an error on every iteration and keeps going. In a real user_service handling thousands of concurrent requests, every call to the dead service can hold a goroutine for up to the 2-second deadline, and the resulting pile-up of goroutines and sockets can bring down user_service itself.

A circuit breaker addresses exactly this: uncontrolled retries and resource exhaustion in the face of unreliable downstream services. It acts as a protective layer around your gRPC client calls and monitors the successes and failures of requests to a specific service. When failures exceed a configured threshold, it "opens" the circuit and rejects further requests to the failing service for a configured period. After this timeout, it enters a "half-open" state and lets a limited number of test requests through. If these succeed, the circuit closes again; if one fails, it re-opens.

Here’s how you’d integrate the popular sony/gobreaker library (though go-kit/kit/circuitbreaker or hystrix-go are also common choices) to add a circuit breaker to our user_service client:

First, install the library: go get github.com/sony/gobreaker

Now, modify the client code:

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"

	pb "your_module/profile_service"

	"github.com/sony/gobreaker"
)

func main() {
	// --- Circuit Breaker Setup ---
	// Configure the circuit breaker
	settings := gobreaker.Settings{
		Name: "ProfileService",
		// In the half-open state, the maximum number of requests allowed through.
		// The circuit closes again after this many consecutive successes.
		MaxRequests: 5,
		// How long the circuit stays open before it moves to half-open.
		Timeout: 1 * time.Minute,
		// Called after each failure in the closed state; returning true opens the circuit.
		// Interval is left at zero, so counts are not cleared periodically while closed.
		ReadyToTrip: func(counts gobreaker.Counts) bool {
			// Trip once at least 3 requests have been seen and 70% or more of them failed.
			failurePercent := float64(counts.TotalFailures) / float64(counts.Requests) * 100
			return counts.Requests >= 3 && failurePercent >= 70
		},
		// Custom error handler to determine if an error should be counted as a failure.
		IsSuccessful: func(err error) bool {
			if err == nil {
				return true // Success
			}
			st, ok := status.FromError(err)
			if !ok {
				return false // Non-gRPC error, treat as failure
			}
			// Count cancellations, timeouts, and server-side failures against the breaker.
			// You might want to adjust this based on your service's error semantics.
			switch st.Code() {
			case codes.Canceled, codes.DeadlineExceeded, codes.Unavailable, codes.Internal, codes.Unknown:
				return false
			default:
				return true // Other gRPC errors are considered successes (e.g., NotFound)
			}
		},
	}
	cb := gobreaker.NewCircuitBreaker(settings)

	// --- gRPC Client Setup ---
	conn, err := grpc.Dial("localhost:50051", grpc.WithInsecure())
	if err != nil {
		log.Fatalf("did not connect: %v", err)
	}
	defer conn.Close()
	c := pb.NewProfileServiceClient(conn)

	// --- Request Loop with Circuit Breaker ---
	for {
		// Wrap the gRPC call with the circuit breaker
		result, err := cb.Execute(func() (interface{}, error) {
			ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
			defer cancel()
			resp, err := c.GetProfile(ctx, &pb.ProfileRequest{UserId: "user123"})
			if err != nil {
				return nil, err // Return the error to the circuit breaker
			}
			return resp, nil
		})

		if err != nil {
			// Check if the error is from the circuit breaker itself: ErrOpenState while
			// open, ErrTooManyRequests when the half-open state is already at capacity.
			if err == gobreaker.ErrOpenState || err == gobreaker.ErrTooManyRequests {
				log.Printf("Circuit breaker error: %v", err)
			} else {
				// This is the original error from the gRPC call, already classified by IsSuccessful
				log.Printf("gRPC call failed: %v", err)
			}
		} else {
			// Success! result is the protobuf response
			log.Printf("Successfully got profile: %+v", result.(*pb.ProfileResponse))
		}

		time.Sleep(1 * time.Second)
	}
}

cb.Execute() takes a function that performs the actual work and runs it under the breaker's control. If the circuit is open, cb.Execute() returns gobreaker.ErrOpenState immediately, without calling the function. If the circuit is closed and the function returns an error that IsSuccessful classifies as a failure, the breaker updates its counts and then calls ReadyToTrip; when that returns true (here, at least 3 requests with a failure rate of 70% or more), the circuit opens. While open, cb.Execute() keeps returning gobreaker.ErrOpenState for the Timeout duration. After Timeout the circuit becomes half-open and lets up to MaxRequests calls through (5 here), rejecting any beyond that with gobreaker.ErrTooManyRequests. If MaxRequests consecutive calls succeed, the circuit closes; a single failure re-opens it.
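The trip rule is plain arithmetic, so it can be checked in isolation. The sketch below uses a local counts struct that mirrors only the two gobreaker.Counts fields the rule reads (Requests and TotalFailures); the struct, the readyToTrip helper, and its parameters are stand-ins introduced for this example so that it runs without external dependencies.

```go
package main

import "fmt"

// counts is a local stand-in mirroring the two gobreaker.Counts fields
// that the trip rule reads. It is not the library type.
type counts struct {
	Requests      uint32
	TotalFailures uint32
}

// readyToTrip reports whether at least minRequests calls have been observed
// and the failure percentage is at or above thresholdPercent.
func readyToTrip(c counts, minRequests uint32, thresholdPercent float64) bool {
	if c.Requests == 0 || c.Requests < minRequests {
		return false // too few samples to judge; also avoids dividing by zero
	}
	failurePercent := float64(c.TotalFailures) / float64(c.Requests) * 100
	return failurePercent >= thresholdPercent
}

func main() {
	fmt.Println(readyToTrip(counts{Requests: 2, TotalFailures: 2}, 3, 70))  // false: only 2 samples
	fmt.Println(readyToTrip(counts{Requests: 3, TotalFailures: 2}, 3, 70))  // false: about 66.7%
	fmt.Println(readyToTrip(counts{Requests: 3, TotalFailures: 3}, 3, 70))  // true: 100%
	fmt.Println(readyToTrip(counts{Requests: 10, TotalFailures: 8}, 3, 70)) // true: 80%
}
```

Checking the minimum sample size before dividing keeps the rule well defined even if it is ever evaluated with zero recorded requests.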

This setup prevents the user_service from continuously hammering a dead profile_service. Instead, after a few failed attempts, it will rapidly return gobreaker.ErrOpenState to its callers, allowing higher-level error handling to kick in (e.g., returning a cached profile, a default profile, or a specific error code to the end-user).
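One way the higher-level handling could look is sketched below, using only the standard library. Everything named here is hypothetical: errCircuitOpen stands in for gobreaker.ErrOpenState, errUnavailable stands in for a real downstream error, and Profile, profileCache, and getProfile are simplified placeholders rather than types from the generated pb package.

```go
package main

import (
	"errors"
	"fmt"
)

// errCircuitOpen is a hypothetical stand-in for gobreaker.ErrOpenState.
var errCircuitOpen = errors.New("circuit breaker is open")

// errUnavailable is a hypothetical error from the downstream service itself.
var errUnavailable = errors.New("profile service unavailable")

// Profile is a hypothetical, simplified profile record.
type Profile struct {
	UserID string
	Name   string
	Stale  bool // true when the value came from the fallback path
}

// profileCache is a hypothetical in-memory cache of last known profiles.
type profileCache map[string]Profile

// getProfile calls fetch. When the breaker rejects the call, it serves the
// last cached profile if one exists, otherwise a default placeholder profile.
func getProfile(userID string, cache profileCache, fetch func(string) (Profile, error)) (Profile, error) {
	p, err := fetch(userID)
	if err == nil {
		cache[userID] = p // remember the last good answer
		return p, nil
	}
	if !errors.Is(err, errCircuitOpen) {
		return Profile{}, err // a real downstream error: surface it to the caller
	}
	if cached, ok := cache[userID]; ok {
		cached.Stale = true
		return cached, nil
	}
	return Profile{UserID: userID, Name: "Unknown", Stale: true}, nil
}

func main() {
	cache := profileCache{}
	healthy := func(id string) (Profile, error) { return Profile{UserID: id, Name: "Ada"}, nil }
	rejected := func(string) (Profile, error) { return Profile{}, errCircuitOpen }

	p, _ := getProfile("user123", cache, healthy)
	fmt.Println(p.Name, p.Stale) // Ada false

	p, _ = getProfile("user123", cache, rejected)
	fmt.Println(p.Name, p.Stale) // Ada true
}
```

Marking fallback results as stale lets callers decide whether slightly old data is acceptable for their use case.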

The IsSuccessful function is crucial. It tells the circuit breaker which errors indicate a real problem that should count towards tripping the breaker. For gRPC, you might want to ignore certain transient errors or specific business logic errors (like NotFound) that don’t necessarily mean the service is down.

The next thing you’ll likely encounter is needing to manage the state of these circuit breakers across multiple instances of your user_service, potentially using a distributed store like Redis.
