Event-driven architectures are often promoted as resilient, yet they can be surprisingly fragile when downstream services falter. Circuit breakers are a vital pattern for preventing cascading failures, but applying them in event-driven systems takes more than the familiar synchronous request/response recipe.

Let’s see how this plays out with a concrete example. Imagine a UserSignup event that triggers several downstream processes: sending a welcome email, creating a user profile, and adding the user to a marketing list.

{
  "eventType": "UserSignup",
  "timestamp": "2023-10-27T10:00:00Z",
  "payload": {
    "userId": "user123",
    "email": "test@example.com",
    "name": "Jane Doe"
  }
}

If the MarketingService, which adds users to the marketing list, becomes overloaded or suffers an outage, then without a circuit breaker the EventBus may keep retrying delivery of the UserSignup event. Those retries consume resources on both the EventBus and the MarketingService, and can degrade other event consumers and the system as a whole.

Here’s where circuit breakers come in. We can wrap the consumption of events by the MarketingService with a circuit breaker.

Circuit Breaker Mechanics

A circuit breaker has three states:

  1. Closed: Operations are allowed. If failures exceed a threshold within a time window, the breaker "trips" to the Open state.
  2. Open: Operations are immediately rejected without execution. After a timeout, it transitions to Half-Open.
  3. Half-Open: A limited number of operations are allowed. If they succeed, the breaker closes; otherwise, it returns to Open.
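
To make the three states concrete, here is a minimal sketch of the state machine in plain Java. It is illustrative only: the class name SimpleCircuitBreaker is hypothetical, it trips on consecutive failures rather than a failure rate over a time window, it permits a single trial call in Half-Open, and it is not thread-safe. The current time is passed in explicitly so the transitions are easy to follow.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Minimal single-threaded circuit breaker sketch (hypothetical, for illustration). */
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;  // consecutive failures before tripping
    private final Duration openTimeout;  // how long to stay OPEN before a trial call
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private Instant openedAt;

    SimpleCircuitBreaker(int failureThreshold, Duration openTimeout) {
        this.failureThreshold = failureThreshold;
        this.openTimeout = openTimeout;
    }

    State state() {
        return state;
    }

    /** Runs the operation if the breaker permits it; 'now' is supplied by the caller. */
    <T> T call(Supplier<T> operation, Instant now) {
        if (state == State.OPEN) {
            if (Duration.between(openedAt, now).compareTo(openTimeout) >= 0) {
                state = State.HALF_OPEN; // timeout elapsed: allow one trial call
            } else {
                // Rejected without executing the operation.
                throw new IllegalStateException("circuit open: call rejected");
            }
        }
        try {
            T result = operation.get();
            // Success closes the breaker (from HALF_OPEN) and resets the failure count.
            state = State.CLOSED;
            consecutiveFailures = 0;
            return result;
        } catch (RuntimeException ex) {
            consecutiveFailures++;
            // A failed trial call, or too many failures while CLOSED, opens the breaker.
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAt = now;
            }
            throw ex;
        }
    }
}
```

Production libraries add thread safety, failure-rate windows, and metrics on top of this basic shape.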

Applying Circuit Breakers in Event Consumption

Consider a Kafka consumer group for the MarketingService. We’ll use Resilience4j, a Java fault-tolerance library, for illustration, but the principles apply to any event-driven framework.

Configuration:

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.TimeoutException;
// ServiceUnavailableException is assumed to be an application-defined exception.

// Configuration for the MarketingService's event consumer circuit breaker
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50) // Trip when at least 50% of recorded calls fail
    .waitDurationInOpenState(Duration.ofSeconds(30)) // Stay open for 30 seconds
    .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.TIME_BASED)
    .slidingWindowSize(60) // Look at calls over the last 60 seconds
    .recordExceptions(
        IOException.class, // Network issues, etc.
        TimeoutException.class, // Service took too long
        ServiceUnavailableException.class // Explicitly marked as unavailable
    )
    .build();

CircuitBreaker circuitBreaker = CircuitBreaker.of("marketingServiceConsumer", config);

Consumer Logic:

// Requires: import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
public void consumeUserSignupEvent(Event event) throws Exception {
    String userId = event.getPayload().getUserId();
    try {
        // Run the downstream call through the circuit breaker. If the breaker
        // is open, executeCallable throws CallNotPermittedException immediately
        // without invoking the marketing list API.
        circuitBreaker.executeCallable(() -> {
            marketingListApi.addUser(userId);
            return null;
        });
        log.info("User {} added to marketing list.", userId);
    } catch (CallNotPermittedException e) {
        log.warn("Circuit breaker for marketingServiceConsumer is OPEN. Skipping event for user {}. Reason: {}",
                 userId, e.getMessage());
        // Important: do NOT acknowledge the message (commit its offset) here.
        // Let the consumer's retry handling (or a dead-letter queue strategy) deal with it.
        // If we acknowledge, we lose the event.
        throw new RuntimeException("Circuit breaker open, event not processed", e);
    } catch (Exception ex) {
        // The breaker has already recorded this failure if it matches recordExceptions.
        log.error("Failed to add user {} to marketing list: {}", userId, ex.getMessage());
        // Re-throw so the caller does not acknowledge the message.
        throw ex;
    }
}

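When the MarketingService consumer receives a UserSignup event, it attempts to call marketingListApi.addUser(). If this call fails with one of the exceptions listed in recordExceptions (for example, a network error or timeout; a 5xx response counts only if the client surfaces it as one of those exception types), the circuitBreaker records the failure. Once the failure rate over the 60-second sliding window reaches the failureRateThreshold (50%), the circuitBreaker transitions to the OPEN state. Note that Resilience4j only computes the failure rate after minimumNumberOfCalls have been recorded in the window (100 by default), so a low-traffic consumer may need that value lowered before the breaker can trip.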
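
The arithmetic behind that decision can be sketched with a small time-based window. This is an illustration of the idea rather than Resilience4j's actual implementation; FailureRateWindow and its parameters are hypothetical names, and timestamps are plain seconds to keep the example simple.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative time-based failure-rate window (hypothetical, not Resilience4j's code). */
class FailureRateWindow {
    private static final class Outcome {
        final long second;
        final boolean failed;

        Outcome(long second, boolean failed) {
            this.second = second;
            this.failed = failed;
        }
    }

    private final Deque<Outcome> outcomes = new ArrayDeque<>();
    private final long windowSeconds;      // e.g. 60
    private final int minimumCalls;        // calls needed before a rate is computed
    private final double thresholdPercent; // e.g. 50.0

    FailureRateWindow(long windowSeconds, int minimumCalls, double thresholdPercent) {
        this.windowSeconds = windowSeconds;
        this.minimumCalls = minimumCalls;
        this.thresholdPercent = thresholdPercent;
    }

    /** Records one call outcome; callers are expected to record in time order. */
    void record(long second, boolean failed) {
        outcomes.addLast(new Outcome(second, failed));
    }

    /** Failure rate in percent for the window ending at nowSecond, or -1 if too few calls. */
    double failureRate(long nowSecond) {
        evictOlderThanWindow(nowSecond);
        if (outcomes.size() < minimumCalls) {
            return -1.0;
        }
        long failures = outcomes.stream().filter(o -> o.failed).count();
        return 100.0 * failures / outcomes.size();
    }

    /** True when enough calls were recorded and the rate reaches the threshold. */
    boolean shouldTrip(long nowSecond) {
        double rate = failureRate(nowSecond);
        return rate >= 0 && rate >= thresholdPercent;
    }

    private void evictOlderThanWindow(long nowSecond) {
        while (!outcomes.isEmpty() && outcomes.peekFirst().second <= nowSecond - windowSeconds) {
            outcomes.removeFirst();
        }
    }
}
```

With a 60-second window, a minimum of 4 calls, and a 50% threshold, two failures out of four calls are enough to trip, while three calls with two failures are not, because the minimum has not been reached yet.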

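Once OPEN, any subsequent UserSignup event reaching this consumer immediately triggers a CallNotPermittedException. The consumeUserSignupEvent method catches this, logs a warning, and, crucially, does not let the message be acknowledged (in Kafka terms, its offset is not committed). Keep in mind that Kafka does not redeliver a record on its own while the consumer stays in the group: the consumer has to seek back to the uncommitted offset (or pause the partition and retry later), otherwise the record is only re-read after a restart or rebalance. Holding the record back this way gives the MarketingService time to recover. After waitDurationInOpenState (30 seconds), the breaker moves to HALF-OPEN, allowing a limited number of trial calls to test recovery.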
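
The offset handling described above can be simulated without a broker. The sketch below uses a hypothetical in-memory class, OffsetTrackingLoop, as a stand-in for one partition and its committed offset; it is not the Kafka client API. The handler returns false when processing is rejected (for example, because the breaker is open), and the loop then stops without advancing the offset, so the same record is read again on the next pass.

```java
import java.util.List;
import java.util.function.Predicate;

/** In-memory stand-in for one partition plus its committed offset (hypothetical). */
class OffsetTrackingLoop {
    private final List<String> log;
    private int committedOffset = 0; // index of the next record to process

    OffsetTrackingLoop(List<String> log) {
        this.log = log;
    }

    int committedOffset() {
        return committedOffset;
    }

    /**
     * Processes records in order, starting at the committed offset. Stops at the
     * first record the handler rejects and leaves the offset pointing at it, so
     * the next pass re-reads that record. Returns the number of records processed.
     */
    int pollOnce(Predicate<String> handler) {
        int processed = 0;
        while (committedOffset < log.size()) {
            String record = log.get(committedOffset);
            if (!handler.test(record)) {
                break; // rejected: do not advance past this record
            }
            committedOffset++; // "commit" only after successful processing
            processed++;
        }
        return processed;
    }
}
```

Note the ordering consequence: every record behind the rejected one waits too, which is exactly why a long-open breaker shows up as growing consumer lag.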

The most surprising aspect of applying circuit breakers to event-driven systems is that the breaker often sits on the consumer side, protecting the consumer’s ability to process an incoming event, rather than on a producer’s ability to send an event. This is because the bottleneck and failure point in event-driven systems are frequently the downstream processors struggling to keep up or being unavailable.

If the MarketingService fails to process events for a prolonged period, its Kafka consumer will stop committing offsets and consumer lag will grow on the partitions assigned to it. Because a record that cannot be processed blocks every record behind it on the same partition, unrelated events in that partition are delayed as well. Other consumer groups are unaffected, since each group tracks its own offsets, but a consumer that blocks too long between polls can exceed max.poll.interval.ms and trigger a rebalance. A robust strategy involves setting max.poll.records to a small number and implementing a retry mechanism with exponential backoff for individual message processing failures, before falling back to the consumer framework’s retry handling or a dead-letter queue.
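
As a rough sketch of the exponential backoff idea, the helper below retries a task with delays that double on each attempt up to a cap. BackoffRetry, delayFor, and callWithRetry are hypothetical names introduced for this example; a production version would typically add jitter and keep the total retry time well below max.poll.interval.ms.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** Illustrative retry helper with capped exponential backoff (hypothetical). */
class BackoffRetry {

    /** Delay after failed attempt number 'attempt' (1-based): base * 2^(attempt-1), capped at max. */
    static Duration delayFor(int attempt, Duration base, Duration max) {
        long factor = 1L << Math.min(attempt - 1, 20); // bound the shift to avoid overflow
        Duration delay = base.multipliedBy(factor);
        return delay.compareTo(max) > 0 ? max : delay;
    }

    /** Calls the task up to maxAttempts times, sleeping between failed attempts. */
    static <T> T callWithRetry(Callable<T> task, int maxAttempts, Duration base, Duration max)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception ex) {
                last = ex;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayFor(attempt, base, max).toMillis());
                }
            }
        }
        // All attempts failed: surface the last error so the caller can route
        // the message elsewhere (for example, to a dead-letter queue).
        throw last;
    }
}
```

With a base of 100 ms and a cap of 1 second, the delays after the first five failures are 100, 200, 400, 800, and 1000 ms.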

The next challenge you’ll face is managing events that are repeatedly failing and cannot be processed even after the circuit breaker has reset, leading you to explore dead-letter queue strategies.
