A saga is a sequence of local transactions where each transaction updates data within a single service and publishes a message or event to trigger the next transaction in the saga.

Let’s see a typical distributed saga flow in action. Imagine an e-commerce system where a customer places an order.

graph TD
    A[Order Service: Create Order] --> B{Payment Service: Process Payment};
    B --> C[Inventory Service: Reserve Stock];
    C --> D[Shipping Service: Schedule Shipment];
    D --> E[Order Service: Update Order Status to Shipped];

When an order is placed, the Order Service creates an order with a PENDING status and triggers the Payment Service to process the payment. If the payment succeeds, the Payment Service updates its internal state and triggers the Inventory Service to reserve the ordered items. If inventory is available, the Inventory Service reserves the stock and triggers the Shipping Service to schedule the shipment. Finally, once the shipment is scheduled, the Shipping Service notifies the Order Service, which updates the order status to SHIPPED. Each "trigger" here is an event published to a message broker rather than a direct synchronous call.

The beauty of this distributed saga is that each service is autonomous: it only knows about its own local transactions and how to react to events from other services. This avoids two-phase commit (2PC) across multiple services, which requires every participant to hold locks until a coordinator decides the outcome and is therefore hard to scale and sensitive to coordinator failure.

The core problem sagas solve is maintaining data consistency across multiple services without a global transaction manager. In a monolithic application, a single database transaction ensures atomicity. In a microservices architecture where each service owns its own database, no single transaction can span them. Sagas provide a pattern for reaching eventual consistency instead, with the trade-off that intermediate states are visible to other services while the saga is in progress.

The "central coordinator" in this context isn’t necessarily a single component that manages the entire transaction and becomes a single point of failure. There are two ways to coordinate a saga. In choreography, there is no coordinator service at all: each service acts as a participant, performs its local transaction, and publishes an event through a message broker (like Kafka or RabbitMQ) that other services react to. In orchestration, a dedicated orchestrator service tells each participant what to do and tracks the saga’s progress. The flow described here is choreography-based.
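For contrast with choreography, the orchestration idea can be sketched in plain Java. This is a minimal in-memory illustration, not a production design: `SagaStep` and `SagaOrchestrator` are hypothetical names, each step is modeled as a `Runnable`, and a real orchestrator would also need durable state and retries for failed compensations.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** One saga step: a forward action plus the compensation that undoes it (hypothetical type). */
final class SagaStep {
    final String name;
    final Runnable action;
    final Runnable compensation;

    SagaStep(String name, Runnable action, Runnable compensation) {
        this.name = name;
        this.action = action;
        this.compensation = compensation;
    }
}

/** Runs steps in order; if one fails, compensates the completed steps in reverse order. */
final class SagaOrchestrator {
    private final List<SagaStep> steps = new ArrayList<>();

    SagaOrchestrator add(SagaStep step) {
        steps.add(step);
        return this;
    }

    /** Returns true if every step succeeded, false if the saga was compensated. */
    boolean run() {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.action.run();
                completed.push(step); // most recently completed step ends up on top
            } catch (RuntimeException e) {
                System.out.println("Step failed: " + step.name + ", compensating");
                while (!completed.isEmpty()) {
                    completed.pop().compensation.run(); // undo in reverse order
                }
                return false;
            }
        }
        return true;
    }
}
```

With a payment step that succeeds and an inventory step that throws, `run()` would invoke the payment compensation and return `false`; the failed step's own compensation is not called because it never completed.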

Here’s how you might configure a service to listen for events and trigger subsequent actions. This example uses Spring Boot with Kafka:

// Inventory Service: consumes payment.processed and publishes inventory.reserved.
@KafkaListener(topics = "payment.processed", groupId = "inventory-service-group")
public void handlePaymentProcessed(PaymentProcessedEvent event) {
    // Local transaction: reserve the ordered items. If this throws, nothing is published below.
    inventoryService.reserveStock(event.getOrderId(), event.getItems());

    // Publish the follow-up event, using the order ID as the message key.
    InventoryReservedEvent reserved = new InventoryReservedEvent(event.getOrderId(), event.getItems());
    kafkaTemplate.send("inventory.reserved", event.getOrderId(), reserved);
}

In this snippet, the Inventory Service listens for payment.processed events. Upon receiving one, it attempts to reserve stock. If successful, it publishes an inventory.reserved event, which would then be consumed by the Shipping Service.

The flip side of this distributed nature is handling failures. If any step in the saga fails, compensating transactions must be initiated to undo the work already done. For instance, if the Inventory Service cannot reserve stock, it should publish an inventory.unavailable event. The Payment Service would then listen for this event and initiate a refund (its compensating transaction).

graph TD
    A[Order Service: Create Order] --> B{Payment Service: Process Payment};
    B --> C[Inventory Service: Reserve Stock];
    C -- Inventory Unavailable --> B;
    B -- Refund Payment --> A;
    A -- Cancel Order --> E[Order Service: Update Order Status to Cancelled];

Here, if the Inventory Service fails to reserve stock, it signals back. The Payment Service then performs a refund, and the Order Service updates the order status to CANCELLED. This compensation logic needs to be carefully designed for each step.
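The compensation chain above can be sketched with a tiny in-memory event bus standing in for the broker. Almost everything here is an assumption made for illustration: the `EventBus` class, the `payment.refunded` topic name, and the string statuses are hypothetical, and only the `inventory.unavailable` event comes from the flow described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Minimal single-threaded stand-in for a broker such as Kafka (hypothetical). */
final class EventBus {
    private final Map<String, List<Consumer<String>>> handlers = new HashMap<>();

    void subscribe(String topic, Consumer<String> handler) {
        handlers.computeIfAbsent(topic, t -> new ArrayList<>()).add(handler);
    }

    /** Delivers the order ID to every handler subscribed to the topic. */
    void publish(String topic, String orderId) {
        for (Consumer<String> handler : handlers.getOrDefault(topic, List.of())) {
            handler.accept(orderId);
        }
    }
}

/** Compensates a charge when inventory turns out to be unavailable. */
final class PaymentService {
    final Map<String, String> payments = new HashMap<>(); // orderId -> CHARGED | REFUNDED

    PaymentService(EventBus bus) {
        bus.subscribe("inventory.unavailable", orderId -> {
            payments.put(orderId, "REFUNDED");        // compensating transaction
            bus.publish("payment.refunded", orderId); // assumed topic name
        });
    }
}

/** Cancels the order once the payment has been refunded. */
final class OrderService {
    final Map<String, String> orders = new HashMap<>(); // orderId -> PENDING | CANCELLED

    OrderService(EventBus bus) {
        bus.subscribe("payment.refunded", orderId -> orders.put(orderId, "CANCELLED"));
    }
}
```

Publishing `inventory.unavailable` for an order that is CHARGED and PENDING would leave it REFUNDED in the Payment Service and CANCELLED in the Order Service. Neither service calls the other directly; each one only reacts to an event.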

The critical aspect of saga coordination, whether through choreography (event-driven) or orchestration (centralized state machine), is ensuring that compensating actions are reliably executed. A common pitfall is assuming that if a forward step succeeds, its compensation will be trivial. However, compensating a payment might involve complex refund processes, or compensating inventory reservation might require re-stocking logic.

When designing your compensating transactions, think about idempotency. If a service receives the same compensation event multiple times due to network issues or retries, it should only perform the compensation once. This is often achieved by checking the current state of the data before applying the compensation.
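One way to express the state check is as an atomic compare-and-set on the payment status, so a duplicate event finds the payment already refunded and does nothing. The sketch below is an assumption-laden illustration: `RefundHandler`, `PaymentStatus`, and the in-memory map are hypothetical stand-ins for a real payment store, where the same check would typically be a conditional update inside the local transaction.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

enum PaymentStatus { CHARGED, REFUNDED }

/** Idempotent compensation: a refund is issued at most once per order (hypothetical class). */
final class RefundHandler {
    private final Map<String, PaymentStatus> payments = new ConcurrentHashMap<>();
    private final AtomicInteger refundsIssued = new AtomicInteger();

    void recordCharge(String orderId) {
        payments.put(orderId, PaymentStatus.CHARGED);
    }

    /** Returns true only when this call actually performed the refund. */
    boolean handleInventoryUnavailable(String orderId) {
        // Atomically move CHARGED -> REFUNDED. A duplicate event sees REFUNDED
        // (or no payment at all) and the replace fails, so nothing happens twice.
        boolean transitioned =
            payments.replace(orderId, PaymentStatus.CHARGED, PaymentStatus.REFUNDED);
        if (transitioned) {
            refundsIssued.incrementAndGet(); // stand-in for the real refund call
        }
        return transitioned;
    }

    int refundsIssued() {
        return refundsIssued.get();
    }
}
```

Delivering the same compensation event twice for one order would return `true` the first time and `false` the second, with exactly one refund issued.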

The next challenge you’ll likely encounter is managing the complexity of long-running sagas and ensuring auditability. Keeping track of the state of a multi-step, multi-service transaction across days or even weeks requires robust logging and monitoring.
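As a starting point for auditability, each service (or an orchestrator) could append a record for every step outcome to a saga log. The following is a minimal in-memory sketch under stated assumptions: `SagaLog`, `SagaLogEntry`, and the outcome strings are hypothetical, and a real system would write these records to durable storage.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** One immutable audit record for a saga step (hypothetical shape). */
final class SagaLogEntry {
    final String sagaId;
    final String step;
    final String outcome; // e.g. "COMPLETED", "FAILED", "COMPENSATED"
    final Instant recordedAt;

    SagaLogEntry(String sagaId, String step, String outcome, Instant recordedAt) {
        this.sagaId = sagaId;
        this.step = step;
        this.outcome = outcome;
        this.recordedAt = recordedAt;
    }
}

/** Append-only, in-memory saga log; entries are never updated or removed. */
final class SagaLog {
    private final List<SagaLogEntry> entries = new ArrayList<>();

    synchronized void record(String sagaId, String step, String outcome) {
        entries.add(new SagaLogEntry(sagaId, step, outcome, Instant.now()));
    }

    /** Entries for one saga, in the order they were recorded. */
    synchronized List<SagaLogEntry> history(String sagaId) {
        List<SagaLogEntry> result = new ArrayList<>();
        for (SagaLogEntry entry : entries) {
            if (entry.sagaId.equals(sagaId)) {
                result.add(entry);
            }
        }
        return result;
    }

    /** Most recently recorded outcome for a saga, or null if the saga is unknown. */
    synchronized String latestOutcome(String sagaId) {
        List<SagaLogEntry> history = history(sagaId);
        return history.isEmpty() ? null : history.get(history.size() - 1).outcome;
    }
}
```

Because the log is append-only, the full sequence of forward steps and compensations for a given saga ID can be reconstructed later, which is the property an audit trail needs.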
