Two-Phase Commit (2PC) is a distributed algorithm that allows a group of nodes to agree on whether to commit or abort a transaction, ensuring atomicity across multiple systems.

Let’s watch 2PC in action with a simplified example involving two services, OrderService and PaymentService.

Imagine a user wants to place an order and pay for it. This is a single logical transaction, but it spans two separate services, each with its own data store.

Here’s a conceptual flow:

  1. Client Initiates: The client sends a request to OrderService to create an order and process payment.
  2. Coordinator (OrderService): OrderService acts as the coordinator.
    • It begins by creating the order in its local database, marking it as "pending."
    • It then sends a "prepare" request to PaymentService.
  3. Participants (PaymentService): PaymentService receives the "prepare" request.
    • It attempts to tentatively deduct the payment from the user’s account (for example, by placing a hold on the funds) without making the change final.
    • If successful, it writes a "prepared" record to its own transaction log, indicating it’s ready to commit. It then responds "yes" (prepared) to OrderService.
    • If it fails (e.g., insufficient funds), it writes an "aborted" record and responds "no" (aborted) to OrderService.
  4. Coordinator Decides:
    • If OrderService receives "yes" from all participants (in this case, just PaymentService), it decides to commit. It writes a "commit" record to its own log and then sends a "commit" message to PaymentService.
    • If OrderService receives "no" from any participant, or if a participant times out, it decides to abort. It writes an "abort" record to its log and sends an "abort" message to all participants.
  5. Participants Act:
    • PaymentService receives the "commit" or "abort" message.
    • If "commit," it finalizes the payment (e.g., marks the transaction as complete in its database).
    • If "abort," it rolls back the payment deduction.
    • It then sends an "acknowledgement" (ACK) back to the coordinator.
  6. Coordinator Finalizes: Once the coordinator receives ACKs from all participants, the transaction is considered complete.
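The six steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not a production implementation: the class names PaymentParticipant and OrderCoordinator, the method names, and the "confirmed"/"cancelled" order statuses are assumptions made for the example, the Python lists stand in for durable logs, and network calls and timeouts are not modeled.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class PaymentParticipant:
    """Hypothetical in-memory stand-in for PaymentService."""

    def __init__(self, balance):
        self.balance = balance
        self.held = 0   # funds reserved by a prepared transaction
        self.log = []   # stand-in for a durable transaction log

    def prepare(self, amount):
        # Step 3: try to reserve the funds and vote.
        if amount > self.balance:
            self.log.append("aborted")
            return Vote.NO
        self.balance -= amount
        self.held = amount
        self.log.append("prepared")
        return Vote.YES

    def commit(self):
        # Step 5: finalize the payment and acknowledge.
        self.held = 0
        self.log.append("committed")
        return "ack"

    def abort(self):
        # Step 5: roll back the reservation (no-op if already aborted).
        if self.log and self.log[-1] == "aborted":
            return "ack"
        self.balance += self.held
        self.held = 0
        self.log.append("aborted")
        return "ack"


class OrderCoordinator:
    """Hypothetical in-memory stand-in for OrderService as coordinator."""

    def __init__(self, participants):
        self.participants = participants
        self.order_status = None
        self.log = []

    def place_order(self, amount):
        self.order_status = "pending"                           # step 2
        votes = [p.prepare(amount) for p in self.participants]  # phase 1
        if all(v is Vote.YES for v in votes):                   # step 4
            self.log.append("commit")   # record the decision before sending it
            acks = [p.commit() for p in self.participants]
            self.order_status = "confirmed"
        else:
            self.log.append("abort")
            acks = [p.abort() for p in self.participants]
            self.order_status = "cancelled"
        assert all(a == "ack" for a in acks)                    # step 6
        return self.order_status
```

With a balance of 100, `OrderCoordinator([PaymentParticipant(100)]).place_order(30)` ends in "confirmed" and leaves the participant log as "prepared" followed by "committed"; with a balance of 10, the participant votes no and the order ends in "cancelled".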

The core problem 2PC solves is achieving atomicity in distributed systems: either all parts of a transaction succeed, or none of them do. Without it, you could end up with an order placed but no payment, or payment made but no order recorded, leading to data inconsistency.

The "two phases" are:

  • Phase 1 (Prepare): The coordinator asks all participants whether they are ready to commit. Participants respond "yes" if they can guarantee they can commit, and "no" if they cannot. Before answering "yes," a participant must record its "prepared" state durably (e.g., in a write-ahead log) so it can recover and complete the transaction even if it crashes after preparing but before receiving the final commit/abort decision.
  • Phase 2 (Commit/Abort): Based on the participants’ responses, the coordinator decides whether to commit or abort the transaction and records that decision in its own log. It then informs all participants of the decision. Participants perform the commit or rollback and acknowledge completion.
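The durability requirement in the prepare phase is the detail that is easiest to get wrong, so here is a small sketch of it. It assumes a hypothetical log format of one JSON object per line; the function names append_durably and prepare and the can_commit flag are made up for the example. The point it illustrates is only the ordering: the "prepared" record is flushed and fsynced before the "yes" vote is returned.

```python
import json
import os


def append_durably(log_path, record):
    """Append one JSON record and force it to disk before returning."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push its buffer to the disk


def prepare(log_path, tx_id, can_commit):
    """Vote in phase 1; the log record is durable before the vote is returned."""
    if not can_commit:
        append_durably(log_path, {"tx": tx_id, "state": "aborted"})
        return "no"
    append_durably(log_path, {"tx": tx_id, "state": "prepared"})
    return "yes"
```

If the vote were sent before the record reached disk, a crash in between could leave the coordinator believing the participant is prepared while the participant has no memory of the transaction.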

Consider the internal state. A participant in 2PC can be in one of three states: INIT, PREPARED, or DONE (committed or aborted). The coordinator also tracks the state of each participant. When a participant crashes and recovers, it checks its log. If it finds a PREPARED record with no later outcome, it knows it must eventually receive a commit or abort message from the coordinator and will block until it does, to maintain atomicity. If it finds a DONE record, it knows the transaction is finished and it can safely discard the transaction’s state.
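A participant’s recovery logic can be sketched as a replay over its log. The record shape (a dict with "tx" and "state" keys), the State enum, and the function names below are assumptions for illustration; a transaction with no log record at all is treated as still being in INIT.

```python
from enum import Enum


class State(Enum):
    INIT = "init"
    PREPARED = "prepared"
    DONE = "done"


def recover_states(records):
    """Replay log records in order; the last record per transaction wins."""
    states = {}
    for rec in records:
        if rec["state"] == "prepared":
            states[rec["tx"]] = State.PREPARED
        elif rec["state"] in ("committed", "aborted"):
            states[rec["tx"]] = State.DONE
    return states


def recovery_action(state):
    """What a recovering participant may do for a transaction in this state."""
    if state is State.INIT:
        return "abort unilaterally"        # never voted yes, so aborting is safe
    if state is State.PREPARED:
        return "wait for coordinator decision"
    return "discard transaction state"     # outcome is already recorded


log = [
    {"tx": "t1", "state": "prepared"},
    {"tx": "t1", "state": "committed"},
    {"tx": "t2", "state": "prepared"},
]
states = recover_states(log)
actions = {tx: recovery_action(states.get(tx, State.INIT)) for tx in ("t1", "t2", "t3")}
```

Here t1 has an outcome and can be discarded, t2 is prepared without an outcome and must wait, and t3 never reached the log and can be aborted locally.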

The crucial part of the "prepare" phase for a participant is that once it responds "yes," it cannot unilaterally decide to abort later. It must be able to commit if instructed. This is why participants durably log their PREPARED state. If the participant crashes before preparing, it can simply abort the transaction and respond "no." If it crashes after preparing, it’s the coordinator’s job to ensure it eventually gets the final decision.

The main weakness of 2PC is that it is a blocking protocol. If the coordinator crashes after sending "prepare" but before sending "commit" or "abort," all participants that responded "yes" are stuck. They cannot proceed, cannot release the locks and other resources they hold, and cannot learn the outcome until the coordinator recovers or an operator intervenes manually. This is a significant operational burden in large-scale systems.
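The blocking behavior follows directly from the participant’s state machine, which the sketch below tries to make concrete. The Participant class, its method names, and the locks_held flag are hypothetical; the timeout handler simply reports what the participant is allowed to do when it has not heard from the coordinator.

```python
class Participant:
    """Hypothetical participant that tracks its 2PC state and held locks."""

    def __init__(self):
        self.state = "INIT"
        self.outcome = None
        self.locks_held = False

    def on_prepare(self):
        # Voting yes means holding locks until the decision arrives.
        self.locks_held = True
        self.state = "PREPARED"
        return "yes"

    def on_decision(self, decision):
        # decision is "commit" or "abort", sent by the coordinator.
        self.outcome = decision
        self.state = "DONE"
        self.locks_held = False

    def on_timeout(self):
        # Called when the coordinator has been silent for too long.
        if self.state == "INIT":
            # Never voted yes, so aborting unilaterally is safe.
            self.outcome = "abort"
            self.state = "DONE"
            return "aborted"
        if self.state == "PREPARED":
            # Already voted yes: neither commit nor abort is safe to guess.
            return "blocked"
        return "done"
```

A participant that has called `on_prepare()` returns "blocked" from every `on_timeout()` and keeps `locks_held` set to True until `on_decision(...)` finally arrives, which is exactly the stuck state described above.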

The next challenge you’ll face is handling the recovery scenarios, in particular what happens to prepared participants, and to the other transactions waiting on the locks they hold, while the coordinator is down.
