Consensus Algorithms: Paxos, Raft, PBFT Explained

The most surprising thing about consensus algorithms is that they don’t actually guarantee agreement; they guarantee that if agreement is reached, it’s consistent.

Let’s see Raft in action. Imagine a simple key-value store replicated across three servers (nodes 1, 2, and 3). When a client wants to set a value, say set user:123 name:Alice, the request first hits the leader. If node 1 is the leader, it appends this command to its log and then sends AppendEntries RPCs to nodes 2 and 3, essentially saying "here’s a new log entry, please replicate it."

// Client sends: set user:123 name:Alice to node 1 (leader)

// Node 1 (Leader)
// Appends to its log:
// Index | Term | Command
// ----------------------
// 5     | 3    | set user:123 name:Alice

// Node 1 sends AppendEntries RPC to Node 2 and Node 3
// RPC payload includes: leaderId, term, prevLogIndex, prevLogTerm, entries[], leaderCommit

// Node 2 (Follower) receives AppendEntries
// If prevLogIndex and prevLogTerm match its log, it appends the entry and replies success.
// Node 2 log:
// Index | Term | Command
// ----------------------
// 5     | 3    | set user:123 name:Alice

// Node 3 (Follower) receives AppendEntries
// If prevLogIndex and prevLogTerm don't match (e.g., network delay, node restarted),
// it replies failure. Node 1 will then decrement its nextIndex for Node 3 and retry.
// Node 3 log: (still has old entry at index 5)
// Index | Term | Command
// ----------------------
// 5     | 2    | get user:456

// After Node 2 replies success, Node 1 knows the entry is replicated on a majority (itself + Node 2).
// Node 1 commits the entry and applies it to its state machine.
// Node 1 state machine: { user:123: { name:Alice } }

// Node 1 then sends a *new* AppendEntries RPC to Node 2 and Node 3.
// This new RPC includes the `commitIndex` value (which is 5).
// Node 2 receives this and commits entry 5.
// Node 2 applies the command to its state machine: { user:123: { name:Alice } }

// Node 3 eventually catches up and receives the commit notification.
// Node 3 applies the command to its state machine: { user:123: { name:Alice } }

The core problem these algorithms solve is achieving replicated state machine functionality. You have a service (like our key-value store) and you want to run identical copies on multiple machines. To keep them in sync, you need a way for all machines to agree on the sequence of operations that modify the state. This sequence is the "log." Consensus algorithms ensure that all non-faulty machines will eventually have the same log, and therefore the same state.

Paxos, Raft, and ZAB (used by ZooKeeper) are different approaches to this problem. Raft was designed for understandability, breaking down consensus into leader election, log replication, and safety. Paxos is more fundamental but notoriously complex to implement correctly. ZAB is similar to Raft but has specific optimizations for ZooKeeper’s use case, particularly around recovery.

In Raft, a cluster always has a single leader. If the leader fails, the remaining nodes elect a new one. The leader is responsible for receiving all client requests, appending them to its log, and replicating those log entries to followers. A log entry is considered "committed" once it’s replicated to a majority of servers. Only committed entries are applied to the state machine. This majority rule is crucial: it ensures that any given committed entry will be present on at least one server that will become a future leader, preventing loss of committed data.

The leader election process is driven by terms, which are monotonically increasing logical clocks. When a server starts up or loses contact with the leader, it becomes a candidate and starts a new election term. It sends RequestVote RPCs to other servers. If a candidate receives votes from a majority of servers, it becomes the leader for that term. Followers will only vote for a candidate if its log is at least as up-to-date as their own, ensuring that the elected leader has all previously committed entries.

The mechanism that prevents divergence is the prevLogIndex and prevLogTerm fields in the AppendEntries and RequestVote RPCs. When a leader sends log entries to a follower, it includes the index and term of the entry immediately preceding the new entries. The follower checks if it has an entry at that index with that term. If it doesn’t, it means the follower’s log has diverged (perhaps due to a previous leader failure or network partition). The leader then knows to decrement its nextIndex for that follower and resend the request with earlier log entries until consistency is found. This "log matching property" guarantees that no two logs in the system can have different entries at the same index.

What most people don’t realize is how much Raft relies on the state of the log entries during leader election. A follower will only vote for a candidate if the candidate’s log is at least as up-to-date as its own. "Up-to-date" is defined as having the highest index and, among logs with the same highest index, having the latest term. This rule is critical for safety: if a leader is elected, it must have all the entries that have been committed so far. If a candidate’s log is older, it cannot be elected, preventing a situation where a new leader might overwrite previously committed data.

Once all nodes have successfully replicated and committed all log entries up to index 5, the next challenge is efficiently handling client requests that involve read-only operations.