The core idea of distributed consensus algorithms isn’t about agreeing on a value, but agreeing on the order of operations.
Let’s see Raft in action. Imagine you have three servers, s1, s2, and s3, trying to agree on a sequence of commands.
// Initial state: All servers are followers
{
"s1": {"state": "follower", "term": 0, "log": []},
"s2": {"state": "follower", "term": 0, "log": []},
"s3": {"state": "follower", "term": 0, "log": []}
}
Now, s1 decides it wants to become a leader. It increments its term and starts sending AppendEntries RPCs to s2 and s3, asking for votes.
// s1 becomes candidate, then leader
{
"s1": {"state": "leader", "term": 1, "log": []},
"s2": {"state": "follower", "term": 1, "log": []},
"s3": {"state": "follower", "term": 1, "log": []}
}
A client sends a command: "SET key=value". The leader s1 appends this to its log and then sends AppendEntries RPCs to s2 and s3 with this new entry.
// s1 receives command, appends to its log, and replicates
{
"s1": {"state": "leader", "term": 1, "log": [{"index": 1, "term": 1, "command": "SET key=value"}]},
"s2": {"state": "follower", "term": 1, "log": []},
"s3": {"state": "follower", "term": 1, "log": []}
}
When s2 and s3 acknowledge receiving the entry, s1 knows it’s committed. It then applies the command to its state machine and responds to the client.
// s2 and s3 acknowledge receipt
{
"s1": {"state": "leader", "term": 1, "log": [{"index": 1, "term": 1, "command": "SET key=value"}]},
"s2": {"state": "follower", "term": 1, "log": [{"index": 1, "term": 1, "command": "SET key=value"}]},
"s3": {"state": "follower", "term": 1, "log": [{"index": 1, "term": 1, "command": "SET key=value"}]}
}
// s1 applies the command and responds to client
{
"s1": {"state": "leader", "term": 1, "log": [{"index": 1, "term": 1, "command": "SET key=value"}], "state_machine": {"key": "value"}},
"s2": {"state": "follower", "term": 1, "log": [{"index": 1, "term": 1, "command": "SET key=value"}], "state_machine": {}},
"s3": {"state": "follower", "term": 1, "log": [{"index": 1, "term": 1, "command": "SET key=value"}], "state_machine": {}}
}
The key problem these algorithms solve is achieving consistency in a distributed system where nodes can fail or network partitions can occur. They ensure that all nodes eventually agree on the same sequence of operations, even in the face of these failures. Raft achieves this through a leader-based approach, where a single leader is responsible for managing the replicated log. Paxos uses a more decentralized, ballot-based approach. ZAB, used by Apache ZooKeeper, is similar to Raft but has some specific optimizations for ZooKeeper’s use case, notably its handling of leader failure and log recovery.
The most surprising thing about these algorithms is how they manage log divergence. When a leader fails and a new one is elected, the new leader might have a log that’s different from some followers. Instead of trying to reconcile complex differences, Raft’s leaders simply force their logs onto followers. They send entries from their log, and if a follower’s log is inconsistent, the leader tells it, "No, your log is wrong. Here’s the correct entry at this index." The follower then truncates its log and accepts the leader’s entry. This is a powerful, albeit blunt, way to ensure consistency.
The core of Raft’s safety relies on the fact that once an entry is committed (replicated to a majority of nodes), it will never be overwritten or deleted. This is guaranteed by the election mechanism and the AppendEntries RPC. A candidate can only be elected if it receives votes from a majority of servers, and those servers must have logs that are at least as up-to-date as the candidate’s. This ensures that the new leader’s log is always a superset of any previously committed entries.
Most people focus on the leader election and log replication, but the real magic is in the commit index. An entry is considered committed only when it has been replicated to a majority of servers and the leader knows this. The leader then advances its commitIndex, and subsequent AppendEntries RPCs will include this commitIndex. Followers, upon receiving an AppendEntries RPC containing a commitIndex greater than their own, will also advance their commitIndex. This mechanism is what prevents committed entries from being lost.
The next challenge you’ll face is understanding how these algorithms handle network partitions and the complexities of log compaction.