The most surprising thing about Raft is that it prioritizes understandability over raw performance, and that design choice is a large part of why it’s so widely adopted.

Let’s watch Raft in action. Imagine a distributed key-value store. We have three servers, node1, node2, and node3, participating in a Raft cluster. Initially, none of them are leaders.

// State: All nodes are Followers
node1: Follower, Term: 0, Log: []
node2: Follower, Term: 0, Log: []
node3: Follower, Term: 0, Log: []

A follower waits for a randomized election timeout. If it doesn’t hear from a leader within that time, it becomes a candidate. Let’s say node1 times out first.

node1 becomes Candidate:

node1 increments its term to 1, votes for itself, and sends RequestVote RPCs to node2 and node3.

// node1 is now Candidate
node1: Candidate, Term: 1, Log: [], VotesReceived: 1 (self)
node2: Follower, Term: 0, Log: []
node3: Follower, Term: 0, Log: []
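
The candidate transition can be sketched in a few lines of Python. This is a minimal illustration rather than a real implementation: the `Node` class, the `start_election` function, and the message field names (loosely modeled on the RequestVote arguments described in the Raft paper) are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    # Hypothetical in-memory view of one server's Raft state.
    node_id: str
    role: str = "Follower"
    term: int = 0
    voted_for: Optional[str] = None
    log: list = field(default_factory=list)
    votes_received: set = field(default_factory=set)


def start_election(node, peers):
    """Turn a follower into a candidate and build its RequestVote messages."""
    node.role = "Candidate"
    node.term += 1                        # every election starts a new term
    node.voted_for = node.node_id         # the candidate votes for itself
    node.votes_received = {node.node_id}
    last_index = len(node.log)            # log indices are 1-based; 0 means empty
    last_term = node.log[-1]["term"] if node.log else 0
    return [
        {
            "to": peer,
            "term": node.term,
            "candidate_id": node.node_id,
            "last_log_index": last_index,
            "last_log_term": last_term,
        }
        for peer in peers
    ]


node1 = Node("node1")
requests = start_election(node1, ["node2", "node3"])
print(node1.role, node1.term, len(requests))
```

The messages are returned instead of sent so that the sketch stays free of any networking code.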

node2 and node3 receive RequestVote:

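A follower grants its vote only if three conditions hold: the candidate’s term is at least as large as its own, it has not already voted for a different candidate in that term, and the candidate’s log is at least as up-to-date as its own. If the candidate’s term is greater than the follower’s current term, the follower first updates its term. In this walkthrough every log is empty, so the log check passes trivially.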

node2 receives RequestVote from node1. node1’s term (1) is greater than node2’s term (0). node2 updates its term to 1, votes for node1, and sends voteGranted: true back.

// node2 updates state
node1: Candidate, Term: 1, Log: [], VotesReceived: 1 (self)
node2: Follower, Term: 1, Log: [], VotedFor: node1
node3: Follower, Term: 0, Log: []

node3 also receives RequestVote. It updates its term to 1, votes for node1, and sends voteGranted: true back.

// node3 updates state
node1: Candidate, Term: 1, Log: [], VotesReceived: 2 (self, node2)
node2: Follower, Term: 1, Log: [], VotedFor: node1
node3: Follower, Term: 1, Log: [], VotedFor: node1
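
The vote-granting decision on the follower side might look like the sketch below. The dictionary layout and the name `handle_request_vote` are hypothetical; the checks themselves (term, one vote per term, log comparison) follow the rules described in the Raft paper.

```python
def handle_request_vote(state, req):
    """Decide whether this server grants its vote. Mutates `state` in place."""
    # Reject candidates from an older term.
    if req["term"] < state["term"]:
        return {"term": state["term"], "vote_granted": False}

    # A newer term resets the vote for this server.
    if req["term"] > state["term"]:
        state["term"] = req["term"]
        state["voted_for"] = None

    # Compare (last term, last index): term first, then log length.
    my_last_term = state["log"][-1]["term"] if state["log"] else 0
    my_last_index = len(state["log"])
    log_ok = (req["last_log_term"], req["last_log_index"]) >= (my_last_term, my_last_index)

    # At most one vote per term (repeating a vote for the same candidate is fine).
    can_vote = state["voted_for"] in (None, req["candidate_id"])

    if can_vote and log_ok:
        state["voted_for"] = req["candidate_id"]
        return {"term": state["term"], "vote_granted": True}
    return {"term": state["term"], "vote_granted": False}


node2 = {"term": 0, "voted_for": None, "log": []}
from_node1 = {"term": 1, "candidate_id": "node1", "last_log_index": 0, "last_log_term": 0}
from_node3 = {"term": 1, "candidate_id": "node3", "last_log_index": 0, "last_log_term": 0}

first_reply = handle_request_vote(node2, from_node1)   # vote granted
second_reply = handle_request_vote(node2, from_node3)  # refused: already voted in term 1
```

The second request is refused because node2 has already voted in term 1, which is what keeps two candidates from both collecting a majority in the same term.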

node1 becomes Leader:

When node1 receives votes from a majority of servers (2 out of 3 in this case), it becomes the leader for term 1. It immediately sends AppendEntries RPCs (heartbeats) to all followers to assert its leadership and prevent new elections.

// node1 is now Leader
node1: Leader, Term: 1, Log: []
node2: Follower, Term: 1, Log: [], VotedFor: node1
node3: Follower, Term: 1, Log: [], VotedFor: node1
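
The majority arithmetic is simple enough to write down directly. The helper names here are made up for illustration; the candidate’s own vote counts toward the total.

```python
def quorum_size(cluster_size):
    """Smallest number of servers that forms a strict majority."""
    return cluster_size // 2 + 1


def has_majority(votes, cluster_size):
    return votes >= quorum_size(cluster_size)


for n in (3, 4, 5):
    print(n, quorum_size(n))  # prints "3 2", "4 3", "5 3"
```

A four-node cluster needs three votes, the same as a five-node cluster, which is one reason odd cluster sizes are the usual choice.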

Now, let’s say a client wants to set a key-value pair: SET key1 valueA.

Client sends command to Leader (node1):

node1 appends the command as a new entry to its log.

// node1 appends command
node1: Leader, Term: 1, Log: [ { command: "SET key1 valueA", index: 1, term: 1 } ]
node2: Follower, Term: 1, Log: [], VotedFor: node1
node3: Follower, Term: 1, Log: [], VotedFor: node1
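
On the leader side, accepting a command and preparing the replication message could be sketched as follows. The function names and the `next_index` parameter (the index of the first entry the leader still has to send to a given follower) are assumptions for this example; the `prev_log_index` and `prev_log_term` fields come from the AppendEntries RPC in the Raft paper and are not shown in the state dumps here.

```python
def append_command(log, term, command):
    """Leader appends a client command as a new entry (indices are 1-based)."""
    entry = {"command": command, "index": len(log) + 1, "term": term}
    log.append(entry)
    return entry


def build_append_entries(leader_id, term, log, next_index, commit_index):
    """Build the AppendEntries message for one follower."""
    prev_index = next_index - 1
    prev_term = log[prev_index - 1]["term"] if prev_index > 0 else 0
    return {
        "term": term,
        "leader_id": leader_id,
        "prev_log_index": prev_index,   # entry just before the ones being sent
        "prev_log_term": prev_term,
        "entries": log[prev_index:],    # everything from next_index onward
        "leader_commit": commit_index,
    }


leader_log = []
append_command(leader_log, term=1, command="SET key1 valueA")
rpc = build_append_entries("node1", 1, leader_log, next_index=1, commit_index=0)
print(rpc["entries"])
```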

Leader replicates log entries:

node1 sends AppendEntries RPCs to node2 and node3, including the new log entry.

// node1 sends AppendEntries to node2 and node3
// node2 and node3 receive AppendEntries
node1: Leader, Term: 1, Log: [ { command: "SET key1 valueA", index: 1, term: 1 } ]
node2: Follower, Term: 1, Log: [ { command: "SET key1 valueA", index: 1, term: 1 } ] // after appending
node3: Follower, Term: 1, Log: [ { command: "SET key1 valueA", index: 1, term: 1 } ] // after appending
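
A follower does more than blindly append. The sketch below, with a hypothetical `handle_append_entries` function and dictionary-based state, includes the consistency check from the Raft paper: the follower accepts new entries only if its log already contains the entry that immediately precedes them.

```python
def handle_append_entries(state, req):
    """Follower-side AppendEntries handling. Mutates `state` in place."""
    if req["term"] < state["term"]:
        return {"term": state["term"], "success": False}  # stale leader
    if req["term"] > state["term"]:
        state["term"] = req["term"]
        state["voted_for"] = None

    log = state["log"]
    prev_index = req["prev_log_index"]

    # Consistency check: our log must contain the preceding entry.
    if prev_index > 0:
        if len(log) < prev_index or log[prev_index - 1]["term"] != req["prev_log_term"]:
            return {"term": state["term"], "success": False}

    # Append new entries, truncating our log at the first conflict.
    for offset, entry in enumerate(req["entries"]):
        index = prev_index + offset + 1
        if len(log) >= index:
            if log[index - 1]["term"] != entry["term"]:
                del log[index - 1:]
                log.append(entry)
        else:
            log.append(entry)

    # Learn the commit point from the leader, capped at the last new entry.
    if req["leader_commit"] > state["commit_index"]:
        last_new = prev_index + len(req["entries"])
        state["commit_index"] = min(req["leader_commit"], last_new)
    return {"term": state["term"], "success": True}


node2 = {"term": 1, "voted_for": "node1", "log": [], "commit_index": 0}
entry = {"command": "SET key1 valueA", "index": 1, "term": 1}
reply = handle_append_entries(node2, {
    "term": 1, "leader_id": "node1",
    "prev_log_index": 0, "prev_log_term": 0,
    "entries": [entry], "leader_commit": 0,
})
print(reply["success"], len(node2["log"]))
```

Resending the same message leaves the log unchanged, so retries after a lost reply are safe in this sketch.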

Committing the entry:

Once the entry is stored on a majority of servers, it is considered committed. The leader counts itself, so in this three-node cluster a successful acknowledgment from just one follower is enough. node1 then applies the command to its state machine.

// node1 applies command
node1: Leader, Term: 1, Log: [ { command: "SET key1 valueA", index: 1, term: 1 } ], State: { key1: valueA }
node2: Follower, Term: 1, Log: [ { command: "SET key1 valueA", index: 1, term: 1 } ]
node3: Follower, Term: 1, Log: [ { command: "SET key1 valueA", index: 1, term: 1 } ]

The leader then notifies followers of the commit point in a subsequent AppendEntries (or heartbeat) RPC. Followers, upon receiving this notification and having the entry in their log, also apply the command to their state machines. This ensures all replicas agree on the state.
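
Commit and apply can be sketched as two small functions. The names `advance_commit_index`, `apply_committed`, and `match_index` (the highest log index known to be stored on each server, leader included) are assumptions for this example, and the command parser only understands the `SET key value` form used in this walkthrough.

```python
def advance_commit_index(log, current_term, commit_index, match_index):
    """Return the highest index stored on a majority, or the old commit index."""
    cluster_size = len(match_index)
    for n in range(len(log), commit_index, -1):
        replicated = sum(1 for m in match_index.values() if m >= n)
        # Following the Raft paper, only entries from the leader's current
        # term are committed by counting replicas.
        if replicated > cluster_size // 2 and log[n - 1]["term"] == current_term:
            return n
    return commit_index


def apply_committed(log, state_machine, last_applied, commit_index):
    """Apply every committed but not yet applied entry, in log order."""
    while last_applied < commit_index:
        last_applied += 1
        op, key, value = log[last_applied - 1]["command"].split()
        if op == "SET":
            state_machine[key] = value
    return last_applied


log = [{"command": "SET key1 valueA", "index": 1, "term": 1}]
match_index = {"node1": 1, "node2": 1, "node3": 0}  # node3 has not replied yet

commit_index = advance_commit_index(log, current_term=1, commit_index=0, match_index=match_index)
kv = {}
last_applied = apply_committed(log, kv, last_applied=0, commit_index=commit_index)
print(commit_index, kv)  # prints "1 {'key1': 'valueA'}"
```

Here the entry commits with node3 still behind, because node1 and node2 already form a majority of three.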

Raft’s core loop is this: elect a leader if there is none, then replicate the log. The leader handles all client requests and is responsible for bringing every follower’s log into agreement with its own. If the leader fails, followers time out, start a new election, and a new leader is chosen. A candidate becomes leader only if it receives votes from a majority of servers, which guarantees at most one leader per term. Each voter also refuses candidates whose log is less up-to-date than its own, so the winner’s log is at least as up-to-date as the logs of a majority of the cluster. Split votes can still happen; the randomized election timeouts make them rare and quick to resolve.

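The mechanism for determining if a candidate’s log is "up-to-date" is crucial. Each voter compares the last entry of the candidate’s log with the last entry of its own. The log whose last entry has the higher term is more up-to-date; if the last terms are equal, the longer log (the one with the higher last index) is more up-to-date. A voter rejects any candidate whose log is less up-to-date than its own. Because a committed entry is stored on a majority of servers and a candidate needs votes from a majority, at least one voter in any election holds every committed entry. This rule prevents a candidate with an older log from becoming leader and overwriting entries that are already committed.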
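
A short worked comparison makes the ordering concrete. The function names are hypothetical and the logs are reduced to entry terms only; the point is that the last entry’s term is compared before log length, as specified in the Raft paper.

```python
def last_term_and_index(log):
    """Return (term of last entry, index of last entry); (0, 0) for an empty log."""
    return (log[-1]["term"], len(log)) if log else (0, 0)


def candidate_is_up_to_date(candidate_log, voter_log):
    # Tuples compare left to right: term first, then index.
    return last_term_and_index(candidate_log) >= last_term_and_index(voter_log)


long_but_old = [{"term": 1}, {"term": 1}, {"term": 1}]   # last entry: term 1, index 3
short_but_new = [{"term": 1}, {"term": 2}]               # last entry: term 2, index 2

print(candidate_is_up_to_date(long_but_old, short_but_new))   # False
print(candidate_is_up_to_date(short_but_new, long_but_old))   # True
```

The longer log loses here because its last entry comes from an older term; comparing indices alone would get this case wrong.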

The concept of "committed" entries is what guarantees safety. An entry is committed once the leader that created it has replicated it to a majority of servers; the leader records this by advancing its commit index. Only committed entries are applied to the state machine. Combined with the election restriction, this ensures that once the system considers an operation complete, every future leader will have it in its log, so it is not lost even if the current leader crashes.

The randomized election timeout is critical for stability. If all followers used the same fixed timeout, they would tend to time out simultaneously, split the vote, and repeat the election again and again without producing a stable leader. Choosing each timeout at random from a fixed interval (the Raft paper suggests 150 to 300 ms as an example) makes it likely that one server times out first and wins the election before the others become candidates.
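
Picking the timeout is a one-liner. In this sketch the 150 to 300 ms range is the example interval from the Raft paper, not a requirement, and the function name and the seeded generator are choices made only to keep the example reproducible.

```python
import random


def random_election_timeout(rng, low_ms=150.0, high_ms=300.0):
    """Pick a fresh election timeout uniformly from [low_ms, high_ms]."""
    return rng.uniform(low_ms, high_ms)


rng = random.Random(42)  # seeded only so repeated runs print the same values
timeouts = {name: random_election_timeout(rng) for name in ("node1", "node2", "node3")}

# The server with the shortest timeout becomes a candidate first.
first = min(timeouts, key=timeouts.get)
print(first, round(timeouts[first], 1))
```

A real server would draw a new timeout each time it resets its timer, so an unlucky tie is unlikely to repeat.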

The leader’s responsibility to send heartbeats (empty AppendEntries RPCs) at regular intervals is what keeps followers from timing out and initiating new elections. If a follower doesn’t receive a heartbeat within its election timeout, it assumes the leader has failed and starts its own election.
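
The follower’s side of this can be modeled as a deadline that every heartbeat pushes forward. `ElectionTimer` is a hypothetical helper; timestamps are passed in explicitly (in milliseconds) so the sketch does not depend on a real clock.

```python
class ElectionTimer:
    """Tracks when a follower should give up on the current leader."""

    def __init__(self, timeout_ms, now_ms=0):
        self.timeout_ms = timeout_ms
        self.deadline_ms = now_ms + timeout_ms

    def on_heartbeat(self, now_ms):
        # Any valid AppendEntries from the leader resets the timer.
        self.deadline_ms = now_ms + self.timeout_ms

    def expired(self, now_ms):
        return now_ms >= self.deadline_ms


timer = ElectionTimer(timeout_ms=200)
for now in (50, 100, 150):        # heartbeats arrive every 50 ms
    timer.on_heartbeat(now)

still_following = not timer.expired(300)   # deadline is 150 + 200 = 350
leader_presumed_dead = timer.expired(350)  # no heartbeat since 150 ms
print(still_following, leader_presumed_dead)  # prints "True True"
```

For this to work, the heartbeat interval has to be comfortably shorter than the election timeout, as it is in the example (50 ms against 200 ms).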

Understanding the flow of terms is key. Every election starts a new term, and term numbers only ever increase. When a server receives an RPC carrying a higher term than its own, it adopts that term and immediately reverts to follower; when it receives a request carrying a lower term, it rejects it. This ensures that a stale leader or candidate from an earlier term steps down as soon as it contacts a server that knows about a newer term.
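
The term rule applies to every incoming RPC, so it fits in one shared helper. `observe_term` and its return labels are invented for this sketch.

```python
def observe_term(state, rpc_term):
    """Apply the term rule to an incoming RPC. Mutates `state` in place."""
    if rpc_term > state["term"]:
        # Someone is ahead of us: adopt the term and step down.
        state.update(term=rpc_term, role="Follower", voted_for=None)
        return "stepped_down"
    if rpc_term < state["term"]:
        return "stale"      # the sender is behind; its request is rejected
    return "current"


# An old leader from term 1 hears from a server that is already in term 2.
old_leader = {"term": 1, "role": "Leader", "voted_for": "node1"}
print(observe_term(old_leader, 2))  # prints "stepped_down"
print(old_leader["role"])           # prints "Follower"
print(observe_term(old_leader, 1))  # prints "stale"
```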

The next concept to explore is how Raft handles log compaction and snapshotting to manage the ever-growing log size.
