The most surprising thing about leader election is that it’s not about picking the "best" node, but the node that survives longest in a chaotic network.

Let’s watch Bully in action. Imagine three nodes: Node A (ID 10), Node B (ID 20), and Node C (ID 30). Node A is currently the leader.

Suddenly, Node A crashes.

  • Node B (20) detects A is gone: B sends "election" messages to all nodes with higher IDs. In this case, it sends to Node C (30).
  • Node C (30) receives B’s election message: C realizes it has a higher ID than B. C sends a "you lose" message to B. B, being "bullied," gives up the election.
  • Node C (30) is the only one left with a higher ID: C declares itself the new leader.

Now, consider Raft. Raft builds on the idea that a distributed system can agree on a log of operations, and the leader is simply the node that’s currently appending to that log.

Here’s a simplified Raft cluster with three nodes: Leader (L), Follower1 (F1), Follower2 (F2). All are running on distinct servers.

Initial State:

  • L is leader, term 5.
  • F1 and F2 are followers, term 5.
  • All nodes have an empty log.

Scenario: Leader Crash L crashes.

  1. Timeout: F1 and F2, not hearing from L, time out. Their election timeouts are randomized, say F1 times out at 150ms, F2 at 180ms.
  2. F1 becomes Candidate: F1 increments its term to 6, votes for itself, and sends RequestVote RPCs to F2 and the (now dead) L.
    • RequestVote(term=6, candidateId=F1, lastLogIndex=0, lastLogTerm=0)
  3. F2 receives RequestVote from F1: F2 sees that F1’s term (6) is greater than its own (5). F2 grants the vote to F1 and resets its election timer.
  4. F1 becomes Leader: F1 receives the vote from F2 and realizes no other node with a higher term has sent an RequestVote RPC. F1 becomes the new leader for term 6.
  5. F1 sends AppendEntries: F1 starts sending AppendEntries RPCs (initially empty heartbeats) to F2 to establish its leadership and reset F2’s election timer.
    • AppendEntries(term=6, leaderId=F1, prevLogIndex=0, prevLogTerm=0, entries=[], commitIndex=0)
  6. F2 receives AppendEntries: F2 sees that F1’s term (6) is greater than its own (5). F2 acknowledges F1 as leader, updates its term to 6, and resets its election timer.

The cluster has a new leader.

ZooKeeper’s approach to leader election is embedded within its ensemble management. When a ZooKeeper ensemble starts, or when a leader fails, nodes initiate a consensus protocol to elect a new leader. This protocol is called Zab (ZooKeeper Atomic Broadcast).

Consider a 5-node ZooKeeper ensemble: Z1, Z2, Z3, Z4, Z5.

Initial Bootstrapping:

  • Each node starts as a LOOKING state.
  • They broadcast messages declaring their state and a myid (their unique identifier) and current_zxid (the last transaction ID they know about).
  • Nodes then enter a voting phase. Each node sends a vote to every other node. A vote consists of the sender’s myid, current_zxid, and the election_tick (a counter that increases with each election round).
  • A node accepts a vote if:
    • The sender’s election_tick is greater than its own.
    • OR, the election_tick is the same, and the sender’s myid is greater than its own.
  • The node with the highest election_tick wins. If there’s a tie in election_tick, the node with the highest myid wins.
  • Once a node has received votes from a majority of the ensemble for the same candidate, it transitions to the LEADING (if it’s the elected leader) or FOLLOWING (if it voted for someone else) state.

Key Zab Mechanics:

  • Election Tick: This is a crucial element for breaking ties and ensuring progress. Each time a node doesn’t receive a quorum of votes for its current election round, it increments its election_tick and broadcasts a new vote. This guarantees that eventually, one node will have the highest tick, or a tie will be broken by myid.
  • ZXID Comparison: When election_tick values are equal, the node with the higher current_zxid is preferred. This ensures that the elected leader is the one that has processed the most recent transactions, minimizing potential data loss.

The most counterintuitive aspect of Zab is how it prioritizes progress over strict ordering in the initial election phase. While the current_zxid comparison is vital for data consistency after a leader is established, the election_tick mechanism is designed purely to break deadlocks and ensure an election concludes, even if the initial candidates aren’t necessarily the most up-to-date. This is a pragmatic trade-off: getting a leader quickly, even if it needs to catch up, is often better than getting stuck in an endless election.

Once a leader is elected, it immediately starts sending ACK messages to followers to confirm its leadership and initiate the atomic broadcast of pending transactions. Followers then transition to the FOLLOWING state, acknowledging the new leader.

The next problem you’ll encounter is how this leader then maintains consistency across the followers, a process known as consensus or distributed logging.

Want structured learning?

Take the full Distributed Systems course →