The most surprising thing about etcd’s leader election is that it’s not a competition, but a coordinated dance where a candidate volunteers and a majority of the cluster confirms.
Let’s watch it in action. Imagine a fresh etcd cluster with three members: etcd-1, etcd-2, and etcd-3. Initially, no one is in charge. They all start listening on their peer ports (2380) and client ports (2379).
# On etcd-1
ETCDCTL_API=3 etcdctl endpoint health --endpoints=http://etcd-1:2379,http://etcd-2:2379,http://etcd-3:2379
# Expected output:
# http://etcd-1:2379 is unhealthy
# http://etcd-2:2379 is unhealthy
# http://etcd-3:2379 is unhealthy
They’re unhealthy because there’s no leader to process client requests. Now, one of them, say etcd-1, is the first to have its election timeout expire. It doesn’t just declare itself leader. Instead, it becomes a candidate, moves to a new term, votes for itself, and sends a RequestVote RPC to etcd-2 and etcd-3. This vote request includes its new term number (1 in this walkthrough) and its candidate ID (etcd-1).
etcd-1 (candidate, term 1) -> etcd-2: RequestVote(term=1, candidateId=etcd-1)
etcd-1 (candidate, term 1) -> etcd-3: RequestVote(term=1, candidateId=etcd-1)
etcd-2 and etcd-3, upon receiving this, check if they’ve already voted in term 1. If not, and if etcd-1’s log is at least as up-to-date as theirs (a Raft detail for consistency), they vote "yes" and reply with a RequestVoteResponse. They remain in the "follower" state, adopt term 1, and record that their vote for this term went to etcd-1.
etcd-2 (follower, term 1) -> etcd-1: RequestVoteResponse(term=1, voteGranted=true)
etcd-3 (follower, term 1) -> etcd-1: RequestVoteResponse(term=1, voteGranted=true)
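The voting rule described above can be sketched in a few lines of Python. This is a minimal illustration, not etcd's actual Go implementation: the names `VoterState` and `handle_request_vote` are hypothetical, and the `last_log_index` / `last_log_term` arguments come from the Raft paper's RequestVote definition rather than from the simplified trace shown here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class VoterState:
    """Hypothetical per-member state relevant to voting."""
    current_term: int = 0
    voted_for: Optional[str] = None
    last_log_index: int = 0
    last_log_term: int = 0


def handle_request_vote(state: VoterState, term: int, candidate_id: str,
                        last_log_index: int, last_log_term: int) -> Tuple[int, bool]:
    """Return (current_term, vote_granted) for an incoming RequestVote."""
    # A request from an older term is always rejected.
    if term < state.current_term:
        return state.current_term, False

    # A newer term resets the vote: one vote per term, not one vote forever.
    if term > state.current_term:
        state.current_term = term
        state.voted_for = None

    # "At least as up-to-date": compare last log term first, then last index.
    log_ok = (last_log_term, last_log_index) >= (state.last_log_term, state.last_log_index)

    if state.voted_for in (None, candidate_id) and log_ok:
        state.voted_for = candidate_id
        return state.current_term, True
    return state.current_term, False
```

With a fresh `VoterState()`, a term 1 request from etcd-1 is granted, and a second term 1 request from a different candidate is refused because the vote for that term is already spent.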
Once etcd-1 holds votes from a majority of the cluster (2 out of 3 here, counting its own vote, so a single granted vote is enough), it knows it’s the leader for term 1. It broadcasts an AppendEntries RPC (even if there’s no data yet) to all followers to announce its leadership.
etcd-1 (leader, term 1) -> etcd-2: AppendEntries(term=1, leaderId=etcd-1, prevLogIndex=0, prevLogTerm=0, entries=[], leaderCommit=0)
etcd-1 (leader, term 1) -> etcd-3: AppendEntries(term=1, leaderId=etcd-1, prevLogIndex=0, prevLogTerm=0, entries=[], leaderCommit=0)
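The majority rule is simple arithmetic, and a short sketch makes the cluster-size trade-off visible. The helper names `quorum` and `has_won` are hypothetical; the vote count is assumed to include the candidate's own vote.

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return cluster_size // 2 + 1


def has_won(votes_received: int, cluster_size: int) -> bool:
    """votes_received includes the candidate's vote for itself."""
    return votes_received >= quorum(cluster_size)


if __name__ == "__main__":
    for size in (1, 3, 4, 5):
        # Failures tolerated = members that can be lost while a majority remains.
        print(size, quorum(size), size - quorum(size))
```

Note that a four-member cluster needs three votes and tolerates one failure, the same as a three-member cluster, which is why odd cluster sizes are the usual recommendation.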
etcd-2 and etcd-3 receive this, acknowledge it, and remain in the "follower" state, now recognizing etcd-1 as the leader for term 1. Now, the cluster is healthy and has a leader.
# On etcd-1
ETCDCTL_API=3 etcdctl endpoint health --endpoints=http://etcd-1:2379,http://etcd-2:2379,http://etcd-3:2379
# Expected output:
# http://etcd-1:2379 is healthy
# http://etcd-2:2379 is healthy
# http://etcd-3:2379 is healthy
The core problem this solves is distributed consensus: agreeing on a single source of truth in an unreliable, distributed system where nodes can fail and messages can be lost. Raft achieves this by enforcing a strict protocol for leader election and log replication.
Internally, each etcd member maintains its state (follower, candidate, leader) and its Raft log. When a follower doesn’t hear from the leader within a randomized election timeout, it assumes the leader has failed and becomes a candidate. It increments its term, votes for itself, and starts requesting votes from others. This randomized timeout is crucial; it makes "split votes", where multiple candidates start elections simultaneously and none gains a majority, unlikely and quick to resolve when they do happen.
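To see why randomization helps, here is a small sketch that draws an independent timeout for each member and picks whoever fires first. The function names are hypothetical, and the range of one to two times a 1000 ms base is an assumption chosen for illustration, not a statement of etcd's exact internal behavior.

```python
import random
from typing import Dict, Iterable, Optional


def draw_timeouts(members: Iterable[str], base_ms: float = 1000.0,
                  seed: Optional[int] = None) -> Dict[str, float]:
    """Give each member an independent timeout in [base_ms, 2 * base_ms]."""
    rng = random.Random(seed)
    return {m: rng.uniform(base_ms, 2 * base_ms) for m in members}


def first_candidate(timeouts: Dict[str, float]) -> str:
    """The member whose timeout expires first starts the election."""
    return min(timeouts, key=timeouts.get)


if __name__ == "__main__":
    timeouts = draw_timeouts(["etcd-1", "etcd-2", "etcd-3"])
    print(timeouts, "->", first_candidate(timeouts))
```

Because the draws are spread over a full second, the first candidate usually has time to collect its votes before anyone else's timer expires.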
The leader’s primary job is to replicate log entries to followers. When a client sends a request to the leader, the leader appends it to its own log, then sends AppendEntries RPCs to followers. Once an entry is replicated on a majority of servers, the leader commits it and applies it to its state machine. It then informs followers about the committed entry in subsequent AppendEntries RPCs. Followers apply committed entries to their state machines.
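The "replicated on a majority" check can be sketched as picking the highest log index that a majority of members already hold. The name `match_index` is borrowed from the Raft paper's `matchIndex`; the function itself is a hypothetical simplification and omits Raft's extra rule that a leader only commits by counting replicas for entries from its own term.

```python
from typing import Dict


def commit_index(match_index: Dict[str, int], leader_last_index: int) -> int:
    """Highest log index stored on a majority of the cluster.

    match_index maps each follower to the highest index known to be
    replicated on it; the leader's own log counts as one more replica.
    """
    indexes = sorted(list(match_index.values()) + [leader_last_index], reverse=True)
    # In descending order, position n // 2 is held by n // 2 + 1 members.
    return indexes[len(indexes) // 2]
```

For example, if the leader is at index 7, etcd-2 has acknowledged index 5, and etcd-3 only index 3, then `commit_index({"etcd-2": 5, "etcd-3": 3}, 7)` is 5: two of three members hold everything up to index 5.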
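A key detail often overlooked is what happens when two members campaign at once. Suppose etcd-2’s election timeout fires before etcd-1 has gathered its majority, so both are candidates in term 1 and both send RequestVote RPCs. Each member grants at most one vote per term, so etcd-3 votes for whichever request it processes first and rejects the other. That rule is what ensures at most one leader can be elected per term. If no candidate reaches a majority, the candidates time out again, increment their terms, and retry. Terms also settle conflicts across rounds: if etcd-2 is campaigning in term 2 while etcd-1 is still campaigning in term 1, etcd-3 adopts term 2 when it sees etcd-2’s request and can grant its term 2 vote to etcd-2, even if it already voted for etcd-1 in term 1. Any request carrying a lower term is rejected, and etcd-1 steps back to follower as soon as it sees the higher term.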
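The "higher term wins" rule applies to every message a member receives, not only to vote requests. A minimal sketch, using a hypothetical `observe_term` helper and plain strings for roles:

```python
from typing import Tuple


def observe_term(role: str, current_term: int, message_term: int) -> Tuple[str, int]:
    """Apply the term rule to an incoming message.

    A message from a higher term means this member is out of date:
    it adopts the new term and reverts to follower, whatever its role.
    Messages from the same or a lower term leave role and term unchanged
    (stale messages are then rejected by the individual RPC handlers).
    """
    if message_term > current_term:
        return "follower", message_term
    return role, current_term
```

So a candidate in term 1 that receives any term 2 message becomes `("follower", 2)`, which is how a stale candidate or a partitioned old leader gets out of the way.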
The next concept to explore is how etcd handles data persistence and recovery using its write-ahead log (WAL) and its bbolt storage backend.