etcd’s Raft consensus algorithm is the unsung hero that prevents your distributed system from devolving into a chaotic, inconsistent mess.

Imagine a distributed database where multiple copies of the same data exist. How do you ensure that every copy is identical, especially when writes are happening concurrently and network issues can cause some nodes to temporarily go offline? That’s where Raft comes in. It’s a protocol that allows a cluster of servers to agree on a single, consistent state, even in the face of failures.

Let’s see etcd in action. Suppose we have a simple key-value store managed by etcd. We’ll use etcdctl to interact with it.

# Start a single etcd node for demonstration
ETCDCTL_API=3 etcdctl --endpoints=http://localhost:2379 put mykey myvalue

This command writes myvalue to mykey. In a clustered etcd setup, this write operation would be proposed to the Raft leader. The leader would then propose this change to its followers. Once a majority of nodes acknowledge the change, it’s considered committed and applied to each node’s local state machine.

# Read the value back
ETCDCTL_API=3 etcdctl --endpoints=http://localhost:2379 get mykey

This get operation will read from the local, consistent state on the node you query. The magic of Raft ensures that even if you hit a different node in a cluster, you’d get the same myvalue because all nodes have agreed on this state.

At its core, Raft works by electing a leader. This leader is responsible for managing the cluster’s state. All write operations must go through the leader. The leader then replicates these operations, as a sequence of log entries, to the other nodes (followers).

Here’s a simplified view of the Raft log:

Node A (Leader): [entry1, entry2, entry3]

Node B (Follower): [entry1, entry2]

Node C (Follower): [entry1, entry2]

When the leader wants to commit entry3, it sends entry3 to Node B and Node C. If a majority (in this case, Node A and one other) acknowledge receiving entry3, it’s considered committed. The leader then applies entry3 to its state machine and tells the followers to do the same. This commitment mechanism is what guarantees that once an entry is committed, it will never be lost, even if some nodes crash and restart.

The leader election process is also crucial. If the current leader fails, the followers will time out and initiate a new election. During an election, nodes vote for a candidate. The candidate that receives votes from a majority of the nodes becomes the new leader. This ensures that the cluster always has a leader to process writes, as long as a majority of nodes are healthy and can communicate.

The actual configuration of etcd nodes, including their peer URLs (--initial-advertise-peer-urls) and client URLs (--listen-client-urls), dictates how they discover and communicate with each other to maintain this consensus. For example, in a cluster, each node would be configured with its peer URL so others know where to send Raft messages.

# Example etcd startup command for a cluster node
etcd \
  --name node1 \
  --data-dir /var/lib/etcd \
  --listen-client-urls http://127.0.0.1:2379 \
  --advertise-client-urls http://127.0.0.1:2379 \
  --listen-peer-urls http://127.0.0.1:2380 \
  --initial-advertise-peer-urls http://127.0.0.1:2380 \
  --initial-cluster etcd=http://127.0.0.1:2380 \
  --initial-cluster-state new

In this configuration, listen-peer-urls is what other etcd nodes use to talk to node1 for Raft operations, and initial-advertise-peer-urls is what node1 tells other nodes to use when reaching out to it. initial-cluster bootstraps the cluster by listing all nodes and their peer URLs.

One subtle but critical aspect of Raft is how it handles log reconciliation. When a follower falls behind or rejoins the cluster, it doesn’t just blindly accept the leader’s log. The leader actually sends log entries with an index and term. A term is essentially a period of time where a leader is active. If a follower receives an entry with a term it doesn’t expect or an index that doesn’t match the previous entry in its log, it will reject the entry. The leader then has to backtrack, sending older entries until it finds a point where the follower’s log is consistent with its own. This meticulous log matching ensures that even after network partitions or node restarts, the cluster can reliably resynchronize.

The next challenge you’ll likely face is understanding how etcd’s watch mechanism leverages this consistent state for real-time updates.

Want structured learning?

Take the full Etcd course →