Recover an etcd Cluster After a Network Partition (2026)

An etcd cluster keeps serving requests during a network partition as long as a quorum of nodes can still communicate, and the isolated nodes catch up automatically once the partition heals.

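Let’s see etcd in action. Imagine a 3-node etcd cluster. Each node has a unique ID, a name, and an IP address. The configuration below is for etcd-node-1; the other two nodes use the same layout with their own name and addresses.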

# etcd configuration example
name: etcd-node-1
listen-client-urls: http://10.0.0.1:2379
advertise-client-urls: http://10.0.0.1:2379
listen-peer-urls: http://10.0.0.1:2380
initial-advertise-peer-urls: http://10.0.0.1:2380
initial-cluster: etcd-node-1=http://10.0.0.1:2380,etcd-node-2=http://10.0.0.2:2380,etcd-node-3=http://10.0.0.3:2380
initial-cluster-state: new
data-dir: /var/lib/etcd

In a healthy cluster, each node accepts peer traffic on its listen-peer-urls and tells the other members to reach it at its initial-advertise-peer-urls. The nodes use the Raft consensus algorithm to agree on the order of operations. A quorum is a majority of members; for a 3-node cluster, that’s 2 nodes. As long as a quorum can communicate, the cluster can elect a leader and commit writes.
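
The quorum arithmetic generalizes to any cluster size: a quorum is floor(n/2) + 1 members, and the cluster tolerates n minus quorum unreachable members. Here is a minimal shell sketch of that calculation. The function names quorum_size and fault_tolerance are made up for illustration and are not part of etcd or etcdctl.

```shell
# Illustrative helpers; these names are not part of etcd or etcdctl.

# quorum_size N: smallest majority of an N-member cluster.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# fault_tolerance N: members that can be lost while a quorum remains.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 3 5; do
  echo "members=$n quorum=$(quorum_size "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

For the 3-node cluster used here, this reports a quorum of 2 and one tolerated failure, which is why isolating a single node leaves the other two able to make progress.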

Now, let’s simulate a network partition. We’ll block traffic between node 1 and nodes 2/3 using iptables.

On etcd-node-1:

iptables -A INPUT -s 10.0.0.2 -j DROP
iptables -A INPUT -s 10.0.0.3 -j DROP
iptables -A OUTPUT -d 10.0.0.2 -j DROP
iptables -A OUTPUT -d 10.0.0.3 -j DROP

On etcd-node-2 and etcd-node-3:

iptables -A INPUT -s 10.0.0.1 -j DROP
iptables -A OUTPUT -d 10.0.0.1 -j DROP

After this, etcd-node-1 cannot reach etcd-node-2 or etcd-node-3. etcd-node-2 and etcd-node-3 can still reach each other.
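
Before looking at etcd itself, it can help to confirm that the partition is really in effect at the network level. The sketch below probes each member's peer port (2380) from the current node. peer_reachable is a hypothetical helper, not an etcd tool; it assumes bash (for its /dev/tcp pseudo-device) and the coreutils timeout command are installed.

```shell
# peer_reachable HOST PORT [TIMEOUT_SECONDS]
# Illustrative helper (not an etcd tool): succeeds only if a TCP connection
# to HOST:PORT can be opened within the timeout (default: 1 second).
# Assumes bash and the coreutils `timeout` command are available.
peer_reachable() {
  timeout "${3:-1}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Probe the peer port of every member from the current node.
for peer in 10.0.0.1 10.0.0.2 10.0.0.3; do
  if peer_reachable "$peer" 2380; then
    echo "$peer:2380 reachable"
  else
    echo "$peer:2380 unreachable"
  fi
done
```

With the rules above in place, etcd-node-1 should see only its own address as reachable, while etcd-node-2 and etcd-node-3 should see each other but not 10.0.0.1.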

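What happens next depends on which node was the leader. If etcd-node-1 was the leader, it stops receiving heartbeat responses from a quorum and steps down. etcd-node-2 and etcd-node-3 stop hearing from it, and after the election timeout whichever of them times out first starts an election and wins with two of three votes. If etcd-node-1 was a follower, the existing leader keeps its quorum and nothing changes on the majority side. Either way, etcd-node-2 and etcd-node-3 continue serving read and write requests, while etcd-node-1 cannot commit writes or serve linearizable reads. You can verify this by checking the cluster health: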

On etcd-node-1:

ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 endpoint health
# Should report the local endpoint as unhealthy, typically after a timeout.

ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 member list
# May still list all three members from the node's local state, or time out. Either way, member list does not report a leader.

On etcd-node-2 (or etcd-node-3):

ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 endpoint health
# Should report the local endpoint as healthy. Only the endpoints passed via --endpoints are checked.

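ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 member list
# Should list all three members; etcd-node-1 is still a member, just unreachable.

ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 endpoint status -w table
# The IS LEADER column shows whether this node is the current leader.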

To recover, we need to remove the iptables rules.

On etcd-node-1:

iptables -D INPUT -s 10.0.0.2 -j DROP
iptables -D INPUT -s 10.0.0.3 -j DROP
iptables -D OUTPUT -d 10.0.0.2 -j DROP
iptables -D OUTPUT -d 10.0.0.3 -j DROP

On etcd-node-2 and etcd-node-3:

iptables -D INPUT -s 10.0.0.1 -j DROP
iptables -D OUTPUT -d 10.0.0.1 -j DROP
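
Each -D command has to match the rule that was added with -A exactly, so it is easy to leave a stray DROP rule behind. One way to keep setup and teardown in sync is to generate both from the same list of peers. partition_rules below is a hypothetical helper for that; it only prints the commands, so they can be reviewed before being run as root.

```shell
# partition_rules ACTION PEER...
# Illustrative helper: prints (does not run) the iptables commands that add
# (ACTION=A) or delete (ACTION=D) the DROP rules for each peer address.
partition_rules() {
  action=$1
  shift
  for peer in "$@"; do
    echo "iptables -$action INPUT -s $peer -j DROP"
    echo "iptables -$action OUTPUT -d $peer -j DROP"
  done
}

# On etcd-node-1: print the teardown commands for both peers.
partition_rules D 10.0.0.2 10.0.0.3
```

After the printed commands have been run, iptables -S INPUT and iptables -S OUTPUT should no longer list any of the DROP rules.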

Once the network is restored, etcd-node-1 rejoins the cluster without any manual steps. The current leader’s heartbeats reach it again, and the leader detects that etcd-node-1’s log is behind. The leader then resends the missing entries, or sends a full snapshot if those entries have already been compacted away. Raft’s log replication ensures that etcd-node-1 eventually becomes consistent with the rest of the cluster.
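
Reintegration is not instantaneous, so a script that removes the rules may want to wait until the node is usable again. The sketch below polls the health check used earlier. wait_healthy is a hypothetical helper, and it assumes that etcdctl endpoint health exits with a non-zero status while the endpoint is unhealthy.

```shell
# wait_healthy ENDPOINT [ATTEMPTS] [DELAY_SECONDS]
# Illustrative helper: polls `etcdctl endpoint health` until it succeeds or
# the attempts run out. Assumes etcdctl exits non-zero for an unhealthy
# endpoint.
wait_healthy() {
  endpoint=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if ETCDCTL_API=3 etcdctl --endpoints="$endpoint" endpoint health >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "still unhealthy after $attempts attempt(s)"
  return 1
}

# Usage on etcd-node-1 after removing the iptables rules:
#   wait_healthy http://127.0.0.1:2379
```

A healthy result only means the node is back in contact with a quorum. To see whether it has fully caught up, compare the RAFT INDEX column across members with etcdctl endpoint status --cluster -w table.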

The Raft protocol is designed for this case. A leader must replicate each entry to a majority of members before committing it, and a candidate needs votes from a majority to win an election. A node on the minority side of a partition can do neither, so it cannot become leader or commit new entries, and that is what prevents split-brain. When the node rejoins, it acts as a follower and accepts log entries from the current leader. With pre-vote enabled (the default since etcd 3.5), its failed election attempts during the partition also don’t raise its term, so its return doesn’t force the healthy leader to step down.

The most surprising thing about etcd’s recovery after a partition is that the isolated node doesn’t need to be explicitly "re-added" or reconfigured. It remained a cluster member the whole time, so once connectivity returns it simply resumes its role as a follower and catches up on the log. Raft handles the state synchronization transparently.

The next problem you’ll likely encounter is dealing with stale client connections that were established to the isolated node before the partition and kept pointing at it while it had no quorum.