Adding more nodes to etcd doesn’t always make it faster or more reliable; it’s a trade-off between write throughput and fault tolerance, and the sweet spot depends on your workload.

Let’s see what etcd actually does when you write data to it. Imagine you have a 3-node cluster. When a client writes a key-value pair, the request hits one of the etcd nodes, let’s call it Node A. Node A is not the leader. It forwards the request to the current leader, Node B. Node B then proposes this write to all other nodes in the cluster. For the write to be committed, a majority of nodes (in a 3-node cluster, that’s 2 nodes) must acknowledge they’ve received and applied the write. Once Node B sees it has a majority, it acknowledges the write back to Node A, which then acknowledges it back to the client. This consensus mechanism, Raft, ensures that all committed writes are ordered consistently across all nodes.

The problem etcd solves is maintaining a consistent, distributed log of critical state information for systems like Kubernetes. It needs to be highly available and guarantee that every node in the cluster sees the same sequence of operations. This is crucial because if nodes in your distributed system (like Kubernetes control plane components) have divergent views of the cluster state, chaos ensues. etcd’s Raft consensus algorithm ensures this consistency, but it comes at the cost of latency for writes, as they require network round trips and agreement from a majority of nodes.

Here’s a typical etcd write flow in a 5-node cluster:

  1. Client Request: A client (e.g., a Kubernetes API server) sends a write request to any etcd node. Let’s say it hits Node 1.
  2. Leader Election/Forwarding: If Node 1 is not the leader, it forwards the request to the current leader. Let’s assume Node 3 is the leader.
  3. Proposal: Node 3 (the leader) creates a log entry for the write and sends it as a "AppendEntries" RPC to all other nodes (Node 1, Node 2, Node 4, Node 5).
  4. Replication: Each follower node (1, 2, 4, 5) receives the log entry. They append it to their own log and send an acknowledgment back to the leader.
  5. Commit: The leader (Node 3) waits for acknowledgments from a majority of nodes. In a 5-node cluster, a majority is 3 nodes (the leader itself plus two followers). Once it receives acknowledgments from Node 1 and Node 4, it knows the entry is committed.
  6. Apply & Respond: The leader applies the committed entry to its state machine (its key-value store) and sends a success response back to the original requester (Node 1), which then responds to the client.

The critical factor here is the "majority" rule. For a cluster of N nodes, you need (N/2) + 1 nodes to agree for a write to be committed. This means:

  • 3 nodes: Majority is 2. Can tolerate 1 node failure.
  • 5 nodes: Majority is 3. Can tolerate 2 node failures.
  • 7 nodes: Majority is 4. Can tolerate 3 node failures.

So, as you increase nodes, you increase fault tolerance. But what about performance? Each write involves network round trips between nodes. More nodes mean more network hops and more nodes to reach consensus with. This increases latency. Specifically, the write latency is dominated by the time it takes for the leader to receive acknowledgments from a majority. In a 3-node cluster, this is typically the leader + 1 follower. In a 7-node cluster, it’s the leader + 3 followers. The extra network hops for those additional acknowledgments add latency.

Consider the "write throughput" curve:

  • 3 Nodes: Offers decent fault tolerance (1 failure) and good write performance because only one other node needs to acknowledge the write. This is often sufficient for many Kubernetes clusters.
  • 5 Nodes: Increases fault tolerance (2 failures) at the cost of slightly higher write latency. The performance hit is usually manageable and provides a good balance for production environments where higher availability is paramount.
  • 7 Nodes: Maximizes fault tolerance (3 failures) but comes with the highest write latency. This configuration is typically reserved for scenarios where extreme availability is a non-negotiable requirement and the performance impact is acceptable or mitigated by other factors (e.g., very fast network, low write volume).

The decision often boils down to a simple calculation: how many nodes can you afford to lose simultaneously without impacting your system’s availability? For most Kubernetes clusters, losing one control plane node is acceptable, but losing two or more is not. This makes 3 or 5 nodes the most common choices.

The "write throughput" performance of an etcd cluster is not directly proportional to the number of nodes. In fact, adding more nodes beyond what’s necessary for fault tolerance can decrease write throughput due to increased consensus overhead. The optimal number of nodes is determined by the desired fault tolerance (how many nodes can fail) and the acceptable write latency for your specific application.

The next common problem you’ll encounter is optimizing etcd’s disk I/O performance, as it’s a major bottleneck for write-heavy workloads.

Want structured learning?

Take the full Etcd course →