etcd, the distributed key-value store that Kubernetes relies on for cluster state, is not just sensitive to storage latency: its write path assumes that every log write reaches stable storage quickly, because nothing is committed until it does.
Consider a typical etcd write operation. When a Kubernetes component, say the API server, wants to update a resource (like a Pod definition), it sends a request to etcd. etcd then has to:
- Persist the change to its Raft log: This is a crucial step for durability and consensus. The change isn’t considered committed until it’s written to stable storage by a majority of etcd nodes.
- Apply the change to its key-value store: After the Raft log entry is persisted, etcd applies the actual data modification.
- Respond to the client: Only then can etcd acknowledge the write to the API server.
Each of these steps, especially the Raft log write, must be fast. If the disk write for the Raft log takes too long, the etcd leader will be slow to commit entries, which in turn slows down the entire cluster. Other components waiting for etcd’s confirmation will time out, leading to cascading failures.
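The ordering of those three steps can be illustrated with a toy single-node store. This is only a sketch of the log-then-apply-then-acknowledge pattern, not etcd's implementation: the `TinyStore` class, its JSON record format, and the file name are all hypothetical, and there is no replication here.

```python
import json
import os
import tempfile
import time


class TinyStore:
    """Toy single-node store: log first, then apply, then acknowledge."""

    def __init__(self, wal_path):
        # O_APPEND keeps every record at the end of the log file.
        self.wal_fd = os.open(wal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.data = {}

    def put(self, key, value):
        record = (json.dumps({"key": key, "value": value}) + "\n").encode()
        start = time.perf_counter()
        os.write(self.wal_fd, record)  # 1. append the change to the log
        os.fsync(self.wal_fd)          #    and force it to stable storage
        persist_ms = (time.perf_counter() - start) * 1000
        self.data[key] = value         # 2. apply it to the key-value state
        return persist_ms              # 3. only now acknowledge the caller

    def close(self):
        os.close(self.wal_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = TinyStore(os.path.join(tmp, "wal.log"))
        latencies = [store.put(f"pod-{i}", "Running") for i in range(100)]
        store.close()
        print(f"slowest write+fsync: {max(latencies):.3f} ms")
```

Because `put` cannot return before `os.fsync` does, the caller's latency can never be lower than the disk's sync latency, which is the point the paragraph above makes.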
The Raft Protocol’s Latency Demands
etcd uses the Raft consensus algorithm to ensure that all nodes agree on the state of the cluster. Raft relies on a leader-election mechanism and a commit process in which log entries must be replicated to a majority of nodes and persisted to stable storage before they are considered committed. etcd keeps these entries in its write-ahead log (WAL), and the WAL is the linchpin.
When etcd receives a write request, it appends an entry to its WAL file on disk and then forces it to stable storage with an fsync-style call (fdatasync on Linux). Only after the data has been flushed to the physical device is the entry considered durable. The leader then waits until a quorum of nodes, itself included, has persisted the entry. If the WAL write is slow, the leader becomes a bottleneck: it cannot commit the entry and therefore cannot respond to the API server, which may then retry or return an error.
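A rough way to reason about this is to model the commit time of a single entry. The sketch below follows the simplified picture described above and assumes that the leader's own fsync must finish and that it additionally needs acknowledgements from enough followers to form a majority. The function name and the example numbers are illustrative, not taken from etcd.

```python
def commit_latency_ms(leader_fsync_ms, follower_ack_ms):
    """Estimate when one entry becomes committed.

    leader_fsync_ms: time for the leader to fsync the entry to its WAL.
    follower_ack_ms: for each follower, time until its acknowledgement
        reaches the leader (network round trip plus that follower's fsync).
    """
    cluster_size = 1 + len(follower_ack_ms)
    quorum = cluster_size // 2 + 1
    acks_needed = quorum - 1  # the leader counts as one member of the quorum
    if acks_needed == 0:
        return leader_fsync_ms
    # The entry commits once the slowest of the required acks has arrived.
    slowest_required_ack = sorted(follower_ack_ms)[acks_needed - 1]
    return max(leader_fsync_ms, slowest_required_ack)


if __name__ == "__main__":
    # Three-node cluster, all disks fast.
    print(commit_latency_ms(0.2, [0.7, 0.9]))
    # Same cluster, but the leader's disk is slow.
    print(commit_latency_ms(12.0, [0.7, 0.9]))
    # One slow follower is tolerated because only a majority is required.
    print(commit_latency_ms(0.2, [0.7, 15.0]))
```

Under these assumptions a single slow follower is absorbed by the quorum rule, but a slow disk on the leader shows up directly in every commit.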
Why Not Just Any Disk?
- Mechanical HDDs: Traditional Hard Disk Drives (HDDs) have seek times measured in milliseconds (e.g., 5-15 ms) due to their spinning platters and moving read/write heads. For etcd, this is an eternity. A single fsync operation might take longer than the entire round trip for a network request. This latency means etcd spends most of its time waiting for the disk, drastically reducing throughput and increasing the likelihood of timeouts.
- Consumer-Grade SSDs: While much faster than HDDs, consumer-grade SSDs often have inconsistent write latency, especially under sustained load. Their wear-leveling algorithms and garbage collection can introduce unpredictable "stutters." etcd needs predictable, low latency, not just peak performance. A sudden spike in latency during a critical Raft commit can be as detrimental as a consistently slow disk.
- Networked Storage (NAS/SAN): Networked storage adds another layer of latency. The network hops, protocol overhead (NFS, iSCSI), and potential contention on the storage network mean that even if the underlying storage is fast, the path to it is not. etcd’s WAL writes are sensitive to the end-to-end latency from the etcd process to the actual storage medium.
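A back-of-the-envelope calculation shows why these differences matter. If every sync has to finish before the next one starts, the sync latency puts a hard ceiling on how many sync rounds a single writer can complete per second. The latencies below are illustrative round numbers, not measurements, and the helper name is hypothetical.

```python
def max_serial_fsyncs_per_second(fsync_latency_us):
    """Upper bound on back-to-back fsyncs per second for a single writer."""
    if fsync_latency_us <= 0:
        raise ValueError("latency must be positive")
    return 1_000_000 // fsync_latency_us


# Illustrative latencies only; measure your own hardware before deciding.
EXAMPLE_LATENCIES_US = {
    "HDD (10 ms)": 10_000,
    "SATA SSD (500 us)": 500,
    "NVMe SSD (50 us)": 50,
}

if __name__ == "__main__":
    for name, latency_us in EXAMPLE_LATENCIES_US.items():
        rate = max_serial_fsyncs_per_second(latency_us)
        print(f"{name}: at most {rate} fsyncs/s")
```

This bounds sync rounds rather than client writes, since a store can group several entries into one sync, but it makes clear how quickly a 10 ms disk runs out of headroom.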
The Sweet Spot: Low-Latency NVMe SSDs
This is why local NVMe (Non-Volatile Memory Express) SSDs are the storage of choice for etcd.
- Direct PCIe Interface: NVMe drives attach to the CPU over PCIe lanes, bypassing the SATA/AHCI controller and shortening the I/O path.
- Optimized Protocol: The NVMe protocol is designed for flash memory, offering lower latency and higher IOPS (Input/Output Operations Per Second) compared to older protocols like AHCI.
- Consistent Performance: High-quality enterprise NVMe drives are built for sustained workloads and offer more predictable latency, crucial for Raft’s consensus mechanism.
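Diagnosis: To check your storage latency, you can use a tool like fio to approximate etcd’s WAL workload: small sequential writes, each followed by an fdatasync. Point --filename at the filesystem that holds (or will hold) the etcd data directory, and avoid running the test on a busy production member:
# Install fio
sudo apt-get update && sudo apt-get install -y fio # On Debian/Ubuntu
# OR
sudo yum install -y epel-release && sudo yum install -y fio # On RHEL/CentOS
# Write 4k blocks sequentially for 60 seconds, calling fdatasync after every write
sudo fio --name=etcd-write-test \
  --ioengine=sync \
  --rw=write \
  --bs=4k \
  --size=1G \
  --fdatasync=1 \
  --filename=/path/to/your/etcd/data/dir/fio-test.tmp \
  --runtime=60 \
  --time_based \
  --group_reporting
# Remove the test file afterwards
sudo rm /path/to/your/etcd/data/dir/fio-test.tmp
In the output, look at the fsync/fdatasync section and its sync percentiles, not only the clat (completion latency) and lat (total latency) figures, because the sync time is what gates a WAL append. etcd’s guidance is that the 99th percentile of WAL fsync duration should stay below 10 ms. A good local NVMe drive is usually far below that, ideally under 1 ms even at the 99.9th percentile. If the averages are in milliseconds, or the 99th percentile approaches or exceeds 10 ms, your storage is too slow.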
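If you would rather check the result in a script than read it by eye, fio can emit JSON. The sketch below assumes the job was run with --fdatasync=1 (so that fio records sync latencies) and with --output-format=json, and that the result has the layout recent fio 3.x releases use (jobs[i].sync.lat_ns.percentile). The function names and the sample values are hypothetical.

```python
import json


def load_fio_result(path):
    """Load a result file written with: fio ... --output-format=json --output=PATH"""
    with open(path) as f:
        return json.load(f)


def fdatasync_percentile_ms(fio_result, percentile="99.000000", job_index=0):
    """Return the requested sync-latency percentile in milliseconds."""
    sync_stats = fio_result["jobs"][job_index].get("sync", {})
    percentiles = sync_stats.get("lat_ns", {}).get("percentile", {})
    if percentile not in percentiles:
        raise KeyError(
            f"no sync percentile {percentile!r}; was fio run with --fdatasync=1?"
        )
    return percentiles[percentile] / 1_000_000  # nanoseconds to milliseconds


if __name__ == "__main__":
    # Made-up sample in the assumed layout, standing in for a real fio run.
    sample = {"jobs": [{"sync": {"lat_ns": {"percentile": {"99.000000": 2500000}}}}]}
    p99 = fdatasync_percentile_ms(sample)
    print(f"p99 fdatasync: {p99:.3f} ms")
    print("OK for etcd" if p99 < 10 else "too slow for etcd")
```

The 10 ms threshold in the last line mirrors etcd's published guidance for the 99th percentile of WAL fsync duration. Tighten it if you want to hold NVMe storage to a stricter standard.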
Fix: The most direct fix is to migrate your etcd data directory to a local, low-latency NVMe SSD.
- Stop etcd: sudo systemctl stop etcd (or your specific service manager command).
- Copy data: sudo cp -a /var/lib/etcd /mnt/nvme_ssd/etcd_data (replace paths as appropriate).
- Update etcd configuration: Edit your etcd configuration file (e.g., /etc/etcd/etcd.conf.yml) or command-line flags to point the data-dir to the new location: data-dir: /mnt/nvme_ssd/etcd_data
- Restart etcd: sudo systemctl start etcd.
- Verify: Check etcd logs for errors and monitor the etcd_disk_wal_fsync_duration_seconds metric.
This works because the new storage offers significantly lower and more consistent fsync latency, allowing etcd to keep pace with Raft’s demands. The Raft leader can now commit entries quickly, ensuring that the cluster state is updated promptly and reliably.
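To make the verification step concrete, the WAL fsync histogram that etcd exposes on its /metrics endpoint (etcd_disk_wal_fsync_duration_seconds) can be turned into a rough p99 figure. The sketch below parses the Prometheus text format and returns the upper bound of the first bucket that covers the requested quantile. It does no interpolation, so treat the result as a conservative estimate. The function name and the sample values are made up for illustration.

```python
import re

# Matches lines such as:
#   etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 995
BUCKET_RE = re.compile(
    r'^etcd_disk_wal_fsync_duration_seconds_bucket\{.*?le="([^"]+)".*?\}\s+(\S+)$'
)


def fsync_quantile_upper_bound(metrics_text, q=0.99):
    """Return the smallest bucket bound (seconds) covering quantile q."""
    buckets = []
    for line in metrics_text.splitlines():
        match = BUCKET_RE.match(line.strip())
        if match:
            buckets.append((float(match.group(1)), float(match.group(2))))
    if not buckets:
        raise ValueError("no WAL fsync histogram buckets found")
    buckets.sort()          # the "+Inf" bucket sorts last
    total = buckets[-1][1]  # cumulative count of all observations
    if total == 0:
        raise ValueError("histogram has no observations yet")
    for upper_bound, cumulative_count in buckets:
        if cumulative_count >= q * total:
            return upper_bound


if __name__ == "__main__":
    # Made-up sample standing in for the text scraped from /metrics.
    sample = "\n".join([
        'etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 900',
        'etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 980',
        'etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 995',
        'etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 1000',
        'etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 1000',
    ])
    print(f"p99 WAL fsync <= {fsync_quantile_upper_bound(sample)} s")
```

The counters are cumulative since the process started, so compare the output shortly after the restart with a later reading, or use your monitoring system's rate-based quantile, to see the effect of the migration.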
The Next Problem: Once etcd storage latency is resolved, the next bottleneck often surfaces in the API server itself. With etcd now completing operations faster, you may start seeing timeouts on API server requests that involve complex list operations or watches.