Prometheus doesn’t actually monitor etcd; it collects metrics from etcd that describe its health.
Let’s see etcd in action. Imagine a distributed key-value store where multiple nodes need to agree on the state of data. etcd uses a consensus algorithm (Raft) to achieve this. When etcd nodes communicate, they send heartbeats and proposals. Prometheus scrapes metrics exposed by etcd, giving us visibility into these internal operations.
Here’s a typical etcd configuration snippet. etcd serves Prometheus metrics at /metrics on its client URLs by default, so no dedicated option is needed to enable the endpoint:
listen-peer-urls: http://0.0.0.0:2380
listen-client-urls: http://0.0.0.0:2379
advertise-client-urls: http://<NODE_IP>:2379
initial-advertise-peer-urls: http://<NODE_IP>:2380
name: <NODE_NAME>
data-dir: /var/lib/etcd
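listen-peer-urls carries inter-node (Raft) traffic, while listen-client-urls serves client requests and, by default, the /metrics endpoint that Prometheus scrapes. advertise-client-urls tells clients and other members how to reach this node. Prometheus does not read it; it finds etcd through the targets or service discovery defined in its own scrape configuration.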
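If you would rather not scrape the client port, etcd also accepts a listen-metrics-urls option that serves /metrics and /health on additional addresses. A minimal sketch, where port 2381 is an arbitrary choice made for this example:

```yaml
# Added to the etcd configuration file shown above.
# Port 2381 is an assumption for illustration; any free port works.
listen-metrics-urls: http://0.0.0.0:2381
```

With this in place, Prometheus could target <NODE_IP>:2381 instead of the client port, which can be convenient when the client port is locked down with TLS client certificates.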
etcd health monitoring exists to catch problems early in distributed systems that rely on etcd (like Kubernetes), before they turn into unavailability or data loss. If etcd becomes unhealthy, the entire system can grind to a halt. Prometheus, with its pull-based model, periodically scrapes etcd’s metrics endpoint, typically /metrics on the client port 2379.
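On the Prometheus side, the pull model is expressed as a scrape job. The fragment below is a sketch under stated assumptions: the job name, the scrape interval, and the three target addresses are placeholders invented for this example, and it assumes plain HTTP on port 2379 as in the etcd configuration above.

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: etcd              # assumed job name, reused as a label on every sample
    scheme: http                # switch to https and add tls_config if etcd uses TLS
    metrics_path: /metrics
    scrape_interval: 15s        # illustrative value
    static_configs:
      - targets:                # hypothetical node addresses
          - 10.0.0.11:2379
          - 10.0.0.12:2379
          - 10.0.0.13:2379
```

Each target receives an instance label with its address, which is useful for grouping per-member results in alert expressions.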
Here are some key etcd metrics and what they tell us:
etcd_server_has_leader: A gauge reported by each member that’s 1 if the member currently sees a leader, and 0 otherwise. A cluster without a leader cannot process writes.
etcd_server_leader_changes_seen_total: A counter that increments every time a leader change occurs. Frequent leader changes indicate instability.
etcd_mvcc_db_total_size_in_bytes: The total size of the etcd database. Growing too large can impact performance.
etcd_network_peer_round_trip_time_seconds: A histogram of the round-trip time for messages between etcd peers. High latency can lead to timeouts and instability.
etcd_server_proposals_failed_total: A counter for failed proposals. Failed proposals mean consensus couldn’t be reached.
etcd_server_raft_log_commit_duration_seconds: Measures how long it takes to commit a Raft log entry. High values suggest Raft is struggling.
Let’s set up some Prometheus alerts based on these metrics.
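Alerting rules are list items under a rule group, in a file that prometheus.yml references through rule_files. The skeleton below shows that wrapper with one example rule. The file name, the group name, and the job="etcd" label are assumptions for this sketch. The example rule covers a gap that the metric-based alerts may leave open: when a member cannot be scraped at all, its metrics simply disappear rather than reporting an unhealthy value.

```yaml
# etcd.rules.yml (hypothetical file name), listed under rule_files in prometheus.yml
groups:
  - name: etcd
    rules:
      # Fires when Prometheus cannot scrape an etcd target.
      # Assumes the scrape job is named "etcd".
      - alert: EtcdTargetDown
        expr: up{job="etcd"} == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus cannot scrape etcd target {{ $labels.instance }}."
      # Further rules are added here as additional list items.
```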
Alert: High etcd Leader Changes
- alert: HighEtcdLeaderChanges
  expr: rate(etcd_server_leader_changes_seen_total[5m]) > 0.01
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Etcd cluster is experiencing frequent leader changes."
    description: "Leader changes have been occurring at more than 0.01 per second (roughly 3 per 5-minute window) for at least 10 minutes. This indicates instability and potential network issues or overloaded nodes."
This alert fires if the rate of leader changes, measured over a 5-minute window, stays above a small threshold for 10 minutes. Frequent leader changes are a strong indicator of cluster instability, often caused by network partitions or by overloaded etcd nodes that miss heartbeats and trigger new Raft leader elections.
Alert: Etcd Cluster Has No Leader
- alert: EtcdNoLeader
  expr: etcd_server_has_leader == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Etcd cluster has no leader."
    description: "The etcd cluster has not elected a leader for 2 minutes. This means no writes can be processed, and the cluster is effectively down."
This is a critical alert. If etcd_server_has_leader drops to 0 and stays there, it means the cluster cannot function. This usually points to a complete network failure between nodes, a majority of nodes being down, or severe resource contention preventing leader election.
Alert: High Raft Log Commit Duration
- alert: HighEtcdRaftCommitDuration
  # Grouped by the standard instance label so each member gets its own p99.
  expr: histogram_quantile(0.99, sum(rate(etcd_server_raft_log_commit_duration_seconds_bucket[5m])) by (le, instance)) > 1.0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Etcd Raft log commit duration is high."
    description: "The 99th percentile of etcd Raft log commit duration has exceeded 1 second for 15 minutes. This suggests that etcd is struggling to reach consensus, potentially due to high disk I/O latency or network congestion."
This alert uses a histogram metric to observe the tail latency of Raft log commits. If the 99th percentile exceeds one second, the slowest 1% of commits are taking longer than that, which points to the underlying storage or network as a bottleneck. etcd relies on timely disk writes to durably store Raft entries before they can be committed, so slow disks translate directly into slow commits.
Alert: Etcd Database Size Exceeding Threshold
- alert: EtcdDatabaseTooLarge
  expr: etcd_mvcc_db_total_size_in_bytes / (1024*1024*1024) > 6 # Example: 6 GiB, assuming an 8 GiB backend quota
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Etcd database size is approaching limit."
    description: "The etcd database size has exceeded 6 GiB for 30 minutes. A large database can lead to performance degradation and longer backup/restore times. Consider compacting old revisions and defragmenting etcd."
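This alert checks the raw size of the etcd database file. Set the threshold relative to your backend quota (--quota-backend-bytes), which defaults to 2 GiB and has a recommended maximum of 8 GiB. Compaction discards old revisions, but the file itself only shrinks after defragmentation, so if writes or retained history outpace this maintenance, the database keeps growing. This degrades read/write performance, and once the quota is reached etcd raises a NOSPACE alarm and rejects writes until space is reclaimed.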
Alert: High Peer Round Trip Time
- alert: HighEtcdPeerRTT
  # The metric is a histogram, so the average is computed as sum / count.
  expr: sum by (instance) (rate(etcd_network_peer_round_trip_time_seconds_sum[5m])) / sum by (instance) (rate(etcd_network_peer_round_trip_time_seconds_count[5m])) > 0.5 # Example: 500ms
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High etcd peer round trip time detected."
    description: "The average round trip time between etcd peers has exceeded 0.5 seconds for 5 minutes. This indicates network latency issues that can impact Raft consensus and overall cluster stability."
This alert monitors the network latency between etcd peers. High RTT means that heartbeats and Raft messages take longer to travel, increasing the likelihood of timeouts and leader oscillations. Raft’s quorum requirement rules out two leaders committing writes at once, so the practical risk is lost availability rather than split-brain.
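The metric list earlier also names etcd_server_proposals_failed_total, which none of the alerts above use. A rule in the same style might look like the sketch below. The alert name, the 15-minute window, and the threshold of 5 failures per second are illustrative assumptions that should be tuned against your cluster’s baseline.

```yaml
- alert: HighEtcdFailedProposals
  # Per-second rate of failed Raft proposals, averaged over 15 minutes.
  expr: rate(etcd_server_proposals_failed_total[15m]) > 5
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Etcd is failing a high number of proposals."
    description: "Member {{ $labels.instance }} has failed more than 5 proposals per second over the last 15 minutes. Proposals typically fail during leader elections or when quorum is lost."
```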
The most surprising thing about etcd’s internal workings is how sensitive it is to disk I/O latency. While network issues are often blamed, a slow disk can cripple an etcd cluster just as effectively. Raft requires that log entries be durably written to disk before they can be committed. If fsync() calls are taking hundreds of milliseconds or even seconds, the entire consensus process slows to a crawl, leading to leader elections, proposal timeouts, and eventual cluster unavailability. This is why etcd_server_raft_log_commit_duration_seconds is such a critical metric to monitor.
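To watch disk latency directly, etcd also exports etcd_disk_wal_fsync_duration_seconds, a histogram of how long fsync calls on the write-ahead log take. The rule below is a sketch, not a tuned recommendation: the alert name and the 0.5-second threshold are assumptions, and on fast disks a much lower threshold may be appropriate.

```yaml
- alert: HighEtcdWalFsyncDuration
  # p99 of WAL fsync latency per member, over a 5-minute window.
  expr: histogram_quantile(0.99, sum by (le, instance) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))) > 0.5
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Etcd WAL fsync latency is high."
    description: "The 99th percentile WAL fsync duration on {{ $labels.instance }} has exceeded 0.5 seconds for 10 minutes. Check the disk backing the etcd data directory."
```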
Once you have these alerts in place, the next logical step is to integrate them with a notification system like Alertmanager to route them to the appropriate on-call engineers.
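As a starting point, an Alertmanager configuration can route on the severity label that the rules above set. This is a minimal sketch: the receiver names and the webhook URL are hypothetical, and a real setup would replace the webhook with whatever integration your on-call tooling uses.

```yaml
# alertmanager.yml (fragment)
route:
  receiver: default             # fallback for anything not matched below
  group_by: [alertname]
  routes:
    - matchers:
        - severity = "critical" # e.g. EtcdNoLeader
      receiver: oncall-pager
receivers:
  - name: default               # no integrations configured: alerts are accepted but not sent
  - name: oncall-pager
    webhook_configs:
      - url: http://example.internal/hooks/etcd   # hypothetical endpoint
```

Prometheus also needs to know where Alertmanager runs, which is configured under the alerting section of prometheus.yml.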