`etcdserver: mvcc: database space exceeded` is the error you’re seeing. It means the etcd database has grown past its storage quota, so etcd has raised a NOSPACE alarm and is rejecting new writes.
The most common reason for this is simply that etcd is storing too much historical data. Every write to etcd creates a new revision, and by default, etcd keeps all of them. Over time, this bloats the database.
Cause 1: Excessive revisions due to frequent writes or long retention.
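- Diagnosis: Confirm the alarm and check how large the database has grown. `etcdctl alarm list` shows whether a NOSPACE alarm is active, and `etcdctl endpoint status --write-out=table` reports each member’s database size. etcd does not expose a count of retained revisions through a single command, so the growth of the database size over time is the practical indicator.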
- Fix: Configure etcd’s auto-compaction to remove old revisions periodically. For example, starting etcd with `--auto-compaction-mode=periodic --auto-compaction-retention=10m` keeps roughly the last 10 minutes of history. To recover a cluster that has already raised the alarm, compact, defragment, and then disarm the alarm:
      etcdctl compact <revision_number_to_compact_up_to>
      etcdctl defrag
      etcdctl alarm disarm
- Why it works: Auto-compaction tells etcd to discard revisions older than the retention period. Compaction alone only marks that space as free inside the database file; `defrag` releases it back to the filesystem. Compaction and defragmentation are still permitted while the alarm is raised, and disarming the alarm afterwards lets writes resume. `etcdctl alarm list` shows whether any alarm is still active. Note that `etcdctl defrag` only acts on the endpoints it is pointed at, so run it against every member (for example with `--cluster`).
- Revision number: `<revision_number_to_compact_up_to>` should be a recent revision. You can get the latest revision with `etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision'`, then run `etcdctl compact <latest_revision_number>`.
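The recovery sequence for Cause 1 can be sketched as a small shell helper. This is a minimal sketch, not a definitive procedure: the function names `latest_revision` and `reclaim_space` are made up for illustration, it assumes `etcdctl` v3 is on the PATH with endpoints and credentials supplied through the usual `ETCDCTL_*` environment variables, and it follows the order used in the etcd maintenance documentation (compact, defragment, then disarm). It defragments only the configured endpoints, so a multi-member cluster would need every member covered.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; etcdctl v3 is assumed to be
# configured via ETCDCTL_ENDPOINTS and related environment variables.

# Read `etcdctl endpoint status --write-out=json` on stdin and print the
# revision reported by the first endpoint.
latest_revision() {
  grep -Eo '"revision": *[0-9]+' | head -n 1 | grep -Eo '[0-9]+'
}

# Compact up to the latest revision, defragment, then clear the alarm.
reclaim_space() {
  rev=$(etcdctl endpoint status --write-out=json | latest_revision)
  if [ -z "$rev" ]; then
    echo "could not determine the current revision" >&2
    return 1
  fi
  etcdctl compact "$rev" || return 1   # discard history older than $rev
  etcdctl defrag || return 1           # release freed space to the filesystem
  etcdctl alarm disarm || return 1     # clear the alarm so writes resume
  etcdctl alarm list                   # show any alarms that are still active
}
```

After sourcing the file, running `reclaim_space` on a node with access to the cluster performs the whole sequence and stops at the first step that fails.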
Cause 2: Large individual keys or values.
- Diagnosis: Identify keys whose values are exceptionally large. etcdctl has no option to sort by size (`--sort-by` accepts only CREATE, KEY, MODIFY, VALUE, or VERSION), so list the keys with `etcdctl get / --prefix --keys-only` and measure each value, for example with `etcdctl get <key> --print-value-only | wc -c`.
- Fix: If you find large keys, investigate why they are so big. Often this is due to storing large objects such as certificates, kubeconfigs, or serialized objects directly in etcd. Consider storing these objects elsewhere and only referencing them from etcd. If you must store them, ensure your compaction and defrag strategy is aggressive enough.
- Why it works: Shrinking or removing large values reduces the data etcd has to keep for every retained revision. Compaction and defragmentation are still necessary to reclaim the space.
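One way to rank keys by value size is to list the keys and measure each value in turn. The sketch below does that. The function name `largest_keys` is hypothetical, and the approach is an assumption about what is practical, not a built-in etcdctl feature. It issues one `etcdctl get` per key, so it is slow on large keyspaces. The sizes are approximate because etcdctl appends a newline to each printed value, and keys containing newlines are not handled.

```shell
#!/bin/sh
# Sketch only: `largest_keys` is a hypothetical helper, not part of etcdctl.
# Usage: largest_keys [prefix] [count]
largest_keys() {
  prefix="${1:-/}"
  top="${2:-10}"
  # --keys-only separates keys with blank lines, so drop the empty lines.
  etcdctl get "$prefix" --prefix --keys-only | grep -v '^$' |
    while IFS= read -r key; do
      # Approximate value size in bytes (includes etcdctl's trailing newline).
      bytes=$(etcdctl get "$key" --print-value-only </dev/null | wc -c | tr -d ' ')
      printf '%s\t%s\n' "$bytes" "$key"
    done | sort -rn | head -n "$top"
}
```

For example, `largest_keys /registry 20` would print the 20 largest values under `/registry` as tab-separated size and key pairs, largest first.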
Cause 3: Leaky applications writing excessive data.
- Diagnosis: Monitor etcd’s write rate and correlate it with application logs. Look for applications that create large numbers of keys or update existing keys far more often than they need to. etcd exposes Prometheus metrics on its `/metrics` endpoint, including the database size (`etcd_mvcc_db_total_size_in_bytes`), which makes growth easy to track over time.
- Fix: Identify the misbehaving application and fix its logic to reduce unnecessary writes or data storage. This is an application-level fix.
- Why it works: Stopping the source of excessive writes prevents the database from growing uncontrollably.
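A quick way to see which application or resource type owns most of the keys is to group the key list by prefix. The sketch below is a minimal example of that idea. The helper name `count_keys_by_prefix` and the default depth of two path segments are illustrative choices, and the `/registry/<resource>/...` layout in the usage note applies to Kubernetes-managed etcd, which may not match your setup.

```shell
#!/bin/sh
# Sketch only: `count_keys_by_prefix` is a hypothetical helper.
# Reads keys on stdin (one per line, blank lines ignored) and prints
# "count<TAB>prefix", grouped by the first N path segments, largest first.
count_keys_by_prefix() {
  depth="${1:-2}"
  grep -v '^$' |
    awk -F/ -v d="$depth" '
      {
        p = ""
        # Field 1 is empty because keys start with "/".
        for (i = 2; i <= d + 1 && i <= NF; i++) p = p "/" $i
        n[p]++
      }
      END { for (p in n) printf "%d\t%s\n", n[p], p }
    ' | sort -rn
}
```

On a Kubernetes cluster this could be fed with `etcdctl get /registry --prefix --keys-only | count_keys_by_prefix 2`. A prefix whose count keeps climbing between runs is a good candidate for the leaking writer.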
Cause 4: Inadequate disk space on the etcd nodes.
- Diagnosis: Check the available disk space on the nodes where etcd is running: `df -h /var/lib/etcd`
- Fix: Increase the disk size or free up space on the etcd nodes, for example by resizing the filesystem or adding a new disk. Ensure the filesystem hosting etcd’s data directory has enough free space for growth and operations.
- Why it works: etcd needs sufficient underlying disk space to operate. If the disk is full, it cannot write new data, leading to errors.
Cause 5: etcd cluster health issues leading to retries or stale data.
- Diagnosis: Check etcd cluster health and member status with `etcdctl endpoint health --cluster` and `etcdctl member list`.
- Fix: Resolve any underlying cluster health issues, such as network partitions, leader election problems, or unhealthy members. The exact commands depend on the specific health issue. A healthy cluster is crucial for efficient operation and for preventing data bloat from retries.
- Why it works: A stable cluster prevents redundant writes and ensures data consistency, which indirectly helps manage database size.
Cause 6: Incorrect configuration of compaction and defrag.
- Diagnosis: Review your etcd configuration for `auto-compaction-retention` and ensure `defrag` is being run periodically. Many users configure compaction but forget to run `defrag`, which is essential for reclaiming disk space.
- Fix: Set `auto-compaction-retention` to a reasonable value (e.g., `1h` for 1 hour) and schedule `etcdctl defrag` to run regularly, perhaps daily or weekly, depending on your write load.
      # In the etcd static pod manifest or configuration file
      # ...
      command:
      - etcd
      - --auto-compaction-retention=1h
      # ...
- Why it works: Proper configuration ensures that old data is automatically removed and the physical space is reclaimed, keeping the database manageable.
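Scheduled defragmentation is usually done one member at a time, because a member stops serving reads and writes while it rebuilds its database file. The sketch below shows one way to do that. The function name `defrag_members`, the `DEFRAG_PAUSE_SECONDS` variable, and the 5-second default pause are all assumptions made for illustration, and the endpoint URLs in the usage note are placeholders.

```shell
#!/bin/sh
# Sketch only: `defrag_members` and DEFRAG_PAUSE_SECONDS are hypothetical.
# $1 is a comma-separated list of client URLs.
defrag_members() {
  for ep in $(printf '%s' "$1" | tr ',' ' '); do
    echo "defragmenting $ep"
    # Defragment a single member; stop if it fails so the rest stay untouched.
    etcdctl --endpoints="$ep" defrag || return 1
    # Give the member a moment to catch up before moving to the next one.
    sleep "${DEFRAG_PAUSE_SECONDS:-5}"
  done
}
```

A call such as `defrag_members "https://10.0.0.1:2379,https://10.0.0.2:2379,https://10.0.0.3:2379"` could then be run from a cron job or systemd timer at whatever interval suits the write load.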
If the database is still over its quota after compaction, defragmentation, and disarming the alarm, you’ll likely see the alarm re-trigger or a new error related to the underlying storage (e.g., "no space left on device").