etcd doesn’t store historical versions of your data forever; it keeps every revision written since the last compaction, plus an in-memory index that maps each key to the revisions at which it changed.
Here’s how it works under the hood, and what that means for you.
Imagine you have a key foo with value bar. In etcd’s storage, this isn’t just a simple key -> value map. It’s more like a key -> versioned_value map, but with a twist.
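When you PUT /foo bar, etcd creates a new entry. This entry has the key foo, the value bar, and, importantly, three counters: a per-key version and two store-wide revisions (create_revision and mod_revision). Let’s say the write lands at revision 1, so the key’s version is 1 as well.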
{
"key": "/foo",
"value": "bar",
"version": 1,
"create_revision": 1,
"mod_revision": 1
}
Now, if you PUT /foo baz, etcd doesn’t overwrite the old entry. It creates a new entry for /foo with the value baz. The key’s version becomes 2 and, assuming no other key was written in between, the store-wide revision becomes 2 as well.
{
"key": "/foo",
"value": "baz",
"version": 2,
"create_revision": 1,
"mod_revision": 2
}
This is where the "MVCC" (Multi-Version Concurrency Control) comes in. etcd keeps older versions of a key readable until they are compacted away.
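To make the bookkeeping concrete, here is a minimal sketch of an append-only store that tracks the three counters. ToyStore and KeyValue are hypothetical names, the revision counter is assumed to start at 0, and none of this is etcd’s real data structure; it only mirrors the two PUTs described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyValue:
    key: str
    value: str
    version: int          # per-key counter, 1 for the first write of a key
    create_revision: int  # store-wide revision at which the key was created
    mod_revision: int     # store-wide revision of this particular write


class ToyStore:
    """Append-only toy model (hypothetical, not etcd's implementation)."""

    def __init__(self):
        self.revision = 0  # store-wide counter; starting at 0 is an assumption
        self.history = []  # every KeyValue ever written, oldest first

    def latest(self, key):
        # Scan backwards so the most recent write for the key wins.
        for entry in reversed(self.history):
            if entry.key == key:
                return entry
        return None

    def put(self, key, value):
        self.revision += 1
        previous = self.latest(key)
        entry = KeyValue(
            key=key,
            value=value,
            version=previous.version + 1 if previous else 1,
            create_revision=previous.create_revision if previous else self.revision,
            mod_revision=self.revision,
        )
        self.history.append(entry)  # older entries are kept, never overwritten
        return entry


store = ToyStore()
first = store.put("/foo", "bar")
second = store.put("/foo", "baz")
print(first)
print(second)
```

Both entries stay in store.history, which is the property the rest of this article builds on.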
The Twist: Not True History
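The crucial part is that etcd doesn’t keep all historical versions indefinitely. It uses a mechanism called compaction. When you compact etcd, you tell it to remove superseded versions older than a given revision.
For example, if you compact at revision 2, etcd will discard the entry for /foo written at revision 1, because the entry written at revision 2 supersedes it. The storage will then only contain the entry with version 2, and any read that asks for a revision below the compaction revision fails with an error saying the revision has been compacted.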
{
"key": "/foo",
"value": "baz",
"version": 2,
"create_revision": 1,
"mod_revision": 2
}
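The compaction rule can be sketched as a pure function over each key’s list of revisions. This is a simplified illustration with a hypothetical compact helper: it assumes the newest revision at or below the compaction revision survives for each key, and it ignores deletions (tombstones) entirely.

```python
def compact(revisions_by_key, compact_rev):
    """Drop superseded revisions older than compact_rev.

    revisions_by_key maps a key to the ascending list of revisions at which
    it was written. For each key, the newest revision at or below compact_rev
    is kept so that a read at compact_rev can still be answered.
    """
    compacted = {}
    for key, revisions in revisions_by_key.items():
        at_or_below = [r for r in revisions if r <= compact_rev]
        above = [r for r in revisions if r > compact_rev]
        survivor = at_or_below[-1:]  # newest revision <= compact_rev, if any
        compacted[key] = survivor + above
    return compacted


history = {"/foo": [1, 2], "/bar": [3]}
print(compact(history, 2))  # revision 1 of /foo is superseded and dropped
```

Keys written only after the compaction revision, like /bar here, are untouched.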
How Retrieval Works with Versions
When you GET /foo, etcd returns the latest version by default.
You can’t ask for a key by its per-key version, though: version is metadata returned alongside the key, not something you can query by.
Reads of older data are addressed by revision instead. You can request the state of etcd as of a specific revision by setting the revision field of the read request. For instance, reading /foo with revision set to 1 returns the value of /foo as it was at revision 1, provided that revision hasn’t been compacted.
curl -L http://127.0.0.1:2379/v3/kv/range \
-X POST -d '{"key": "'$(echo -n "/foo" | base64)'", "revision": 1}'
This command would fetch the value of /foo as it existed when the global revision was 1. etcd internally maps this revision to the specific version of the key /foo that was current at that point in time.
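If you build that request body in code rather than in the shell, the only non-obvious step is base64-encoding the key, since etcd keys are bytes. A small sketch, where range_request_body is a hypothetical helper name and the function only builds the JSON string without sending anything:

```python
import base64
import json


def range_request_body(key, revision=None):
    """Build the JSON body for a read through etcd's JSON gateway.

    Keys are bytes in etcd, so the gateway expects them base64-encoded.
    """
    body = {"key": base64.b64encode(key.encode("utf-8")).decode("ascii")}
    if revision is not None:
        body["revision"] = revision  # leave out to read the latest revision
    return json.dumps(body)


print(range_request_body("/foo", revision=1))
# {"key": "L2Zvbw==", "revision": 1}
```

Values in the response come back base64-encoded as well, so expect to decode them on the way out.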
The Storage Engine: BoltDB’s Role
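etcd uses bbolt (a maintained fork of BoltDB, an embedded key-value store) as its underlying storage. Every write is stored in bbolt under a key derived from its revision rather than under the user’s key, so successive writes to /foo land in separate records instead of overwriting each other. An in-memory B-tree index maps each user key to the revisions at which it was written.
When you query a specific revision, etcd doesn’t "rewind" to anything. It looks the key up in the index, finds the newest revision at or before the one you asked for, and fetches that record from bbolt. bbolt’s read-only transactions give each such read a consistent view of the file while writes continue.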
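One way to picture a read at a revision is as two lookups: an index from key to the revisions at which it was written, and a backend holding one record per revision. The sketch below is a toy model under that assumption; index, backend, and get_at are hypothetical names, not etcd’s actual code.

```python
import bisect

# Backend: one record per write, addressed by revision (toy stand-in for the on-disk store).
backend = {1: ("/foo", "bar"), 2: ("/foo", "baz"), 3: ("/other", "x")}

# Index: for each key, the ascending revisions at which it was written.
index = {"/foo": [1, 2], "/other": [3]}


def get_at(key, revision):
    """Return the value of key as of revision, or None if it did not exist yet."""
    revisions = index.get(key, [])
    pos = bisect.bisect_right(revisions, revision)  # number of writes at or before `revision`
    if pos == 0:
        return None
    return backend[revisions[pos - 1]][1]


print(get_at("/foo", 1))    # bar
print(get_at("/foo", 3))    # baz (latest write at or before revision 3)
print(get_at("/other", 2))  # None (not written until revision 3)
```

Note that reading /foo at revision 3 still works even though /foo was not touched at revision 3: a revision names a point in the store’s history, not a write to that particular key.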
The create_revision and mod_revision Fields
create_revision: The revision at which the key was created. If the key is deleted and later re-created, this is the revision of the most recent creation.
mod_revision: The revision at which the key was last modified.
These are crucial for understanding the lifecycle of a key and for performing time-travel queries.
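The way these fields evolve over a key’s lifetime, including a delete followed by a re-create, can be sketched by replaying the key’s events. key_metadata is a hypothetical helper, and the reset-on-delete behavior reflects my understanding of etcd’s semantics rather than anything stated above.

```python
def key_metadata(events):
    """Derive (create_revision, mod_revision, version) from a key's events.

    events is an ascending list of (revision, kind) pairs, where kind is
    "put" or "delete". Returns None if the key does not currently exist.
    """
    create_revision = mod_revision = version = 0
    for revision, kind in events:
        if kind == "delete":
            # The key is gone; the next put starts a fresh lifetime.
            create_revision = mod_revision = version = 0
        else:
            if version == 0:
                create_revision = revision  # first put of this lifetime
            mod_revision = revision
            version += 1
    if version == 0:
        return None
    return create_revision, mod_revision, version


print(key_metadata([(1, "put"), (2, "put")]))
# (1, 2, 2)
print(key_metadata([(1, "put"), (2, "put"), (3, "delete"), (4, "put")]))
# (4, 4, 1)
```

A create_revision equal to mod_revision with version 1 is therefore a quick way to spot a key that has just been created or re-created.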
What Most People Don’t Know About Compaction
Many users think compaction frees disk space. It doesn’t, at least not directly. Compaction removes superseded revisions from etcd’s index and deletes the corresponding records from bbolt. bbolt, however, keeps everything in a single file made of fixed-size pages: pages emptied by compaction go onto a freelist and are reused by later writes, but the file itself doesn’t shrink until you defragment the member (for example with etcdctl defrag). This means that even after compaction, disk usage won’t drop as expected, and because freed pages aren’t necessarily overwritten right away, etcd data recovered from a disk image could potentially still contain "deleted" history.
The Real-World Impact
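This MVCC model underpins features like watch notifications (a client can resume a watch from a past revision without missing changes) and optimistic transactions that compare a key’s mod_revision before writing. It also interacts with leases: when a lease expires, its keys are deleted, and those deletions show up as new revisions like any other write. Most importantly, it ensures that operations within a transaction see a consistent snapshot of the database, preventing race conditions.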
The next thing you’ll bump into is how etcd’s lease system tracks expiration with TTLs, and how expired keys surface as delete events at new revisions.