CouchDB compaction doesn’t delete data in place; it rewrites the database file, keeping the current document revisions and discarding the bodies of old ones.

Let’s see what that looks like. Imagine you have a document, doc1, that you update a few times. CouchDB doesn’t overwrite the old versions. Instead, it appends new versions. Over time, this leads to a lot of "dead" data taking up space.

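// Initial state: create the document
PUT /mydb/doc1
{ "version": 1, "message": "Hello" }

// Update 1: must carry the _rev returned by the previous request, or CouchDB answers 409 Conflict
PUT /mydb/doc1
{ "_rev": "<rev from previous response>", "version": 2, "message": "World" }

// Update 2: again with the latest _rev
PUT /mydb/doc1
{ "_rev": "<rev from previous response>", "version": 3, "message": "CouchDB" }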

When you run a compaction, CouchDB copies the current revision of every document (plus any conflicting leaf revisions) into a new file, swaps that file in for the old database file, and deletes the old one. The bodies of superseded revisions are gone at that point; only their revision IDs are retained, up to the database's revision limit. The space is freed once the operating system releases the deleted file's blocks.
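To make the copy-and-swap step concrete, here is a toy model in Python. It is only a sketch and not CouchDB's on-disk format: the `AppendOnlyStore` class and its methods are invented for illustration, and a plain list stands in for the database file.

```python
class AppendOnlyStore:
    """Toy stand-in for an append-only database file (not CouchDB's real format)."""

    def __init__(self):
        self.log = []      # every version ever written, in write order
        self.latest = {}   # doc id -> position of its current version in the log

    def put(self, doc_id, body):
        # Updates never overwrite: append the new version and repoint the index.
        self.log.append((doc_id, body))
        self.latest[doc_id] = len(self.log) - 1

    def get(self, doc_id):
        return self.log[self.latest[doc_id]][1]

    def compact(self):
        # Copy only current versions into a fresh log, then swap it in.
        new_log, new_latest = [], {}
        for doc_id, pos in sorted(self.latest.items(), key=lambda kv: kv[1]):
            new_log.append(self.log[pos])
            new_latest[doc_id] = len(new_log) - 1
        self.log, self.latest = new_log, new_latest


store = AppendOnlyStore()
store.put("doc1", {"version": 1, "message": "Hello"})
store.put("doc1", {"version": 2, "message": "World"})
store.put("doc1", {"version": 3, "message": "CouchDB"})
size_before = len(store.log)   # three entries, two of them dead
store.compact()
size_after = len(store.log)    # one entry, the current version
```

The old list is simply dropped after the swap, which mirrors how the old database file is deleted once the new one is in place.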

You can trigger compaction manually via the _compact endpoint or configure CouchDB to do it automatically.

Manual Compaction:

To compact a specific database:

curl -X POST -H "Content-Type: application/json" http://localhost:5984/mydb/_compact

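This requires admin privileges. The request returns 202 Accepted immediately, and the compaction itself runs in the background. To monitor it, check the compact_running field in the database info (GET /mydb), or look for the compaction task and its progress in the _active_tasks endpoint.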
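A small polling loop is one way to wait for a background compaction to finish. The sketch below assumes the database info document returned by GET /mydb carries a boolean `compact_running` field and that the server accepts unauthenticated reads; the helper names `fetch_db_info` and `wait_for_compaction` are hypothetical.

```python
import json
import time
import urllib.request


def fetch_db_info(db_url):
    """GET the database info document, e.g. http://localhost:5984/mydb."""
    with urllib.request.urlopen(db_url) as resp:
        return json.load(resp)


def wait_for_compaction(get_info, poll_seconds=5.0, max_polls=120):
    """Poll until compact_running is false; return True if it finished in time."""
    for _ in range(max_polls):
        if not get_info().get("compact_running", False):
            return True
        time.sleep(poll_seconds)
    return False
```

Against a live server this could be called as `wait_for_compaction(lambda: fetch_db_info("http://localhost:5984/mydb"))`. Passing the fetcher in as a function keeps the loop easy to exercise without a running CouchDB; if your server requires credentials, the fetcher is the place to add them.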

Automatic Compaction:

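Automatic compaction is configured in local.ini (or a file under local.d/); avoid editing default.ini, which is overwritten on upgrade. The section and key names depend on the CouchDB version. In CouchDB 1.x and 2.x, the compaction daemon is driven by fragmentation thresholds rather than absolute file sizes:

[compaction_daemon]
; The interval between compaction checks (in seconds)
check_interval = 3600
; Ignore database and view files smaller than this (in bytes)
min_file_size = 131072

[compactions]
; Compact a database at 70% fragmentation and a view index at 60%
_default = [{db_fragmentation, "70%"}, {view_fragmentation, "60%"}]

These settings tell CouchDB to periodically check databases and views against the fragmentation thresholds and trigger compaction when they are exceeded. The check_interval determines how often these checks occur. CouchDB 3.x replaced this daemon with smoosh, which is enabled by default and configured under the [smoosh] sections, so consult the documentation for your version before copying any of these keys.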
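When tuning thresholds, it helps to know how much of a database file is dead data. The sketch below assumes the `sizes.file` and `sizes.active` fields that GET /mydb reports in CouchDB 2.x and later, and computes the share of the file not occupied by live data; the function name `fragmentation_percent` and the sample numbers are made up for illustration.

```python
def fragmentation_percent(db_info):
    """Share of the database file not occupied by live data, as a percentage."""
    sizes = db_info["sizes"]
    file_size = sizes["file"]
    if file_size == 0:
        return 0.0
    return (file_size - sizes["active"]) / file_size * 100


# Made-up numbers, for illustration only.
info = {"sizes": {"file": 1000, "active": 250, "external": 400}}
frag = fragmentation_percent(info)
```

A value that stays high right after compaction usually means the threshold is not the problem and something else is holding the space.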

Understanding the Process:

CouchDB stores documents in an append-only B-tree. Updating a document appends the new revision along with new copies of the B-tree nodes on the path from its leaf up to the root; nothing already in the file is modified. The nodes and document bodies that the new root no longer references stay in the file as dead space. Compaction walks the tree from the current root and copies only what is still reachable into the new file, which also leaves behind the stale entries in the file's internal by-ID and by-sequence indexes.
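The append-and-repoint behavior can be sketched with a much simpler structure. The example below deliberately swaps the B-tree for an unbalanced binary search tree and uses a Python list as the "file"; `Node`, `insert`, `compact`, and `lookup` are hypothetical names, and none of this reflects CouchDB's actual node layout. It only illustrates path copying and reachability-based compaction.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    key: str
    value: str
    left: Optional[int]    # offset of the left child in the file, or None
    right: Optional[int]   # offset of the right child in the file, or None


def insert(file, root, key, value):
    """Append new nodes along the path to key; return the new root offset."""
    if root is None:
        file.append(Node(key, value, None, None))
        return len(file) - 1
    node = file[root]
    if key == node.key:
        new = node._replace(value=value)
    elif key < node.key:
        new = node._replace(left=insert(file, node.left, key, value))
    else:
        new = node._replace(right=insert(file, node.right, key, value))
    file.append(new)  # existing entries are never modified
    return len(file) - 1


def compact(file, root):
    """Copy only nodes reachable from root; return (new_file, new_root)."""
    new_file = []

    def copy(offset):
        if offset is None:
            return None
        node = file[offset]
        left, right = copy(node.left), copy(node.right)
        new_file.append(Node(node.key, node.value, left, right))
        return len(new_file) - 1

    return new_file, copy(root)


def lookup(file, root, key):
    while root is not None:
        node = file[root]
        if key == node.key:
            return node.value
        root = node.left if key < node.key else node.right
    return None


file, root = [], None
for key, value in [("doc2", "a"), ("doc1", "Hello"), ("doc3", "b"),
                   ("doc1", "World"), ("doc1", "CouchDB")]:
    root = insert(file, root, key, value)
new_file, new_root = compact(file, root)
```

Five writes leave nine nodes in the original list because every write also re-appends the nodes above the changed one; after compaction only the three live nodes remain.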

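The _view_cleanup endpoint is related but distinct. It removes view index files that no current design document references, for example those left behind after a view definition changes. Database compaction, on the other hand, cleans up old document revisions; the view indexes of each design document are compacted separately with POST /mydb/_compact/{design-doc-name}.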

Compaction is resource-intensive. It reads the whole database and writes a new copy, so it uses significant disk I/O and CPU and temporarily needs enough free disk space for the new file alongside the old one. On large databases it can run for a long time and slow down other requests. Schedule manual compactions for off-peak hours, or rely on automatic compaction with carefully tuned thresholds.

The most surprising thing about CouchDB’s storage mechanism is how it achieves its high write throughput and fault tolerance: by never overwriting data. Every update creates a new version, and compaction is the mechanism to reclaim space from these historical versions. This append-only approach simplifies concurrency control and crash recovery significantly.

A CouchDB database file is not static: it grows as documents are added and updated, and shrinks when it is compacted. The underlying .couch files are essentially append-only logs of changes, and compaction creates a new, condensed log from the live entries.

If you’ve recently compacted a large database and are still seeing high disk usage, the next thing to investigate is whether the operating system has actually reclaimed the disk blocks of the deleted old files. On Unix-like systems, those blocks are not freed while any process still holds the deleted file open.
