Write amplification is when your database writes more data to disk than the application logically intends. In CockroachDB, this usually means your cluster is spending too much time on disk I/O, impacting performance and increasing storage costs.
Let’s see what a typical write amplification problem looks like in action. Imagine you’re running a workload that inserts 100 bytes per transaction.
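```shell
# Simulate a simple write-only workload against a local insecure cluster
# (the connection URL is an example; adjust host, port, and TLS settings)
cockroach workload init kv 'postgresql://root@localhost:26257?sslmode=disable'
cockroach workload run kv --duration=1m --read-percent=0 \
  --min-block-bytes=100 --max-block-bytes=100 \
  'postgresql://root@localhost:26257?sslmode=disable'
```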
Now, let’s look at the performance metrics. You’d expect disk writes to be roughly proportional to the data inserted. If you’re seeing disk I/O orders of magnitude higher than your logical write rate, you’ve likely got write amplification.
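```sql
-- Monitor cumulative disk I/O per node
-- (metric names can differ between CockroachDB versions)
SELECT
  node_id,
  (metrics->>'sys.host.disk.write.bytes')::FLOAT AS total_bytes_written,
  (metrics->>'sys.host.disk.read.bytes')::FLOAT AS total_bytes_read
FROM
  crdb_internal.kv_node_status
ORDER BY
  node_id;
```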
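```sql
-- Per store: compare bytes flushed from memtables with bytes rewritten
-- by compactions, and check how deep Level 0 has grown
-- (metric names can differ between CockroachDB versions)
SELECT
  node_id,
  store_id,
  (metrics->>'rocksdb.flushed-bytes')::FLOAT AS flushed_bytes,
  (metrics->>'rocksdb.compacted-bytes-written')::FLOAT AS compacted_bytes_written,
  (metrics->>'storage.l0-sublevels')::FLOAT AS l0_sublevels
FROM
  crdb_internal.kv_store_status
ORDER BY
  compacted_bytes_written DESC
LIMIT 10;
```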
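To turn counters like these into a single number, you can sample them twice and divide the physical bytes written by the logical bytes written over the same interval. The sketch below assumes you already have two snapshots of cumulative counters; the dictionary keys `logical_bytes` and `disk_bytes_written` and the sample values are hypothetical, not CockroachDB metric names.

```python
def write_amplification(logical_bytes: float, physical_bytes: float) -> float:
    """Ratio of bytes physically written to bytes the application wrote."""
    if logical_bytes <= 0:
        raise ValueError("logical_bytes must be positive")
    return physical_bytes / logical_bytes


def amplification_between(sample_start: dict, sample_end: dict) -> float:
    """Compute the ratio from two snapshots of cumulative counters."""
    logical = sample_end["logical_bytes"] - sample_start["logical_bytes"]
    physical = sample_end["disk_bytes_written"] - sample_start["disk_bytes_written"]
    return write_amplification(logical, physical)


# Hypothetical snapshots taken one minute apart.
start = {"logical_bytes": 0, "disk_bytes_written": 1_000_000}
end = {"logical_bytes": 6_000_000, "disk_bytes_written": 91_000_000}

ratio = amplification_between(start, end)
print(f"write amplification: {ratio:.1f}x")
```

Using deltas rather than raw counter values matters because the counters are cumulative since process start.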
The fundamental problem write amplification creates is increased latency and reduced throughput. Every logical write triggers several physical writes, and when those are inefficient they compete with foreground traffic for disk bandwidth. This isn’t just about the initial data write; it includes writes for the WAL (Write-Ahead Log), MVCC (Multi-Version Concurrency Control) overhead, compactions, and replication.
Here’s how CockroachDB’s storage engine, Pebble, handles writes, and why amplification happens:
- Log-Structured Merge-Tree (LSM-Tree): Pebble uses an LSM-tree. Writes are first appended to an in-memory memtable and a WAL file. When the memtable is full, it’s flushed to disk as an immutable SSTable (Sorted String Table) file in Level 0.
- Compactions: To keep read performance high and manage disk space, SSTables are periodically "compacted." This involves merging SSTables from one level (starting with Level 0) into new SSTables at the next level down the tree, discarding entries that have been deleted or overwritten. Because the same data is rewritten each time it moves down a level, these merges are a primary source of amplification.
- MVCC Overhead: CockroachDB is a multi-version database. Every update or delete creates a new version of a key. While older versions are eventually garbage collected, the system must manage these versions, which adds to write traffic.
- Replication: Data is replicated across multiple nodes. When a write occurs on a range, it’s sent to all replicas, meaning the logical write is physically written multiple times across the cluster.
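The four sources above multiply rather than add, which a back-of-the-envelope model makes visible. The following is a rough sketch, not a description of Pebble's actual accounting: the function name and every factor (one WAL write, one flush, five compaction rewrites, three replicas) are assumptions chosen for illustration.

```python
def estimate_physical_bytes(
    logical_bytes: int,
    replicas: int = 3,                # assumed replication factor
    wal_factor: float = 1.0,          # assumed: each byte is written once to the WAL
    flush_factor: float = 1.0,        # assumed: and once when the memtable is flushed
    compaction_rewrites: float = 5.0, # assumed: rewritten ~5 times while moving down levels
) -> float:
    """Rough estimate of cluster-wide bytes written for one logical write."""
    per_replica = logical_bytes * (wal_factor + flush_factor + compaction_rewrites)
    return per_replica * replicas


total = estimate_physical_bytes(100)
print(f"100 logical bytes -> about {total:.0f} physical bytes across the cluster")
```

Under these assumed factors, the 100-byte insert from the workload above costs about 2,100 bytes of disk writes, so even modest per-level factors add up quickly.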
The core of managing write amplification lies in understanding and tuning the factors that influence LSM-tree behavior and data versioning.
Common Causes and Fixes for Write Amplification
1. High Write Throughput with Small Writes:
- Diagnosis: Look for a high rate of KV writes (for example, `writes_per_second` in `crdb_internal.kv_store_status`) coupled with a small average value size. Also watch the `rocksdb.flushes` and `storage.l0-sublevels` store metrics for frequent memtable flushes and a growing Level 0.
- Cause: Many small, independent writes (e.g., single-row inserts, updates to small values) each pay a fixed per-statement overhead and keep memtable flushes and L0 compactions busy. Each flush creates a new SSTable, and L0 compactions are expensive because L0 files can overlap, so a single compaction may have to merge many small files.
- Fix: Batch your writes. Instead of individual `INSERT` or `UPDATE` statements, use `INSERT ... VALUES (...), (...), ...` or `UPSERT` statements that include multiple rows. For complex transactions, ensure they operate on related keys that can be processed together.
- Example Fix: If you are performing 1000 individual `INSERT INTO foo VALUES (1, 'data1');` statements, change them to `INSERT INTO foo VALUES (1, 'data1'), (2, 'data2'), ..., (1000, 'data1000');`.
- Why it Works: Batching amortizes the fixed per-statement cost (transaction bookkeeping, Raft proposals, WAL appends) over many rows, so fewer bytes are written per row and there are fewer flushes and L0 compactions per unit of logical data.
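One way to see the effect of batching is a toy model in which every statement carries a fixed overhead on top of its row payload. The 200-byte overhead below is an assumed figure for illustration only, and `bytes_written` is a hypothetical helper, not a CockroachDB API.

```python
def bytes_written(
    rows: int,
    row_bytes: int,
    rows_per_statement: int,
    overhead_per_statement: int = 200,  # assumed fixed cost per statement
) -> int:
    """Payload bytes plus a fixed overhead for every statement issued."""
    statements = -(-rows // rows_per_statement)  # ceiling division
    return rows * row_bytes + statements * overhead_per_statement


single = bytes_written(rows=1000, row_bytes=100, rows_per_statement=1)
batched = bytes_written(rows=1000, row_bytes=100, rows_per_statement=100)

print(f"single-row statements: {single} bytes")
print(f"100-row batches:       {batched} bytes")
```

With these assumptions the overhead dominates the single-row case and nearly disappears once rows are grouped, which is the intuition behind the fix above.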
2. Large Values and Frequent Updates:
- Diagnosis: Look for tables with large average row sizes that receive frequent updates, and check the MVCC statistics of their stores (for example the `gcbytesage` metric, or the gap between `livebytes` and `totalbytes`) for a large volume of non-live versions.
- Cause: Every update writes a complete new MVCC version of the value, even if only a small portion of it changed. Because SSTables are immutable, each of these full-size versions is rewritten as compactions move it down the LSM-tree, and the old version stays on disk until GC removes it.
- Fix: If possible, break down large values into smaller, independent key-value pairs. If a large blob needs to be updated, consider whether only a part of it is truly changing and whether that part can be a separate key.
- Example Fix: Instead of `UPDATE large_object SET data = 'new_large_data' WHERE id = 1;`, consider `UPDATE large_object_part1 SET data = 'new_part1_data' WHERE id = 1;` and `UPDATE large_object_part2 SET data = 'new_part2_data' WHERE id = 1;` if `data` can be logically split.
- Why it Works: Updating a smaller, distinct key writes a smaller new version, so less data is appended to the LSM-tree and later rewritten by compactions.
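The saving from splitting a large value can be estimated directly, since each update writes a full new version of whatever key-value pair it touches. This is a minimal sketch with assumed sizes (a 1 MB blob split into 10 KB parts, 16-byte keys); the helper name is hypothetical.

```python
def version_bytes_per_update(value_bytes: int, key_bytes: int = 16) -> int:
    """Bytes in the new version written by one update (key plus full value)."""
    return key_bytes + value_bytes


whole_blob = version_bytes_per_update(1_000_000)  # update rewrites the entire blob
one_part = version_bytes_per_update(10_000)       # update rewrites one 10 KB part

print(f"whole blob: {whole_blob} bytes per update")
print(f"one part:   {one_part} bytes per update")
print(f"reduction:  {whole_blob / one_part:.0f}x")
```

The reduction only materializes if updates really touch a single part; an update that changes every part writes the same total volume as before, plus extra keys.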
3. Suboptimal `range_max_bytes`:
- Diagnosis: Check range sizes (in recent versions, for example, with `SHOW RANGES FROM TABLE your_table WITH DETAILS`) for ranges that are significantly larger or smaller than the `range_max_bytes` value in the applicable zone configuration. Also watch the `range.splits` and `range.merges` metrics for excessive range splits or merges.
- Cause: Ranges that are too large make rebalancing and recovery more expensive, because an entire range has to be copied whenever a replica moves. Conversely, very small ranges lead to a high number of ranges, increasing metadata overhead and potentially causing more frequent splits and merges, which are write-intensive operations.
- Fix: Adjust `range_max_bytes`. The default is 512 MiB. For workloads with very large values or high write rates, you might consider increasing this value (e.g., to 1 GiB or 2 GiB). For workloads with many small rows, you might consider decreasing it, but this is less common for write amplification.
- Example Fix: `ALTER DATABASE your_database CONFIGURE ZONE USING range_max_bytes = 1073741824; -- 1 GiB`
- Why it Works: A larger `range_max_bytes` allows ranges to grow larger before splitting, which reduces split activity and the overhead associated with managing a vast number of small ranges.
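To get a feel for how the range size limit changes the number of ranges the cluster must manage, a quick estimate helps. This sketch assumes every range is filled to its maximum, so it gives a lower bound; real clusters have more ranges because ranges split before or shortly after reaching the limit. The table size is a made-up example.

```python
import math


def min_range_count(table_bytes: int, range_max_bytes: int) -> int:
    """Lower bound on the number of ranges needed to hold table_bytes."""
    return max(1, math.ceil(table_bytes / range_max_bytes))


MIB = 2**20
table_bytes = 2**40  # hypothetical 1 TiB table

at_512_mib = min_range_count(table_bytes, 512 * MIB)
at_1_gib = min_range_count(table_bytes, 1024 * MIB)

print(f"512 MiB ranges: at least {at_512_mib}")
print(f"1 GiB ranges:   at least {at_1_gib}")
```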
4. High MVCC Garbage Collection (GC) Overhead:
- Diagnosis: Compare `livebytes` with `totalbytes` in the store metrics (a low ratio means many non-live versions) and watch `gcbytesage`. Also observe the MVCC GC queue metrics, such as `queue.gc.pending` and `queue.gc.process.success`, to see whether GC is keeping up.
- Cause: If the GC TTL is long or GC is unable to keep up with new writes, old versions of data accumulate. This increases the total amount of data on disk that Pebble needs to manage during compactions, amplifying writes as more versions are merged.
- Fix: Ensure your `gc.ttlseconds` setting is appropriate for your data’s lifecycle. For very active datasets where data is frequently updated or deleted, a shorter TTL might be necessary. Also, ensure your cluster has sufficient I/O capacity to handle background GC.
- Example Fix: `ALTER TABLE your_table CONFIGURE ZONE USING gc.ttlseconds = 3600; -- 1 hour` (adjust `your_table` and the TTL value).
- Why it Works: A shorter GC TTL means older versions of data are removed from the system more quickly, reducing the total amount of data (live and dead) that Pebble needs to process during compactions and therefore reducing write amplification.
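The amount of garbage a TTL forces the cluster to retain can be approximated as update rate times version size times TTL. The sketch below uses illustrative numbers (100 updates per second, 1,000-byte versions, TTLs of four hours and one hour) and assumes GC removes versions as soon as they expire, which is optimistic.

```python
def steady_state_garbage_bytes(
    updates_per_second: float,
    bytes_per_version: int,
    gc_ttl_seconds: int,
) -> float:
    """Bytes of superseded versions that must be kept until the TTL expires."""
    return updates_per_second * bytes_per_version * gc_ttl_seconds


four_hours = steady_state_garbage_bytes(100, 1_000, 14_400)
one_hour = steady_state_garbage_bytes(100, 1_000, 3_600)

print(f"4 h TTL: {four_hours / 1e9:.2f} GB of retained garbage")
print(f"1 h TTL: {one_hour / 1e9:.2f} GB of retained garbage")
```

The retained volume scales linearly with the TTL, so the benefit of shortening it is easy to estimate before changing any zone configuration.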
5. Inefficient Compaction Settings:
- Diagnosis: Observe the `rocksdb.flushes`, `rocksdb.compactions`, and `storage.l0-sublevels` store metrics for exceptionally high flush and compaction rates, especially in L0.
- Cause: Default Pebble compaction settings are generally well-tuned, but extreme workloads can benefit from adjustments. If L0 compactions are overwhelming, it indicates that memtables are being flushed faster than L0 can be compacted. If lower-level compactions are frequent, it might mean SSTables are too small or too numerous.
- Fix: Direct tuning of Pebble’s compaction strategy is advanced and generally discouraged unless you have deep expertise. The options that matter most are the memtable size (`MemTableSize` in Pebble) and the per-level target SSTable size (`TargetFileSize`). These are storage-engine options, not SQL settings.
- Example Fix: There is no zone-configuration variable for SSTable size, so a statement such as `ALTER RANGE default CONFIGURE ZONE USING lsm.max_table_size = ...` is not valid. Changing Pebble options means overriding them when the node starts, which is a significant change and requires careful consideration and testing.
- Why it Works: Increasing the target SSTable size (within reasonable limits) can lead to fewer, larger SSTables at lower levels, potentially reducing the frequency of compactions between those levels. Conversely, if L0 is the bottleneck, ensuring memtables are sized appropriately to absorb writes efficiently is key.
6. Excessive Indexing on Frequently Updated Columns:
- Diagnosis: Identify indexes on columns that are frequently updated. Use `SHOW INDEX FROM your_table` and correlate with write patterns.
- Cause: Every index on a table is stored as its own set of key-value pairs. When a row is updated, every index that includes a changed column must also be updated, typically by deleting the old index entry and writing a new one. If an index is on a column that changes often, this leads to a high volume of writes per logical row update.
- Fix: Review your indexes. Remove any secondary indexes that are not critical for query performance. Consider whether an index is truly needed on a column that experiences frequent churn.
- Example Fix: `DROP INDEX IF EXISTS my_table@my_table_frequently_updated_idx;`
- Why it Works: Removing an index eliminates the need to write updates to that index’s data structure every time the indexed column changes, directly reducing the total number of writes.
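The write fan-out from indexes can be counted with a simplified model: one write for the primary row, plus a delete and a put for every secondary index whose columns changed. This sketch ignores column families and stored columns, and the index and column names are hypothetical.

```python
def kv_writes_per_update(changed_columns: set, indexes: dict) -> int:
    """Count KV writes for one row update under a simplified model."""
    writes = 1  # the primary index row is always rewritten
    for index_columns in indexes.values():
        if changed_columns & set(index_columns):
            writes += 2  # delete the old index entry, put the new one
    return writes


indexes = {
    "idx_status": ["status"],
    "idx_email": ["email"],
    "idx_status_updated": ["status", "updated_at"],
}

before = kv_writes_per_update({"status"}, indexes)

# Drop one index that includes the frequently updated column.
remaining = {name: cols for name, cols in indexes.items() if name != "idx_status_updated"}
after = kv_writes_per_update({"status"}, remaining)

print(f"KV writes per status update: {before} before, {after} after dropping one index")
```

Each of these KV writes is then subject to the WAL, compaction, and replication multipliers described earlier, so trimming one index removes more physical I/O than the count alone suggests.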
After addressing these common causes, your next potential bottleneck might be read amplification, which appears when compactions cannot keep up with incoming writes or when your data distribution leads to many SSTable lookups for a single read.