Write amplification is when your database writes more data to disk than the application logically intends. In CockroachDB, this usually means your cluster is spending too much time on disk I/O, impacting performance and increasing storage costs.

Let’s see what a typical write amplification problem looks like in action. Imagine you’re running a workload that inserts 100 bytes per transaction.

# Simulate a simple insert workload
cockroach workload run kv --insecure --duration=1m \
  --insert-range '0-1000000' --scatter-inserts \
  '--max-key-size=10' '--max-value-size=100'

Now, let’s look at the performance metrics. You’d expect disk writes to be roughly proportional to the data inserted. If you’re seeing disk I/O orders of magnitude higher than your logical write rate, you’ve likely got write amplification.

-- Monitor disk I/O per node
SELECT
  node_id,
  sum(bytes_written) AS total_bytes_written,
  sum(bytes_read) AS total_bytes_read
FROM
  crdb_internal.node_disk_statistics
GROUP BY
  node_id
ORDER BY
  node_id;

-- Check for high write latency on specific ranges
SELECT
  range_id,
  avg(kv.write.latency) AS avg_write_latency,
  sum(kv.write.total_ops) AS total_write_ops
FROM
  crdb_internal.kv_flow_control_statistics
WHERE
  kv.write.total_ops > 0
GROUP BY
  range_id
ORDER BY
  avg_write_latency DESC
LIMIT 10;

The fundamental problem write amplification creates is increased latency and reduced throughput. Every byte written to disk by CockroachDB is part of a larger operation, and if those operations are inefficient, the entire system grinds to a halt. This isn’t just about the initial data write; it includes writes for WAL (Write-Ahead Log), MVCC (Multi-Version Concurrency Control) overhead, compactions, and replication.

Here’s how CockroachDB’s storage engine, Pebble, handles writes, and why amplification happens:

  1. Log-Structured Merge-Tree (LSM-Tree): Pebble uses an LSM-tree. Writes are first appended to an in-memory memtable and a WAL file. When the memtable is full, it’s flushed to disk as an immutable SSTable (Sorted String Table) file in Level 0.
  2. Compactions: To keep read performance high and manage disk space, SSTables are periodically "compacted." This involves merging SSTables from higher levels into new SSTables at lower levels, discarding deleted or overwritten versions of keys. These merges and writes are a primary source of amplification.
  3. MVCC Overhead: CockroachDB is a multi-version database. Every update or delete creates a new version of a key. While older versions are eventually garbage collected, the system must manage these versions, which adds to write traffic.
  4. Replication: Data is replicated across multiple nodes. When a write occurs on a range, it’s sent to all replicas, meaning the logical write is physically written multiple times across the cluster.

The core of managing write amplification lies in understanding and tuning the factors that influence LSM-tree behavior and data versioning.

Common Causes and Fixes for Write Amplification

1. High Write Throughput with Small Writes:

  • Diagnosis: Look for a high rate of kv.write.total_ops in crdb_internal.kv_flow_control_statistics coupled with a low average value size. Also, check crdb_internal.l0_compactions for frequent L0 compactions.
  • Cause: Many small, independent writes (e.g., single-row inserts, updates to small values) lead to frequent memtable flushes and L0 compactions. Each flush creates a new SSTable, and L0 compactions are the most expensive as they involve merging potentially many small files.
  • Fix: Batch your writes. Instead of individual INSERT or UPDATE statements, use INSERT ... VALUES (...), (...), ... or UPSERT statements that include multiple rows. For complex transactions, ensure they operate on related keys that can be processed together.
    • Example Fix: If you are performing 1000 individual INSERT INTO foo VALUES (1, 'data1'); statements, change it to INSERT INTO foo VALUES (1, 'data1'), (2, 'data2'), ..., (1000, 'data1000');.
    • Why it Works: Batching reduces the number of memtable flushes and L0 compactions per unit of data, as a single larger write can fill a memtable more efficiently and result in fewer, larger SSTables being created.

2. Large Values and Frequent Updates:

  • Diagnosis: Examine crdb_internal.kv_flow_control_statistics for high kv.write.latency on ranges with large average value sizes (max_value_size from your workload). Also, monitor crdb_internal.mvcc_stats for a high gc_bytes_age_threshold or a large number of versions.
  • Cause: Updating large values frequently causes entire SSTables containing those keys to be rewritten during compactions, even if only a small portion of the value changed. This is because SSTables are immutable, and a "write" to an existing key means a new version is written, and the old version is marked for GC.
  • Fix: If possible, break down large values into smaller, independent key-value pairs. If a large blob needs to be updated, consider if only a part of it is truly changing and if that part can be a separate key.
    • Example Fix: Instead of UPDATE large_object SET data = 'new_large_data' WHERE id = 1;, consider UPSERT large_object_part1 SET data = 'new_part1_data' WHERE id = 1; and UPSERT large_object_part2 SET data = 'new_part2_data' WHERE id = 1; if data can be logically split.
    • Why it Works: Updating smaller, distinct keys results in fewer SSTables being involved in compactions for that specific update, reducing the amount of data rewritten.

3. Suboptimal max_bytes_per_range:

  • Diagnosis: Check crdb_internal.ranges for ranges that are significantly larger or smaller than the cluster’s max_bytes_per_range setting. Also, observe crdb_internal.range_split_rebalance_counts for excessive range splits or merges.
  • Cause: Ranges that are too large can lead to inefficient compactions and reads as Pebble has to process more data within a single range. Conversely, very small ranges can lead to a high number of ranges, increasing metadata overhead and potentially leading to more frequent splits and merges, which are write-intensive operations.
  • Fix: Adjust max_bytes_per_range. The default is 512MB. For workloads with very large values or high write rates, you might consider increasing this value (e.g., to 1GB or 2GB). For workloads with many small rows, you might consider decreasing it, but this is less common for write amplification.
    • Example Fix: ALTER DATABASE your_database SET CLUSTER SETTING range_max_bytes = 1073741824; -- 1GB
    • Why it Works: A larger max_bytes_per_range allows ranges to grow larger before splitting, which can lead to fewer, larger SSTables per range and potentially more efficient compactions. It reduces the overhead associated with managing a vast number of small ranges.

4. High MVCC Garbage Collection (GC) Overhead:

  • Diagnosis: Monitor crdb_internal.mvcc_stats for a high gc_bytes_age_threshold and a large live_bytes relative to total_bytes. Also, observe the gc.requests_total metric in crdb_internal.jobs for frequent GC jobs.
  • Cause: If GC is running too frequently or is unable to keep up with new writes, old versions of data accumulate. This increases the total amount of data on disk that Pebble needs to manage during compactions, amplifying writes as more versions are merged.
  • Fix: Ensure your gc.ttlseconds setting is appropriate for your data’s lifecycle. For very active datasets where data is frequently updated or deleted, a shorter TTL might be necessary. Also, ensure your cluster has sufficient I/O capacity to handle background GC.
    • Example Fix: ALTER TABLE your_table CONFIGURE ZONE USING gc.ttlseconds = 3600; -- 1 hour (Adjust your_table and the TTL value).
    • Why it Works: A shorter GC TTL means older versions of data are removed from the system more quickly, reducing the total amount of data (live and dead) that Pebble needs to process during compactions and therefore reducing write amplification.

5. Inefficient Compaction Settings:

  • Diagnosis: Observe crdb_internal.l0_compactions and crdb_internal.level_compactions for exceptionally high rates of compactions, especially L0. Check crdb_internal.lsm_metrics for high memtable_hit_rate or memtable_size.
  • Cause: Default Pebble compaction settings are generally well-tuned, but extreme workloads can benefit from adjustments. If L0 compactions are overwhelming, it indicates too many memtables are being flushed. If lower-level compactions are frequent, it might mean SSTables are too small or too numerous.
  • Fix: While direct tuning of Pebble’s compaction strategy is advanced and generally discouraged unless you have deep expertise, you can indirectly influence it by adjusting memtable flush sizes (memtable_size) and target SSTable sizes (lsm.max-table-size).
    • Example Fix: ALTER RANGE default CONFIGURE ZONE SETTINGS lsm.max_table_size = 256 * 1024 * 1024; -- 256MB (This is a significant change and requires careful consideration and testing).
    • Why it Works: Increasing lsm.max_table_size (within reasonable limits) can lead to fewer, larger SSTables at lower levels, potentially reducing the frequency of compactions between those levels. Conversely, if L0 is the bottleneck, ensuring memtables are sized appropriately to absorb writes efficiently is key.

6. Excessive Indexing on Frequently Updated Columns:

  • Diagnosis: Identify indexes on columns that are frequently updated. Use SHOW INDEX FROM your_table and correlate with write patterns.
  • Cause: Every index on a table is a separate key-value store. When a row is updated, all indexes referencing that row must also be updated. If an index is on a column that changes often, this leads to a high volume of writes per logical row update.
  • Fix: Review your indexes. Remove any secondary indexes that are not critical for query performance. Consider if an index is truly needed on a column that experiences frequent churn.
    • Example Fix: DROP INDEX IF EXISTS my_table_frequently_updated_idx ON my_table;
    • Why it Works: Removing an index eliminates the need to write updates to that index’s data structure every time the indexed column changes, directly reducing the total number of writes.

After addressing these common causes, your next potential bottleneck might be read amplification if your compaction strategy is still too aggressive or if your data distribution leads to many SSTable lookups for a single read.

Want structured learning?

Take the full Cockroachdb course →