Cassandra doesn’t actually flush data to disk to make room for new writes; it flushes memtables to create immutable SSTables, and only then are old SSTables compacted away to free up disk space.
Let’s see this in action. Imagine we have a keyspace my_keyspace with a simple table my_table:
CREATE KEYSPACE my_keyspace WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE my_keyspace;
CREATE TABLE my_table (
id uuid PRIMARY KEY,
name text,
value int
);
Now, let’s insert some data. Each INSERT statement, or a batch of them, will land in a memtable.
cqlsh -k my_keyspace -e "INSERT INTO my_table (id, name, value) VALUES (uuid(), 'Alice', 10);"
cqlsh -k my_keyspace -e "INSERT INTO my_table (id, name, value) VALUES (uuid(), 'Bob', 20);"
# ... many more inserts ...
As these memtables fill up, Cassandra starts the process of flushing them. You can observe this in the Cassandra system log (system.log). You’ll see messages like:
INFO [memtable_flush_writer-1] 2023-10-27 10:00:00,123 FlushManager.java:138 - Flushing largest memtable (450.000 MiB) to /var/lib/cassandra/data/my_keyspace/my_table-uuid/nb-1-big-Digest.db
This flush operation creates a new SSTable file (e.g., nb-1-big-Digest.db) on disk. The old memtable is then cleared, and new writes start populating a fresh memtable. Over time, multiple SSTables will accumulate for a given table. Compaction is what eventually merges these SSTables and reclaims disk space.
The core problem Cassandra solves is providing a highly available, eventually consistent, distributed key-value store. It achieves this through a distributed hash table architecture where data is partitioned across multiple nodes. Writes are initially buffered in memory (memtables) for speed, and then asynchronously flushed to disk as immutable files (SSTables). Reads might hit memory or multiple SSTables. The "tune memtable flush" part of this deals with optimizing that in-memory-to-disk transition.
The key levers you control are primarily related to memory usage and the flush process itself. These are configured in cassandra.yaml:
memtable_flush_writers: This setting determines how many threads are available to write memtables to disk. Increasing this can speed up flushes if you have sufficient CPU and disk I/O.memtable_heap_space_in_mbandmemtable_offheap_space_in_mb: These define the total memory allocated for memtables (on-heap and off-heap, respectively). If these are too small, memtables will fill up and flush more frequently, potentially leading to more write pressure. If they are too large, flushes might take longer and could cause I/O contention.memtable_cleanup_threshold: This setting, though less directly related to flushing, impacts how aggressively old SSTables are removed after flushes and compactions. A lower threshold means older SSTables are eligible for compaction and removal sooner.
Tuning memtable_flush_writers is about balancing concurrency. If you have many cores and fast disks, you can handle more concurrent flushes. For example, on a system with 16 CPU cores, you might set memtable_flush_writers: 4. This allows up to 4 memtables to be written to disk simultaneously. If this value is too low, flushes become a bottleneck. If it’s too high, threads might spend more time contending for resources (CPU, disk I/O) than actually flushing data.
The size of your memtables directly influences flush frequency. If you have a high write throughput, you’ll want larger memtables to reduce the number of flushes per unit of time. For instance, if your workload is very write-heavy and you see frequent, small flushes, you might increase memtable_heap_space_in_mb from its default (often around 1/4 of heap_size) to 512 (MB). This gives each memtable more room before triggering a flush, reducing the overhead of the flush process itself. However, larger memtables mean longer flush times and potentially higher read latency if a read needs to scan a large, partially flushed memtable.
The core mechanism that makes memtable flushes impactful on write performance is that Cassandra guarantees durability at the time of flush. While writes are acknowledged to the client quickly because they are buffered in memory, the system doesn’t truly consider them "safe" until they’ve been written to an SSTable on disk. If a node crashes before a memtable is flushed, any data in that memtable is lost. Therefore, the flush process is critical for durability, and its efficiency directly affects how quickly Cassandra can process new writes and make them durable.
The most surprising aspect of memtable flushing is how it interacts with read repair. During a flush, Cassandra is essentially taking a snapshot of the in-memory data and writing it out. If a read occurs during a flush for data that exists in both the memtable being flushed and an existing SSTable, Cassandra must reconcile these two versions. This can involve reading from both the memtable and the SSTable, and potentially performing read repair on the conflicting versions. This is why very large memtables or very frequent flushes can sometimes coincidentally increase read latency, not just because of I/O, but because of the added complexity of data reconciliation.
The next thing you’ll likely want to tune is the compaction strategy, as it’s intimately linked to how SSTables are managed after flushing.