Cassandra’s compaction strategy is one of the most impactful decisions you’ll make for disk I/O and read performance.

Let’s see what that looks like in practice. Imagine a users table:

CREATE TABLE users (
    user_id uuid PRIMARY KEY,
    username text,
    email text,
    last_login timestamp
);

We’re going to insert a bunch of data:

from cassandra.cluster import Cluster
from uuid import uuid4
import time

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')

# Ensure the table exists
session.execute("""
    CREATE TABLE IF NOT EXISTS users (
        user_id uuid PRIMARY KEY,
        username text,
        email text,
        last_login timestamp
    );
""")

# Insert 1000 new users
for i in range(1000):
    user_id = uuid4()
    username = f"user_{i}_{uuid4()}"
    email = f"user_{i}@example.com"
    session.execute(
        "INSERT INTO users (user_id, username, email, last_login) VALUES (%s, %s, %s, %s)",
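        # CQL timestamps are epoch milliseconds, so pass an integer rather than the float from time.time()
        (user_id, username, email, int(time.time() * 1000))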
    )
print("Inserted 1000 users.")

Now, what happens to these inserts on disk? Each write is appended to the commit log and applied to an in-memory memtable. When a memtable fills up, Cassandra flushes it to disk as an immutable SSTable. Compaction is the process of merging these SSTables to reduce the number of files on disk, improve read performance, and reclaim space from deleted or overwritten data.

The core problem compaction solves is keeping reads fast under a continuous stream of writes. Writes are fast because they go to memory and are appended sequentially to the commit log; SSTables are never modified in place. Reads, however, would be slow if they had to consult many individual SSTables to assemble a single row. Compaction merges these smaller SSTables into larger ones, keeping only the newest version of each piece of data and optimizing the result for reads.
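The merge step described above can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: it assumes each SSTable is a list of (key, write_time, value) tuples sorted by key, treats a whole row as one unit (Cassandra reconciles individual cells), and drops deletes immediately (Cassandra keeps tombstones until gc_grace_seconds has passed). The names compact and TOMBSTONE are made up for illustration.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Hypothetical marker standing in for a delete (a tombstone).
TOMBSTONE = object()


def compact(sstables):
    """Merge key-sorted SSTables into one, keeping the newest version per key.

    Each SSTable is a list of (key, write_time, value) tuples sorted by key.
    """
    # Stream all rows in key order without loading everything into one sort.
    rows_by_key = heapq.merge(*sstables, key=itemgetter(0))
    output = []
    for key, versions in groupby(rows_by_key, key=itemgetter(0)):
        # Last write wins: keep the version with the highest write_time.
        _, write_time, value = max(versions, key=itemgetter(1))
        # Simplification: deleted rows are dropped right away.
        if value is not TOMBSTONE:
            output.append((key, write_time, value))
    return output


older = [("alice", 1, "a@old.example"), ("bob", 1, "b@example.com")]
newer = [("alice", 2, "a@new.example"), ("bob", 3, TOMBSTONE)]
print(compact([older, newer]))
```

Two input files become one, the overwritten version of alice disappears, and the deleted row for bob no longer takes up space.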

Here’s how you control it:

ALTER TABLE users WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'min_threshold': '4',
    'max_threshold': '32'
};

The class parameter is where you choose your strategy. SizeTieredCompactionStrategy (STCS) is the default and a good starting point. min_threshold and max_threshold are STCS tuning parameters that control when a compaction is triggered: once at least min_threshold SSTables of similar size have accumulated, Cassandra merges them (up to max_threshold at a time) into one larger SSTable. The values shown here, 4 and 32, are the defaults.
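To make the "similar size" rule concrete, here is a simplified Python model of how STCS might bucket SSTables and pick a compaction. It is a sketch under assumptions, not the real algorithm: the 0.5 and 1.5 factors mirror the documented bucket_low and bucket_high defaults, but the function names and the "largest bucket wins" tie-break are hypothetical, and options such as min_sstable_size are ignored.

```python
def size_tiered_buckets(sizes_mb, bucket_low=0.5, bucket_high=1.5):
    """Group SSTable sizes into buckets of roughly similar size."""
    buckets = []
    for size in sorted(sizes_mb):
        for bucket in buckets:
            average = sum(bucket) / len(bucket)
            # A size joins a bucket if it is within [low, high] x the bucket average.
            if bucket_low * average <= size <= bucket_high * average:
                bucket.append(size)
                break
        else:
            # No existing bucket fits, so start a new one.
            buckets.append([size])
    return buckets


def pick_compaction(sizes_mb, min_threshold=4, max_threshold=32):
    """Return the SSTable sizes that would be merged, or [] if nothing qualifies."""
    candidates = [b for b in size_tiered_buckets(sizes_mb) if len(b) >= min_threshold]
    if not candidates:
        return []
    # Hypothetical tie-break: prefer the bucket holding the most SSTables.
    chosen = max(candidates, key=len)
    return chosen[:max_threshold]


print(pick_compaction([10, 11, 9, 12, 100, 400]))  # four similar-sized SSTables qualify
print(pick_compaction([10, 11, 9, 100]))           # only three similar SSTables: nothing to do
```

Raising min_threshold in this model delays compaction until more similar-sized SSTables pile up, which trades fewer compactions for more files per read.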

The key levers you control, and the metrics that tell you whether they are working, are:

  • Compaction Strategy Class: The fundamental algorithm for merging SSTables.
  • Compaction Throttling: Limits on the number of concurrent compactions and the I/O bandwidth they consume.
  • Compaction Window (TimeWindowCompactionStrategy): For time-series data, this groups SSTables into fixed time windows and compacts within each window, preventing older data from being merged with very recent data.
  • pending_compactions: A metric to watch rather than a setting. If this number grows consistently, your compactions aren’t keeping up with writes.
  • bytes_compacted: Another metric, showing the volume of data compaction has processed.
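The time-window idea from the list above can be illustrated with a small Python sketch. It assumes each SSTable is described by a name and the newest write timestamp it contains (in seconds), and it assigns the SSTable to the window that timestamp falls into. The function name and the input shape are hypothetical; this models the grouping only, not the compaction that follows.

```python
from collections import defaultdict


def group_by_window(sstables, window_seconds):
    """Group (name, max_timestamp_seconds) pairs by the time window they fall into.

    Returns a dict mapping each window's start time to the SSTable names in it.
    """
    windows = defaultdict(list)
    for name, max_timestamp in sstables:
        # Round the timestamp down to the start of its window.
        window_start = max_timestamp - (max_timestamp % window_seconds)
        windows[window_start].append(name)
    return dict(windows)


ONE_DAY = 86_400
sstables = [("a", 10), ("b", 86_399), ("c", 86_400), ("d", 200_000)]
print(group_by_window(sstables, ONE_DAY))
```

Only SSTables that share a window (here "a" and "b") are candidates to be merged with each other. On a real table, the window is set with the compaction_window_unit and compaction_window_size options of TimeWindowCompactionStrategy.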

The most surprising thing about Cassandra compaction is that it is not a background task you can set and forget. It runs in the background, but it competes with reads and writes for disk I/O and directly affects your read latency and disk utilization. If you’re not actively monitoring and tuning it, you’re likely leaving performance on the table or heading for an I/O bottleneck.

When you choose LeveledCompactionStrategy (LCS), Cassandra keeps SSTables at a small, uniform size and organizes them into "levels." Newly flushed SSTables land in Level 0. An SSTable from Level 0 is merged with the overlapping SSTables in Level 1 to create new SSTables in Level 1, and the process continues up through the levels. Within each level above Level 0, SSTables cover non-overlapping key ranges, so a read needs to check at most one SSTable per level. The key benefit is that read performance becomes more predictable because the number of SSTables a read needs to check is limited by the number of levels, not by the total number of SSTables on disk.
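The following Python sketch shows why leveled layouts bound read cost. It models each SSTable as a (first_key, last_key) range and assumes Level 0 may contain overlapping ranges while every higher level does not. The function name and data layout are illustrative assumptions, not Cassandra internals.

```python
def sstables_to_check(levels, key):
    """Return (level_number, key_range) for every SSTable that could hold `key`.

    levels[0] is Level 0 and may contain overlapping ranges; in higher levels
    the ranges are assumed not to overlap.
    """
    candidates = []
    for level_number, ranges in enumerate(levels):
        for first_key, last_key in ranges:
            if first_key <= key <= last_key:
                candidates.append((level_number, (first_key, last_key)))
    return candidates


levels = [
    [("a", "z"), ("c", "p")],                          # Level 0: overlapping
    [("a", "f"), ("g", "m"), ("n", "z")],              # Level 1: non-overlapping
    [("a", "c"), ("d", "h"), ("i", "q"), ("r", "z")],  # Level 2: non-overlapping
]
for level_number, key_range in sstables_to_check(levels, "k"):
    print(level_number, key_range)
```

There are nine SSTables in this layout, but a read for key "k" touches only four: both Level 0 files plus exactly one file in each higher level. Adding more SSTables to Level 1 or Level 2 would not change that count.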
