Cassandra’s Time-To-Live (TTL) feature is often presented as a simple way to automatically expire old data, but expiry doesn’t remove anything from disk; instead, expired data turns into tombstones that can severely degrade read performance if not managed carefully.

Let’s see what happens when TTL expires data in a Cassandra table.

Imagine a simple users table:

CREATE TABLE users (
    user_id uuid PRIMARY KEY,
    username text,
    email text,
    last_login timestamp
);

We insert some data with a TTL of 60 seconds:

INSERT INTO users (user_id, username, email, last_login)
VALUES (uuid(), 'alice', 'alice@example.com', toTimestamp(now()))
USING TTL 60;

After 60 seconds, a read for alice’s record returns nothing. But what’s happening under the hood? It’s not that the data is gone from disk. Each cell was written together with its TTL and expiration time, and once that time passes Cassandra treats the expired cell as a "tombstone": a marker indicating that this data should be considered deleted. Nothing new is written at the moment of expiry.

Here’s how it works during a read:

  1. Read Request: You query for alice’s user_id.
  2. SSTable Scan: Cassandra checks its on-disk data files (SSTables) for this user_id.
  3. Tombstone Encountered: It finds the SSTable containing alice’s data. Each cell carries its TTL and expiration time, and because that time has passed, the cell now counts as a tombstone.
  4. Data Filtration: Cassandra reconciles everything it found for the row by timestamp. An expired cell is never returned, and a tombstone that is newer than or the same age as a piece of data shadows that data, so it is filtered out and not returned to the client.
  5. Compaction’s Role: Eventually, during SSTable compaction, tombstones are used to decide which data to discard. However, until compaction happens, the tombstones and the data they shadow are still on disk and must be processed during reads.
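The filtering step can be sketched in a few lines of Python. This is a toy model, not Cassandra’s implementation: the `Cell` class, its field names, and `read_cell` are hypothetical, and time is plain integer seconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """Toy stand-in for one stored column value (hypothetical, not a Cassandra class)."""
    value: str
    write_time: int            # seconds since some epoch
    ttl: Optional[int] = None  # None means the cell never expires

    def expires_at(self) -> Optional[int]:
        return None if self.ttl is None else self.write_time + self.ttl

    def is_live(self, now: int) -> bool:
        expiry = self.expires_at()
        return expiry is None or now < expiry


def read_cell(cell: Cell, now: int) -> Optional[str]:
    """Return the value if the cell is live; an expired cell acts like a tombstone."""
    return cell.value if cell.is_live(now) else None


email = Cell("alice@example.com", write_time=1_000, ttl=60)
print(read_cell(email, now=1_030))  # within the TTL: the value is returned
print(read_cell(email, now=1_060))  # past the TTL: filtered out, although the cell still exists
```

Note that `email` is never modified or removed in this sketch. The only thing that changes is the clock, which mirrors how expired data stays on disk until compaction rewrites the SSTable.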

The problem arises when you have a high rate of data expiration via TTL, leading to a large number of tombstones. Each read request that might have touched expired data now has to scan for and process these tombstones. This dramatically increases read latency because Cassandra has to read more data from disk (the tombstones and the actual deleted data) and perform more filtering operations.

Consider this: if you have millions of rows expiring every hour via TTL, your read paths will increasingly spend time sifting through tombstones instead of actual live data. This is especially painful for range scans or queries that touch many partitions, as each partition might contain tombstones.

The key to managing TTL effectively is understanding that it’s not a "delete and forget" mechanism. It’s a "mark for deletion" mechanism that incurs overhead.

So, what’s the actual mechanism that makes this slow? When a read request comes in for a partition that has tombstones, Cassandra has to:

  1. Read the partition key from the index.
  2. Seek to the corresponding SSTable(s).
  3. Read the relevant row(s) from the SSTable(s).
  4. For each row, check if a tombstone exists for that particular column or row, and if the tombstone is newer than the data.
  5. If a tombstone is found and is "valid," the data is discarded.
  6. This process repeats for all relevant SSTables that might contain the data or tombstones for that partition.
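The loop over SSTables can be modelled roughly as follows. This is a simplified sketch under stated assumptions: `Entry`, `supersedes`, and `read_partition` are hypothetical names, an SSTable is reduced to a dictionary for one partition, and real Cassandra reconciles at the cell level with more rules than shown here.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    """One version of a row in one SSTable (hypothetical toy model)."""
    timestamp: int          # write timestamp; the newest version wins
    value: Optional[float]  # None marks a tombstone


# In this sketch an SSTable is just {clustering_key: Entry} for a single partition.
SSTable = Dict[str, Entry]


def supersedes(candidate: Entry, current: Entry) -> bool:
    """Newer timestamp wins; on a tie the tombstone wins."""
    if candidate.timestamp != current.timestamp:
        return candidate.timestamp > current.timestamp
    return candidate.value is None


def read_partition(sstables: List[SSTable]) -> Tuple[Dict[str, float], int]:
    """Merge all SSTables, keep the winning entry per key, then drop tombstones.

    Returns the live rows plus the number of tombstones that had to be
    read and reconciled along the way.
    """
    winners: Dict[str, Entry] = {}
    tombstones_scanned = 0
    for sstable in sstables:
        for key, entry in sstable.items():
            if entry.value is None:
                tombstones_scanned += 1
            current = winners.get(key)
            if current is None or supersedes(entry, current):
                winners[key] = entry
    live = {k: e.value for k, e in winners.items() if e.value is not None}
    return live, tombstones_scanned


older = {"t1": Entry(10, 1.5), "t2": Entry(11, 2.5)}
newer = {"t1": Entry(20, None), "t3": Entry(21, 3.5)}
live_rows, scanned = read_partition([older, newer])
print(live_rows, scanned)
```

Every tombstone still passes through the merge even though it contributes nothing to the result, which is the extra work described in the steps above.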

The more tombstones, the more disk I/O and CPU Cassandra expends on filtering data that will ultimately be discarded. This is why a high tombstone count can cripple read performance, even if the data itself is no longer logically present.

The primary lever you control is the gc_grace_seconds table property. It sets how long a tombstone must be retained before compaction is allowed to purge it, along with the data it shadows. The default is 10 days (864000 seconds). If you have a high churn rate (lots of writes and deletes/TTL expirations) and your gc_grace_seconds is too high, you can accumulate a massive number of tombstones before they are finally purged.
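To make the timing concrete, here is a small sketch of when a tombstone becomes eligible for purging. The function names are hypothetical, and the model is deliberately conservative: it assumes the grace period starts when the data expires, whereas some Cassandra versions are understood to count it from the original write time for TTL’d cells, which would make purging possible earlier. It also ignores that a compaction actually has to run on the SSTables involved.

```python
DEFAULT_GC_GRACE_SECONDS = 864_000  # 10 days


def earliest_purge_time(deletion_time: int,
                        gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> int:
    """Earliest moment (in seconds) at which compaction may drop the tombstone.

    Assumption: deletion_time is the moment the data expired or was deleted.
    """
    return deletion_time + gc_grace_seconds


def is_purgeable(deletion_time: int, now: int,
                 gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> bool:
    return now >= earliest_purge_time(deletion_time, gc_grace_seconds)


# Data written at t=0 with a 24-hour TTL expires at t=86400.
expiry = 86_400
print(earliest_purge_time(expiry))                           # with the default grace period
print(earliest_purge_time(expiry, gc_grace_seconds=86_400))  # with a 1-day grace period
```

Under these assumptions, data that is logically gone after one day can occupy disk and show up in reads for up to eleven days with the default setting.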

The most surprising truth about Cassandra’s TTL is that it’s not an optimization for removing data; it’s a mechanism that adds overhead to reads until compaction has a chance to clean up.

Let’s look at a real-world scenario. Suppose we have a table events where we store sensor readings, and we want to keep them for 24 hours using TTL:

CREATE TABLE events (
    device_id uuid,
    event_time timestamp,
    reading float,
    PRIMARY KEY (device_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC)
  AND gc_grace_seconds = 86400; -- 1 day; the default is 864000 seconds (10 days)

We insert data with USING TTL 86400. If each device generates an event every minute, every million devices produce 60 million writes per hour and, once the table has been running for a day, 60 million TTL expirations per hour. A single device_id partition holds up to 1,440 live rows plus every expired row that compaction has not yet purged, so a read for that device_id has to scan through all the SSTables containing its data and tombstones.
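A quick back-of-the-envelope calculation puts rough numbers on this workload. The fleet size of one million devices is an assumption chosen for illustration, and the estimate of retained expired rows assumes tombstones are kept for gc_grace_seconds after expiry and purged promptly afterwards, which real compaction does not guarantee.

```python
devices = 1_000_000                 # assumed fleet size, for illustration only
events_per_device_per_minute = 1
ttl_seconds = 86_400                # USING TTL 86400
gc_grace_seconds = 86_400           # as set on the events table

writes_per_hour = devices * events_per_device_per_minute * 60

# In steady state every write expires exactly one TTL later,
# so the expiration rate equals the write rate.
expirations_per_hour = writes_per_hour

# Rows per device_id partition.
live_rows_per_partition = events_per_device_per_minute * ttl_seconds // 60
expired_rows_retained = events_per_device_per_minute * gc_grace_seconds // 60

print(f"{expirations_per_hour:,} expirations per hour")
print(f"{live_rows_per_partition:,} live rows per partition")
print(f"{expired_rows_retained:,} expired rows per partition awaiting purge (estimate)")
```

With these settings a full-partition read may process roughly as many expired rows as live ones. Raising gc_grace_seconds to the default of 864000 would increase the estimate of retained expired rows tenfold.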

If you query for a specific device_id and event_time that has expired, Cassandra will still have to read the partition key, potentially seek through multiple SSTables to find the partition, and then check for tombstones for that specific event_time. If a tombstone is found and is newer than the data (which it will be, if TTL has expired), the row is filtered. This process is repeated for every SSTable that contains data for that partition.

The gc_grace_seconds parameter is crucial. If you set it too low (e.g., 0 or 1 hour) and you have nodes that go down and come back up, you risk deleted data reappearing: compaction on the healthy replicas can purge a tombstone before the recovering node has received it, and that node’s stale copy then looks like live data again. This is generally a bigger concern for explicit deletes than for TTL’d writes, because every replica that received the write also knows when it expires. However, if you have a very high churn rate, keeping gc_grace_seconds at the default 10 days can lead to a massive buildup of tombstones. A suitable value for gc_grace_seconds is often lower than the default, for instance matching or slightly exceeding your longest TTL, but it must be chosen with your cluster’s stability and repair strategy in mind: repairs need to complete on every node within gc_grace_seconds. For example, if your longest TTL is 7 days, setting gc_grace_seconds to 7 days (604800 seconds) or 8 days (691200 seconds) might be more appropriate than 10 days.
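The relationship between gc_grace_seconds and the repair schedule can be written down as a simple sanity check. This is a sketch of the usual rule of thumb rather than anything Cassandra enforces, and `days` and `gc_grace_is_safe` are hypothetical helper names.

```python
SECONDS_PER_DAY = 86_400


def days(n: int) -> int:
    """Convert whole days to seconds, the unit gc_grace_seconds expects."""
    return n * SECONDS_PER_DAY


def gc_grace_is_safe(gc_grace_seconds: int, repair_interval_seconds: int) -> bool:
    """True if every node is repaired at least once within the grace window.

    Rule of thumb only: it ignores how long a repair takes to complete.
    """
    return repair_interval_seconds < gc_grace_seconds


print(days(7), days(8))                        # the two candidate values from the text
print(gc_grace_is_safe(days(8), days(7)))      # weekly repair, 8-day grace
print(gc_grace_is_safe(3_600, days(7)))        # weekly repair, 1-hour grace
```

A one-hour grace period with weekly repairs fails the check, which is the situation in which purged tombstones can let stale data resurface.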

The common misconception is that TTL automatically cleans up disk space. In reality, the data remains on disk until an SSTable compaction occurs, and the tombstones are used to mark it for removal. During this time, reads might be slowed down by the very tombstones that signal data expiration.

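This behavior is what the tombstone_threshold and tombstone_compaction_interval compaction strategy options are meant to address. tombstone_threshold is a ratio (default 0.2): when the estimated share of droppable tombstones in an SSTable exceeds it, Cassandra may compact that SSTable on its own to purge them, provided the SSTable is at least tombstone_compaction_interval old (default 86400 seconds, i.e. one day). Separately, the read path has its own guards in cassandra.yaml: a query that scans more than tombstone_warn_threshold tombstones (default 1,000) logs a warning, and one that scans more than tombstone_failure_threshold (default 100,000) is aborted with a TombstoneOverwhelmingException.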

The next problem you’ll likely encounter is understanding how gc_grace_seconds interacts with anti-entropy repair and potential data loss.
