Cassandra’s disk is filling up because deleted data isn’t being removed fast enough, leading to read timeouts.
Common Causes and Fixes for Tombstone Accumulation
1. Large Deletes Across Many Partitions
- Diagnosis: Run `nodetool cfstats <keyspace_name>` (called `nodetool tablestats` in newer releases) on each node and look at `Average tombstones per slice (last five minutes)` and `Maximum tombstones per slice (last five minutes)`. High values, especially relative to the live cells per slice, indicate that reads are scanning many tombstones. Also, check `nodetool compactionstats` for pending compactions. If the pending count is high and not decreasing, compactions are struggling to keep up.
- Cause: When you issue a `DELETE` statement that affects a large number of rows (e.g., `DELETE FROM my_table WHERE partition_key = 'some_value';` when `some_value` has many rows, or many such deletes spread across partitions), Cassandra marks the data as deleted but doesn’t immediately reclaim the disk space. (CQL requires a `WHERE` clause on `DELETE`; clearing a whole table is done with `TRUNCATE`, which does not write tombstones.) These "tombstones" must be processed by compactions. If compactions can’t keep up with the rate of large deletes, tombstones accumulate.
- Fix: Set `compaction_throughput_mb_per_sec` in `cassandra.yaml` to a higher value (e.g., `100` or `200`). This allows compactions to run faster, processing tombstones more aggressively.
- Why it works: Increasing compaction throughput directly speeds up the process of merging SSTables and discarding deleted data whose tombstones are older than `gc_grace_seconds`.
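The per-table check can be made repeatable by scanning captured `nodetool tablestats` (or `cfstats`) output. The following is a minimal sketch, assuming the `Table:` and `... tombstones per slice (last five minutes):` line labels that recent versions print; the function name `tombstone_hotspots`, the sample text, and the cutoff of 1000 are illustrative assumptions, not part of Cassandra.

```python
import re

# Hypothetical excerpt of `nodetool tablestats` output, trimmed to the relevant lines.
SAMPLE = """
Keyspace : shop
    Table: orders
        Average tombstones per slice (last five minutes): 812.4
        Maximum tombstones per slice (last five minutes): 4768
    Table: users
        Average tombstones per slice (last five minutes): 1.0
        Maximum tombstones per slice (last five minutes): 3
"""

def tombstone_hotspots(tablestats_text, max_threshold=1000.0):
    """Return (table, average, maximum) for tables whose maximum
    tombstones per slice exceeds max_threshold (an assumed cutoff)."""
    hotspots = []
    table, avg = None, None
    for raw in tablestats_text.splitlines():
        line = raw.strip()
        m = re.match(r"Table(?: \(index\))?: (\S+)", line)
        if m:
            table, avg = m.group(1), None
            continue
        m = re.match(r"Average tombstones per slice \(last five minutes\): (\S+)", line)
        if m and table is not None:
            avg = float(m.group(1))
            continue
        m = re.match(r"Maximum tombstones per slice \(last five minutes\): (\S+)", line)
        if m and table is not None:
            peak = float(m.group(1))
            if peak > max_threshold:
                hotspots.append((table, avg, peak))
    return hotspots

if __name__ == "__main__":
    for name, avg, peak in tombstone_hotspots(SAMPLE):
        print(f"{name}: avg={avg} max={peak}")
```

In practice the text would come from running the command on each node; the parsing is kept separate from the command invocation so it can be tried on saved output first.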
2. Insufficient Compaction Strategy for Write/Delete Patterns
- Diagnosis: Examine the `compaction` setting for your table in `cqlsh` (`DESCRIBE TABLE <keyspace_name>.<table_name>;`). For tables with frequent deletes and updates, `LeveledCompactionStrategy` (LCS) or `SizeTieredCompactionStrategy` (STCS) might be too slow to keep up if not tuned correctly.
- Cause: STCS, the default, can lead to many SSTables and slower compactions when there are many small writes or deletes. LCS is better for read-heavy workloads and handles tombstones more efficiently but can have higher disk I/O. If your delete pattern is high-volume, the chosen strategy might not be optimal.
- Fix: For tables with high delete rates, consider switching to `LeveledCompactionStrategy`. Existing SSTables must be rewritten into the leveled structure, so expect extra I/O after the switch.
      ALTER TABLE <keyspace_name>.<table_name> WITH compaction = {'class': 'LeveledCompactionStrategy'};
  After this change, trigger a major compaction to rewrite existing data into the new leveled structure right away:
      nodetool compact <keyspace_name> <table_name>
- Why it works: LCS organizes SSTables into levels, ensuring that compactions involve fewer SSTables and are more predictable, leading to more efficient tombstone removal over time.
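When the strategy change has to be applied to several tables, the statement can be generated rather than typed by hand. The sketch below is one possible helper; `alter_compaction_cql` is a hypothetical name, it performs no validation or escaping of identifiers, and the `tombstone_threshold` value in the usage line is only an illustrative setting for that compaction sub-option.

```python
def alter_compaction_cql(keyspace, table, strategy, **options):
    """Build an ALTER TABLE statement that sets the compaction class
    plus any compaction sub-options passed as keyword arguments."""
    settings = {"class": strategy}
    # CQL accepts compaction sub-option values as quoted strings.
    settings.update({name: str(value) for name, value in options.items()})
    body = ", ".join(f"'{key}': '{value}'" for key, value in settings.items())
    return f"ALTER TABLE {keyspace}.{table} WITH compaction = {{{body}}};"

if __name__ == "__main__":
    print(alter_compaction_cql("shop", "orders", "LeveledCompactionStrategy",
                               tombstone_threshold=0.1))
```

The resulting string can be pasted into `cqlsh` or passed to a driver session; reviewing it before execution is advisable since the helper trusts its inputs.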
3. TTL Expiration Not Keeping Up with Deletes
- Diagnosis: `nodetool tablehistograms <keyspace_name> <table_name>` reports partition size, cell count, and latency percentiles, but no tombstone ratio. To estimate how much of a table consists of tombstones, run the `sstablemetadata` tool against the table’s SSTable data files and read the `Estimated droppable tombstones` value. If that ratio is high (e.g., > 0.5), a large share of the SSTable is estimated to be tombstones that compaction could drop. Also, check the `default_time_to_live` setting for the table.
- Cause: If you are using TTL on your data and also issuing manual `DELETE` statements, or if TTL is set too high, tombstones can accumulate faster than they are garbage collected. Expired TTL cells are treated as tombstones, and Cassandra’s garbage collection of tombstones is tied to SSTable lifespan and compaction. If compactions do not reach the SSTables holding expired data, those tombstones persist.
- Fix: Lower the `default_time_to_live` for the affected table, or ensure that manual deletes are not outstripping TTL expiration:
      ALTER TABLE <keyspace_name>.<table_name> WITH default_time_to_live = <new_ttl_in_seconds>;
  For example, to set TTL to 7 days:
      ALTER TABLE <keyspace_name>.<table_name> WITH default_time_to_live = 604800;
  The new default applies only to data written after the change; existing cells keep the TTL they were written with. Then, trigger a major compaction to clean up existing tombstones that are already eligible for GC:
      nodetool compact <keyspace_name> <table_name>
- Why it works: A shorter TTL ensures that newly written data (and its associated tombstones) becomes eligible for garbage collection by compactions sooner.
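The TTL value is expressed in seconds and is stored with each cell at write time, so the arithmetic is worth checking before altering a table. A minimal sketch, with hypothetical helper names `ttl_seconds` and `expires_at`:

```python
from datetime import datetime, timedelta

def ttl_seconds(days=0, hours=0):
    """Convert a retention period into the seconds value that
    default_time_to_live expects."""
    return int(timedelta(days=days, hours=hours).total_seconds())

def expires_at(written_at, ttl):
    """Time at which a cell written at `written_at` with the given TTL expires."""
    return written_at + timedelta(seconds=ttl)

if __name__ == "__main__":
    week = ttl_seconds(days=7)
    print(week)
    print(expires_at(datetime(2024, 1, 1), week))
```

Seven days works out to 7 × 24 × 3600 = 604800 seconds, which is the value used in the 7-day example.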
4. Low gc_grace_seconds Setting
- Diagnosis: Check the `gc_grace_seconds` value for your table:
      SELECT gc_grace_seconds FROM system_schema.tables WHERE keyspace_name = '<keyspace_name>' AND table_name = '<table_name>';
  If it’s set to a very low value (e.g., 10 seconds or less), this can be a problem.
- Cause: `gc_grace_seconds` is a safety mechanism. It determines how long Cassandra waits before garbage collecting tombstones for data that has been deleted or expired via TTL. A low value means tombstones can be removed before all replicas have received the delete marker, potentially leading to data resurrection. However, if it’s too high, it can delay tombstone cleanup. The default is 10 days (864000 seconds). If it was intentionally lowered to speed up cleanup, it might now be too low for your cluster’s repair schedule and network topology.
- Fix: If `gc_grace_seconds` was set very low (e.g., < 1 hour), increase it to a more reasonable value like `864000` (10 days).
      ALTER TABLE <keyspace_name>.<table_name> WITH gc_grace_seconds = 864000;
  If `gc_grace_seconds` is already high and tombstones are accumulating, the issue is likely with compaction performance or delete volume, not this setting itself.
- Why it works: A sufficient `gc_grace_seconds` ensures that a delete marker has a high probability of reaching all replicas before the tombstone is eligible for garbage collection, preventing data from reappearing. While this delays cleanup, it’s crucial for consistency. A value that is too low leads to data resurrection and subtle consistency issues rather than tombstone accumulation, so it is a less likely cause of the accumulation itself, but a common related issue.
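The timing rule behind `gc_grace_seconds` can be sketched as simple date arithmetic. The helpers below are hypothetical and deliberately simplified: they only model the grace period, whereas an actual purge also requires a compaction to process the SSTables that hold the tombstone and the data it shadows.

```python
from datetime import datetime, timedelta

DEFAULT_GC_GRACE_SECONDS = 864000  # 10 days, the default quoted above

def earliest_purge_time(deleted_at, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """Earliest moment a tombstone written at `deleted_at` may be dropped."""
    return deleted_at + timedelta(seconds=gc_grace_seconds)

def is_purgeable(deleted_at, now, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """True once the grace period has fully elapsed (simplified model)."""
    return now > earliest_purge_time(deleted_at, gc_grace_seconds)

if __name__ == "__main__":
    deleted = datetime(2024, 1, 1)
    print(earliest_purge_time(deleted))
    print(is_purgeable(deleted, datetime(2024, 1, 5)))
    print(is_purgeable(deleted, datetime(2024, 1, 12)))
```

With the default, a delete issued on 1 January cannot be compacted away before 11 January, which is why a backlog of recent deletes is expected even on a healthy cluster.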
5. Under-provisioned Compaction Threads or I/O
- Diagnosis: Monitor your system’s CPU and disk I/O. Use `nodetool tpstats` to check the `CompactionExecutor` thread pool. If threads are consistently active or the `Pending` count is high, it indicates a bottleneck.
- Cause: Compactions are I/O and CPU intensive. If your cluster is heavily loaded with writes and deletes, or if the underlying hardware (disks, CPUs) is slow, the compaction threads may not be able to keep up.
- Fix:
  - Increase Compaction Threads: Adjust `concurrent_compactors` in `cassandra.yaml`. A common starting point is `8` or `16`, but this depends heavily on your hardware.
        # cassandra.yaml
        concurrent_compactors: 16
  - Improve Disk Performance: Ensure you are using fast SSDs, ideally NVMe, and that your RAID configuration is optimal for Cassandra.
  - Scale Out: Add more nodes to your cluster to distribute the load, including compactions.
- Why it works: More threads allow more compactions to run in parallel, and faster hardware reduces the time each compaction takes, improving the overall rate of SSTable merging and tombstone removal.
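The thread-pool check lends itself to a small script that can be run periodically. This sketch assumes the usual `nodetool tpstats` column order (pool name, active, pending, completed, ...); the function name `compaction_backlog` and the sample rows are hypothetical.

```python
# Hypothetical excerpt of `nodetool tpstats` output.
SAMPLE = """
Pool Name                    Active   Pending   Completed   Blocked  All time blocked
MutationStage                     0         0     1834562         0                 0
CompactionExecutor                2        37       51234         0                 0
"""

def compaction_backlog(tpstats_text):
    """Return the active and pending counts for CompactionExecutor,
    or None when the pool is not present in the text."""
    for line in tpstats_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "CompactionExecutor":
            return {"active": int(fields[1]), "pending": int(fields[2])}
    return None

if __name__ == "__main__":
    print(compaction_backlog(SAMPLE))
```

Sampling this value over time shows whether the pending count is draining or growing, which is more informative than a single reading.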
6. Unresolved Tombstones Due to Infrequent Repairs
- Diagnosis: Check the output of `nodetool repair -pr` or use a repair tool like Cassandra Reaper. Look for nodes that are consistently out of sync or take a very long time to repair.
- Cause: If nodes are down or network issues prevent `nodetool repair` from running regularly, replicas might not receive all delete markers. When a node eventually comes back online or the network issue is resolved, it might still have "live" data that was deleted on other nodes. This can lead to tombstones being written again after repair (the "live" data is streamed back, followed by the delete marker), or to tombstones being missed entirely. This is less about accumulation and more about persistence of tombstones that should have been cleared.
- Fix: Ensure regular, full anti-entropy repairs are performed across the cluster. Use a tool like Cassandra Reaper to schedule, automate, and monitor repairs.
      # Example of running a full repair for a keyspace
      nodetool repair --full <keyspace_name>
- Why it works: Regular repairs ensure that all nodes have consistent data, including the correct tombstone markers, allowing compactions to eventually purge the deleted data across the entire cluster.
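A commonly cited guideline is that every replica should be repaired at least once within `gc_grace_seconds`, so that delete markers reach all nodes before they become purgeable. The sketch below turns that guideline into a quick check; `repair_margin_days` is a hypothetical helper, and the interval and duration figures are assumptions you would replace with your own schedule.

```python
SECONDS_PER_DAY = 86400

def repair_margin_days(gc_grace_seconds, repair_interval_days, repair_duration_days=0.0):
    """Days of slack between gc_grace_seconds and a full repair cycle
    (interval plus run time). A negative result means tombstones can
    become purgeable before every replica has been repaired."""
    grace_days = gc_grace_seconds / SECONDS_PER_DAY
    return grace_days - (repair_interval_days + repair_duration_days)

if __name__ == "__main__":
    # Weekly repairs that take about a day, with the 10-day default grace period.
    print(repair_margin_days(864000, repair_interval_days=7, repair_duration_days=1))
    # Fortnightly repairs leave no slack with the same grace period.
    print(repair_margin_days(864000, repair_interval_days=14))
```

A small positive margin is a signal to either shorten the repair cycle or keep `gc_grace_seconds` at a value that comfortably covers it.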
After fixing tombstone accumulation, you might encounter ReadTimeoutException or WriteTimeoutException on specific partitions if they are particularly large or have a high number of tombstones that are still being processed.