Cassandra’s disk is filling up because deleted data isn’t being removed fast enough, leading to read timeouts.

Common Causes and Fixes for Tombstone Accumulation

1. Large Deletes Across Many Partitions

  • Diagnosis: Run nodetool tablestats <keyspace_name> (nodetool cfstats in older releases) on each node and look at "Average tombstones per slice (last five minutes)" and "Maximum tombstones per slice (last five minutes)". High values mean reads are scanning many tombstones. Also check nodetool compactionstats for pending compactions. If the pending count is high and not decreasing, compactions are struggling to keep up.
  • Cause: When you issue a DELETE that covers a lot of data (e.g., DELETE FROM my_table WHERE partition_key = 'some_value'; when some_value has many rows, or many such deletes spread across partitions), Cassandra does not remove the data in place. It writes deletion markers called tombstones, and the disk space is reclaimed only when compaction merges the tombstones with the data they shadow. If compactions can’t keep up with the rate of deletes, tombstones and the shadowed data accumulate. To empty an entire table, use TRUNCATE instead; CQL does not accept a DELETE without a WHERE clause.
  • Fix: Raise the compaction throttle, compaction_throughput_mb_per_sec in cassandra.yaml (renamed compaction_throughput in Cassandra 4.1), to a higher value (e.g., 100 or 200) if your disks have headroom. You can apply the change without a restart using nodetool setcompactionthroughput <value_in_mb>.
  • Why it works: A higher throughput limit lets compactions merge SSTables faster, so tombstones that are past gc_grace_seconds, and the data they shadow, are discarded sooner.
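
To get a feel for how much the throttle matters, the sketch below estimates a lower bound on the time needed to rewrite a compaction backlog at different throughput limits. The 500 GiB backlog is a made-up figure, and 16 MiB/s is assumed here as the throttle shipped by default in many releases; the estimate ignores CPU limits and competing I/O, so real compactions will take longer.

```python
def hours_to_compact(backlog_gib: float, throughput_mib_per_sec: float) -> float:
    """Lower-bound hours to rewrite `backlog_gib` at the given throttle.

    Assumes the throttle is the only bottleneck. A value of 0 means
    "unthrottled" in Cassandra, so it is rejected here because no
    estimate can be derived from it.
    """
    if throughput_mib_per_sec <= 0:
        raise ValueError("throughput must be positive for this estimate")
    seconds = backlog_gib * 1024 / throughput_mib_per_sec
    return seconds / 3600


if __name__ == "__main__":
    backlog_gib = 500  # hypothetical backlog, not a measured value
    for throttle in (16, 100, 200):
        hours = hours_to_compact(backlog_gib, throttle)
        print(f"{throttle:>3} MiB/s -> about {hours:.1f} h")
```

If the estimate at the current throttle is longer than the interval between large deletes, compaction cannot catch up no matter how long you wait.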

2. Insufficient Compaction Strategy for Write/Delete Patterns

  • Diagnosis: Examine the compaction setting for your table in cqlsh (DESCRIBE TABLE <keyspace_name>.<table_name>;) and note the 'class' it uses. For tables with frequent deletes and updates, the configured strategy, whether SizeTieredCompactionStrategy (STCS) or LeveledCompactionStrategy (LCS), might be too slow to keep up if not tuned correctly.
  • Cause: STCS, the default, can lead to many SSTables and slower compactions when there are many small writes or deletes. LCS is better for read-heavy workloads and handles tombstones more efficiently but can have higher disk I/O. If your delete pattern is high-volume, the chosen strategy might not be optimal.
  • Fix: For tables with high delete rates, consider switching to LeveledCompactionStrategy. The change is applied online, but Cassandra then rewrites the table’s existing SSTables into levels, which is I/O intensive.
    ALTER TABLE <keyspace_name>.<table_name> WITH compaction = {'class': 'LeveledCompactionStrategy'};
    
    After this change, background compaction re-levels the existing SSTables; watch progress with nodetool compactionstats. If you want to force the rewrite rather than wait, you can trigger a major compaction:
    nodetool compact <keyspace_name> <table_name>
    
  • Why it works: LCS organizes SSTables into levels, ensuring that compactions involve fewer SSTables and are more predictable, leading to more efficient tombstone removal over time.

3. TTL Expiration Not Keeping Up with Deletes

  • Diagnosis: Run the sstablemetadata tool against the table’s SSTable files and look at "Estimated droppable tombstones". A high ratio (e.g., > 0.5) suggests that a large share of the cells in that SSTable are tombstones or expired data waiting to be purged. nodetool tablestats <keyspace_name>.<table_name> also reports how many tombstones recent reads scanned per slice. Finally, check the default_time_to_live setting for the table.
  • Cause: Data written with a TTL is treated as a tombstone once it expires. If you use TTL and also issue manual DELETE statements, or if the TTL is long so data stays live for a long time, tombstones can be created faster than compaction purges them. Tombstones are dropped only when the SSTables that hold them are compacted, and only after gc_grace_seconds has passed. SSTables that are rarely selected for compaction can therefore keep expired data and tombstones on disk for a long time.
  • Fix: Lower the default_time_to_live for the affected table or ensure that manual deletes are not outstripping TTL expiration. The new default applies only to data written after the change; existing rows keep the TTL they were written with.
    ALTER TABLE <keyspace_name>.<table_name> WITH default_time_to_live = <new_ttl_in_seconds>;
    
    For example, to set TTL to 7 days:
    ALTER TABLE <keyspace_name>.<table_name> WITH default_time_to_live = 604800;
    
    Changing the TTL does not make existing data expire sooner. To clean up tombstones and expired data that are already eligible for GC, trigger a major compaction:
    nodetool compact <keyspace_name> <table_name>
    
  • Why it works: A shorter TTL means newly written data expires sooner, so it becomes eligible for purging by compaction sooner.
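
One way to quantify how much of an SSTable is expired or deleted data is the "Estimated droppable tombstones" line printed by the sstablemetadata tool. The sketch below extracts that ratio from captured output. The sample text is illustrative only (the path is hypothetical and the exact layout varies between Cassandra versions), and the 0.2 cut-off is an assumption that mirrors the usual default of the tombstone_threshold compaction option.

```python
import re
from typing import Optional

# Illustrative excerpt only; real sstablemetadata output has many more lines.
SAMPLE = """\
SSTable: /var/lib/cassandra/data/my_ks/my_table/nb-1-big
Estimated droppable tombstones: 0.62
"""

_RATIO = re.compile(
    r"Estimated droppable tombstones:\s*([0-9.]+(?:[eE][-+]?[0-9]+)?)"
)


def droppable_ratio(metadata_text: str) -> Optional[float]:
    """Return the estimated droppable tombstone ratio, or None if absent."""
    match = _RATIO.search(metadata_text)
    return float(match.group(1)) if match else None


def needs_attention(ratio: Optional[float], threshold: float = 0.2) -> bool:
    """True when the ratio exceeds the (assumed) threshold."""
    return ratio is not None and ratio > threshold


if __name__ == "__main__":
    ratio = droppable_ratio(SAMPLE)
    print(ratio, needs_attention(ratio))
```

Running this over every SSTable of a table shows which files hold most of the dead data and would benefit most from compaction.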

4. Misconfigured gc_grace_seconds Setting

  • Diagnosis: Check the gc_grace_seconds value for your table:
    SELECT gc_grace_seconds FROM system_schema.tables WHERE keyspace_name = '<keyspace_name>' AND table_name = '<table_name>';
    
    A very low value (e.g., 10 seconds or less) risks data resurrection; a very high value delays tombstone cleanup.
  • Cause: gc_grace_seconds is a safety mechanism. It determines how long Cassandra keeps a tombstone, whether it came from a delete or from TTL expiry, before compaction is allowed to drop it. The default is 10 days (864000 seconds). If the value is too low, tombstones can be removed before every replica has received the delete marker, so the deleted data can reappear (data resurrection). If it is too high, tombstones stay on disk longer than necessary. If it was intentionally lowered to speed up cleanup, it may now be too low for how often your cluster is repaired.
  • Fix: If gc_grace_seconds was set very low (e.g., < 1 hour), increase it to a value longer than your repair cycle, such as the default of 864000 (10 days).
    ALTER TABLE <keyspace_name>.<table_name> WITH gc_grace_seconds = 864000;
    
    If gc_grace_seconds is already high and tombstones are accumulating, the issue is likely with compaction performance or delete volume, not this setting itself.
  • Why it works: A sufficient gc_grace_seconds gives repair time to deliver each delete marker to all replicas before the tombstone becomes eligible for garbage collection, which prevents deleted data from reappearing. This delays cleanup, but it is crucial for consistency. A value that is too low does not itself cause tombstone accumulation; it causes data resurrection, a common related problem when tuning tombstone cleanup.
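
The effect of gc_grace_seconds on an explicit delete can be worked out with simple date arithmetic. The sketch below computes the earliest moment a tombstone may be dropped; the function name and the sample date are illustrative. Reaching that moment is necessary but not sufficient, because the tombstone is removed only when a compaction actually processes the SSTable that holds it.

```python
from datetime import datetime, timedelta, timezone


def earliest_purge_time(deleted_at: datetime,
                        gc_grace_seconds: int = 864000) -> datetime:
    """Earliest time compaction may drop a tombstone written at `deleted_at`."""
    return deleted_at + timedelta(seconds=gc_grace_seconds)


if __name__ == "__main__":
    deleted_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
    # Default of 10 days versus an aggressive 1 hour setting.
    print(earliest_purge_time(deleted_at))
    print(earliest_purge_time(deleted_at, gc_grace_seconds=3600))
```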

5. Under-provisioned Compaction Threads or I/O

  • Diagnosis: Monitor your system’s CPU and disk I/O. Use nodetool tpstats to check the CompactionExecutor thread pool. If threads are consistently busy or pending tasks are high, it indicates a bottleneck.
  • Cause: Compactions are I/O and CPU intensive. If your cluster is heavily loaded with writes and deletes, or if the underlying hardware (disks, CPUs) is slow, the compaction threads may not be able to keep up.
  • Fix:
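    • Increase Compaction Threads: Adjust concurrent_compactors in cassandra.yaml. By default Cassandra uses the smaller of the number of disks and the number of cores, with a minimum of 2 and a maximum of 8. Values such as 8 or 16 are appropriate only on hardware with enough cores and disk bandwidth, and all compactors share the compaction throughput limit, so raising one without the other may not help.
    • Improve Disk Performance: Ensure you are using fast SSDs, ideally NVMe, and that your RAID configuration is optimal for Cassandra.
    • Scale Out: Add more nodes to your cluster to distribute the load, including compactions.
    # cassandra.yaml
    concurrent_compactors: 16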
    
  • Why it works: More threads allow more compactions to run in parallel, and faster hardware reduces the time each compaction takes, improving the overall rate of SSTable merging and tombstone removal.
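
Before overriding the number of compaction threads, it helps to know what Cassandra would pick on its own. The sketch below reproduces the rule described in the cassandra.yaml comments for concurrent_compactors (the smaller of the disk and core counts, clamped to the range 2 to 8). It is an approximation of that documented rule, not Cassandra’s actual code, and the hardware figures are hypothetical.

```python
def default_concurrent_compactors(num_disks: int, num_cores: int) -> int:
    """Approximate default: min(disks, cores), clamped to between 2 and 8."""
    return min(8, max(2, min(num_disks, num_cores)))


if __name__ == "__main__":
    # (data disks, CPU cores) for three hypothetical nodes
    for disks, cores in [(1, 16), (4, 32), (12, 64)]:
        print(disks, cores, default_concurrent_compactors(disks, cores))
```

A node with a single data disk ends up with only 2 compactors regardless of core count, which is one reason an explicit setting is sometimes worth testing on fast SSDs.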

6. Unresolved Tombstones Due to Infrequent Repairs

  • Diagnosis: Find out when each node last completed a repair, for example from the output of your nodetool repair -pr runs or from a repair tool like Cassandra Reaper. Look for nodes that are consistently out of sync or whose repairs fail or take a very long time.
  • Cause: If nodes are down or network issues prevent nodetool repair from running regularly, some replicas might never receive a delete marker. Once gc_grace_seconds has passed, the replicas that did receive it can purge the tombstone, while the replica that missed it still holds the "live" data. That data is later read or repaired back onto the other nodes and has to be deleted again, which generates new tombstones. In addition, if the table’s compaction options set only_purge_repaired_tombstones to true, tombstones in SSTables that have not been repaired are never purged. This is less about accumulation and more about persistence of tombstones that should have been cleared.
  • Fix: Ensure regular, full anti-entropy repairs are performed across the cluster. Use a tool like Cassandra Reaper to automate and monitor repairs.
    # Example of running a full repair for a keyspace
    nodetool repair --full <keyspace_name>
    
    Or use Cassandra Reaper to schedule and manage repairs.
  • Why it works: Regular repairs ensure that all nodes have consistent data, including the correct tombstone markers, allowing compactions to eventually purge the deleted data across the entire cluster.
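
A quick sanity check for a repair schedule is to compare it against gc_grace_seconds. The sketch below uses a simplified worst case in which a delete issued just after one repair passes over its token range must wait a full interval plus the duration of the next run. The function name and the sample schedule are assumptions for illustration.

```python
DAY = 86400  # seconds


def repair_fits_gc_grace(repair_interval_s: int,
                         repair_duration_s: int,
                         gc_grace_seconds: int = 864000) -> bool:
    """True if the worst-case delay before a delete is repaired is below gc_grace."""
    return repair_interval_s + repair_duration_s < gc_grace_seconds


if __name__ == "__main__":
    # Weekly repairs that take about a day, with the default 10-day grace.
    print(repair_fits_gc_grace(7 * DAY, 1 * DAY))
    # Repairs every 10 days leave no safety margin.
    print(repair_fits_gc_grace(10 * DAY, 1 * DAY))
```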

After fixing tombstone accumulation, you might encounter ReadTimeoutException or WriteTimeoutException on specific partitions if they are particularly large or have a high number of tombstones that are still being processed.
