Cassandra doesn’t search every data file on disk to serve a read. For each file, it first consults a small in-memory probabilistic data structure, a Bloom filter, which lets it skip most files that cannot contain the requested key.

Let’s watch a read happen. Imagine we want to read the row with key 123 from the users table in the my_keyspace keyspace.

SELECT * FROM my_keyspace.users WHERE user_id = '123';

Here’s what’s going on under the hood:

  1. The Memtable: Cassandra first checks its in-memory structure, the Memtable. This is where recently written data resides before it’s flushed to disk. Finding the key 123 here is the cheapest part of the read, but in general it does not end the read: other fragments of the same partition may live in SSTables on disk, so Cassandra merges what it finds from all sources and lets the most recent write timestamp win.

  2. The Bloom Filter: Before Cassandra touches an SSTable (an immutable data file on disk), it checks that SSTable’s Bloom filter. This is a memory-resident, space-efficient probabilistic data structure, and every SSTable has its own. It answers one question: might this partition key be in this SSTable? If the Bloom filter says "no," the key is definitely not there and Cassandra skips that SSTable entirely. If it says "maybe," Cassandra has to look. This eliminates disk I/O for most of the SSTables that do not hold the key.

  3. SSTables - The Data Files: For each SSTable whose Bloom filter says the key might be present, Cassandra uses that SSTable’s other metadata files to locate the data:

    • Index File: This file maps each partition key in the SSTable, in sorted order, to the offset of its data. In the classic SSTable format, Cassandra keeps a sampled summary of this index in memory and binary-searches it to narrow down where the key’s index entry lives, then reads that entry to get the exact offset.
    • Data File: Using the offset found in the index file, Cassandra seeks to that specific location in the SSTable’s data file and reads the partition.

  4. The Partition Key vs. Clustering Columns: It’s crucial to understand that the Bloom filter and the index file are keyed by the partition key (which is user_id in our example). If your query only specifies the partition key, Cassandra can efficiently find the relevant partition. If your query also includes clustering columns (e.g., WHERE user_id = '123' AND signup_date = '2023-10-27'), Cassandra finds the partition first and then uses the clustering column values to locate the exact row inside it. Rows within a partition are stored sorted by their clustering columns, so this step narrows the read rather than examining every row.

  5. Tombstones: When you delete data, Cassandra doesn’t immediately remove it from SSTables. Instead, it writes a "tombstone" marker with its own timestamp. During a read, Cassandra must read these tombstones along with the live data and merge them, so that a deleted value in an older SSTable is not returned by mistake. This is why delete-heavy workloads can hurt read performance.

  6. Compaction: Cassandra periodically merges SSTables into new ones through a process called compaction. Compaction keeps only the newest version of each piece of data and purges tombstones, together with the data they shadow, once they are older than the table’s gc_grace_seconds. How well compaction keeps up directly affects how many SSTables Cassandra needs to check for a given read.
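The steps above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Cassandra's implementation: the names Cell, SSTable, read, and compact are hypothetical, each key maps to a single value instead of a full partition, an exact set stands in for the Bloom filter, and compaction drops every tombstone instead of waiting for gc_grace_seconds.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Cell:
    value: Optional[str]  # None stands in for a tombstone (assumption of this sketch)
    timestamp: int        # write time; the highest timestamp wins on read


class SSTable:
    """Immutable, sorted key -> Cell file, modelled in memory."""

    def __init__(self, rows: Dict[str, Cell]):
        self.index_keys: List[str] = sorted(rows)                   # role of the index file
        self.data: List[Cell] = [rows[k] for k in self.index_keys]  # role of the data file
        self.key_filter = set(rows)  # exact set standing in for the Bloom filter
        self.data_reads = 0          # counts simulated disk reads

    def might_contain(self, key: str) -> bool:
        return key in self.key_filter

    def get(self, key: str) -> Optional[Cell]:
        self.data_reads += 1
        pos = bisect_left(self.index_keys, key)  # binary search over the sorted keys
        if pos < len(self.index_keys) and self.index_keys[pos] == key:
            return self.data[pos]
        return None


def read(key: str, memtable: Dict[str, Cell], sstables: List[SSTable]) -> Optional[str]:
    """Collect every version of the key, then return the newest live value."""
    candidates = []
    if key in memtable:
        candidates.append(memtable[key])
    for sstable in sstables:
        if not sstable.might_contain(key):
            continue  # filter said "no": skip this SSTable without a disk read
        cell = sstable.get(key)
        if cell is not None:
            candidates.append(cell)
    if not candidates:
        return None
    newest = max(candidates, key=lambda cell: cell.timestamp)
    return newest.value  # None when the newest version is a tombstone


def compact(sstables: List[SSTable]) -> SSTable:
    """Merge SSTables, keeping the newest version of each key."""
    merged: Dict[str, Cell] = {}
    for sstable in sstables:
        for key, cell in zip(sstable.index_keys, sstable.data):
            if key not in merged or cell.timestamp > merged[key].timestamp:
                merged[key] = cell
    # Simplification: drops all tombstones at once, ignoring gc_grace_seconds.
    live = {key: cell for key, cell in merged.items() if cell.value is not None}
    return SSTable(live)


# Usage: key 123 has an old value on disk and a newer one in the Memtable;
# key 456 was written and later deleted.
older = SSTable({"123": Cell("alice@old.example", 1), "456": Cell("bob@example.com", 1)})
newer = SSTable({"456": Cell(None, 2)})
memtable = {"123": Cell("alice@new.example", 3)}
latest = read("123", memtable, [older, newer])
deleted = read("456", memtable, [older, newer])
```

The point of the sketch is the merge: a hit in the Memtable does not stop the read, and a tombstone in a newer SSTable hides a live value in an older one. After compact([older, newer]), a read touches one SSTable instead of two.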

The magic of Bloom filters is their ability to provide a "definitely not here" answer with a very small memory footprint. The trade-off is a small chance of a "false positive": the Bloom filter says a key might be there, but it actually isn’t. In that case Cassandra still performs the index lookup and disk seek and finds nothing. That is one wasted I/O, but far better than checking every SSTable on every read. The false-positive rate is tunable per table through the bloom_filter_fp_chance option: a lower value costs more memory and wastes fewer reads.
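To make the trade-off concrete, here is a minimal Bloom filter sketch in Python. It is an illustration only, not Cassandra's implementation: the class name, the SHA-256 based double hashing, and the parameter names are assumptions made for this example. The sizing uses the standard formulas for n expected keys and a target false-positive probability p: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions.

```python
import hashlib
import math


class BloomFilter:
    """Toy Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_keys: int, fp_chance: float):
        # Standard sizing formulas for n keys at false-positive probability p.
        self.num_bits = max(
            1, math.ceil(-expected_keys * math.log(fp_chance) / math.log(2) ** 2)
        )
        self.num_hashes = max(1, round(self.num_bits / expected_keys * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive k bit positions from two 64-bit halves of one digest
        # (double hashing; an assumption of this sketch).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


# Usage: load 1,000 keys, then probe 10,000 keys that were never added.
bloom = BloomFilter(expected_keys=1000, fp_chance=0.01)
for i in range(1000):
    bloom.add(f"user-{i}")

false_positives = sum(bloom.might_contain(f"absent-{i}") for i in range(10_000))
observed_fp_rate = false_positives / 10_000
```

Every key that was added always reports True, which is why a "no" can be trusted. The observed false-positive rate should land near the configured fp_chance, and the whole filter here occupies a little over a kilobyte.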

The next hurdle you’ll face is understanding how Cassandra handles data that spans multiple SSTables for the same partition key, and how the read repair mechanism ensures consistency across replicas.
