You’re probably thinking about designing your Cassandra tables like you would in a relational database: one table for users, one for orders, etc. That’s a common trap. Cassandra doesn’t do joins, and it doesn’t have a query optimizer in the traditional sense. Instead, you must design your tables to serve specific queries you know you’ll be running.

Let’s say we want to find all blog posts by a specific author, and also be able to retrieve a single post by its ID.

Here’s how you might think about it relationally:

-- users table
CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    username TEXT
);

-- posts table
CREATE TABLE posts (
    post_id UUID PRIMARY KEY,
    author_id UUID,
    title TEXT,
    body TEXT,
    created_at TIMESTAMP,
    FOREIGN KEY (author_id) REFERENCES users(user_id)
);

In Cassandra, this is a recipe for pain. You can’t SELECT * FROM posts WHERE author_id = ?.

Designing for Queries in Cassandra

Instead, we’ll create tables tailored to our expected queries.

Query 1: Get all posts by a specific author, ordered by creation time.

We’ll need a table where the author_id is part of the primary key, specifically as the partition key. This ensures all posts by a given author live on the same set of nodes. To get them ordered, we’ll make created_at the clustering key.

CREATE TABLE posts_by_author (
    author_id UUID,
    created_at TIMESTAMP,
    post_id UUID,
    title TEXT,
    body TEXT,
    PRIMARY KEY (author_id, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);

Now, to get posts by author_uuid = 'a1b2c3d4-e5f6-7890-1234-567890abcdef':

SELECT * FROM posts_by_author WHERE author_id = a1b2c3d4-e5f6-7890-1234-567890abcdef;

This query will be incredibly fast because all the relevant data is already co-located. The CLUSTERING ORDER BY (created_at DESC) ensures that the results are returned in descending order of created_at, so the newest posts appear first.

Query 2: Get a single post by its ID.

For this, we need a table where post_id is the partition key. This allows direct access to the specific row containing the post.

CREATE TABLE posts_by_id (
    post_id UUID PRIMARY KEY,
    author_id UUID,
    title TEXT,
    body TEXT,
    created_at TIMESTAMP
);

To get a post with post_id = 'f0e9d8c7-b6a5-4321-fedc-ba9876543210':

SELECT * FROM posts_by_id WHERE post_id = f0e9d8c7-b6a5-4321-fedc-ba9876543210;

This query will also be very efficient, as it directly targets the partition containing the desired post.

Data Duplication is Your Friend

Notice that we have duplicated the post data. posts_by_author contains all the post fields, and posts_by_id also contains all the post fields. This is intentional. In Cassandra, it’s far better to duplicate data and have multiple tables optimized for different queries than to try and normalize and perform joins. Writing data to multiple tables is an application-level concern. When a new post is created, your application code writes to both posts_by_author and posts_by_id.

What if we need to query posts by author and date range?

We can add another table, or more commonly, add a secondary index. However, secondary indexes in Cassandra have performance implications and should be used judiciously, especially on high-cardinality columns. For this specific case, a better approach is to create a table that partitions by author and clusters by year/month/day.

CREATE TABLE posts_by_author_and_date (
    author_id UUID,
    post_date DATE, -- Or derive from timestamp and store as TEXT YYYY-MM-DD
    post_id UUID,
    title TEXT,
    body TEXT,
    PRIMARY KEY (author_id, post_date)
) WITH CLUSTERING ORDER BY (post_date DESC);

Then, to get posts by author_uuid = 'a1b2c3d4-e5f6-7890-1234-567890abcdef' in October 2023:

SELECT * FROM posts_by_author_and_date
WHERE author_id = a1b2c3d4-e5f6-7890-1234-567890abcdef
  AND post_date >= '2023-10-01' AND post_date <= '2023-10-31';

This query efficiently retrieves data because the partition key (author_id) is specified, and the clustering keys (post_date) allow for range scans.

The core principle is to anticipate your read patterns. For every query you expect to run, create a table (or a denormalized view) that is optimized for that specific query. This often means repeating data, but the performance gains for reads are substantial.

When you start thinking about your data model, imagine the SELECT statements you’ll write. Then, build your tables to satisfy those statements directly.

The next challenge is handling updates and deletions across these denormalized tables consistently.

Want structured learning?

Take the full Cassandra course →