Nullable columns in ClickHouse can quietly degrade query performance: every Nullable column carries a separate null mask that the engine has to store, read, and process alongside the values.

Let’s see this in action. Imagine a simple table:

CREATE TABLE events (
    event_date Date,
    user_id UInt64,
    event_type String,
    details Nullable(String)
) ENGINE = MergeTree()
ORDER BY (event_date, user_id);

And some data:

INSERT INTO events VALUES ('2023-10-26', 123, 'login', 'Success');
INSERT INTO events VALUES ('2023-10-26', 456, 'click', NULL);
INSERT INTO events VALUES ('2023-10-27', 123, 'logout', 'Session ended');
INSERT INTO events VALUES ('2023-10-27', 789, 'view', NULL);

Now, consider a query that filters on details:

SELECT count()
FROM events
WHERE details = 'Success';

For a plain String column, ClickHouse reads the column data and evaluates the predicate directly. With details Nullable(String), it doesn’t just read the string data. It also has to read a separate null mask, stored in its own file alongside the values, and combine that mask with the comparison result for every row it scans. Queries that never reference details do not pay this read cost, because ClickHouse only reads the columns a query actually uses.

The core problem is that ClickHouse’s columnar storage is optimized for dense, uniform columns of values. A Nullable column is stored as two streams: the values themselves, with a type default such as '' or 0 written wherever the row is NULL, and a null mask with one flag per row. Both streams are contiguous, but the mask is an extra file to write, compress, merge, and read, and functions and comparisons have to combine it with the values to produce a correct result. This adds overhead to both reads and writes.
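One way to see the overhead for yourself is to load the same data into a Nullable and a non-nullable column and compare. The sketch below is an illustration only: the table names t_plain and t_nullable, the row count, and the 10% NULL ratio are all assumptions, and the timings you get will depend on your hardware and ClickHouse version.

```sql
-- Hypothetical comparison tables, not part of the events schema above
CREATE TABLE t_plain (v UInt64) ENGINE = MergeTree() ORDER BY tuple();
CREATE TABLE t_nullable (v Nullable(UInt64)) ENGINE = MergeTree() ORDER BY tuple();

-- Same values in both tables; every tenth row is 0 in one and NULL in the other
INSERT INTO t_plain SELECT if(number % 10 = 0, 0, number) FROM numbers(10000000);
INSERT INTO t_nullable SELECT if(number % 10 = 0, NULL, number) FROM numbers(10000000);

-- Run both and compare elapsed time and bytes read in the client output
SELECT sum(v) FROM t_plain;
SELECT sum(v) FROM t_nullable;

-- Compare the on-disk footprint of the two columns
SELECT table, name, type, data_compressed_bytes, data_uncompressed_bytes
FROM system.columns
WHERE database = currentDatabase() AND table IN ('t_plain', 't_nullable');
```

Both sums return the same value, since sum skips NULLs and the sentinel here is 0, so any difference you observe comes from how the column is stored and processed.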

The MergeTree engine, which is the workhorse for most ClickHouse deployments, relies heavily on ordered data and efficient scanning. By default ClickHouse rejects Nullable columns in the sorting key; they are accepted only when the allow_nullable_key setting is enabled. When Nullable columns do end up in ORDER BY keys or PARTITION BY clauses, the sorting and merging processes become more complex and less efficient, because ClickHouse has to manage the ordering of NULLs (grouped at the beginning or end, but still requiring explicit handling) alongside the actual data. This means more CPU cycles spent during data ingestion and merging, and slower reads because the data isn’t laid out as cleanly.
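For reference, this is roughly what a Nullable sorting key looks like. It is a sketch of the pattern being warned against, not a recommendation; the table name events_nullable_key is an assumption, and without the allow_nullable_key setting the CREATE TABLE is expected to be rejected.

```sql
-- Hypothetical table with a Nullable column in the sorting key.
-- allow_nullable_key is a MergeTree setting and is disabled by default.
CREATE TABLE events_nullable_key (
    event_date Date,
    user_id UInt64,
    event_type String,
    details Nullable(String)
) ENGINE = MergeTree()
ORDER BY (event_date, details)
SETTINGS allow_nullable_key = 1;
```

Having to opt in explicitly is a reasonable hint that a non-nullable key column with a sentinel value is usually the better design.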

Consider a query that sorts its result by a Nullable column, for example ORDER BY details. ClickHouse must consult the null mask for every row to determine its position in the sorted output: by default NULLs are placed after all other values, and NULLS FIRST reverses that. This isn’t just about scanning; it’s about actively constructing the sorted result set, where the presence or absence of data in a Nullable column dictates its placement.
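Against the events table defined earlier, the two placements can be compared directly. This is a minimal sketch using the sample rows from this article:

```sql
-- Default behaviour: rows with NULL details come after all non-NULL values
SELECT user_id, details
FROM events
ORDER BY details;

-- Explicit modifier: rows with NULL details come first
SELECT user_id, details
FROM events
ORDER BY details NULLS FIRST;
```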

The most insidious aspect is that queries not explicitly filtering on the Nullable column can still be impacted. If the Nullable column is part of the table’s sorting key (the ORDER BY clause in MergeTree, which requires allow_nullable_key), ClickHouse must still manage its nullability whenever it sorts and merges parts, even if your queries only select other columns. This means the overhead is baked into the data layout itself.

The general recommendation is to avoid Nullable types whenever possible. If a field is truly optional, consider using a sentinel value that represents "not present" but is still a valid value of the type. For strings, an empty string '' can often suffice. For numbers, 0 or (for signed types) -1 might work, depending on your data’s semantics. The key is to replace NULL with a concrete value that ClickHouse can treat as any other data point, so the column is stored and read without a null mask.
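One semantic difference to keep in mind when picking a sentinel: aggregate functions skip NULLs but treat a sentinel as ordinary data. The sketch below uses made-up inline values and a hypothetical latency_ms name purely for illustration.

```sql
-- NULL is ignored by avg: (10 + 20) / 2
SELECT avg(latency_ms) FROM (SELECT arrayJoin([10, 20, NULL]) AS latency_ms);  -- 15

-- A 0 sentinel is counted as a real value: (10 + 20 + 0) / 3
SELECT avg(latency_ms) FROM (SELECT arrayJoin([10, 20, 0]) AS latency_ms);     -- 10

-- Excluding the sentinel explicitly restores the NULL-like behaviour
SELECT avgIf(latency_ms, latency_ms != 0)
FROM (SELECT arrayJoin([10, 20, 0]) AS latency_ms);                            -- 15
```

If queries like this are common in your workload, choose a sentinel that can never be a legitimate value, or filter it out with the -If combinator as shown.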

If you absolutely must represent a missing value, and a sentinel value is not semantically appropriate, it’s often better to use a separate boolean column to indicate presence, e.g., details String and details_present UInt8 (where 0 means not present, 1 means present and details contains data). This adds a column but keeps the primary data column non-nullable and therefore more performant.
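A minimal sketch of the presence-flag layout, assuming a hypothetical events_flagged table populated from the original events table that still has the Nullable details column:

```sql
CREATE TABLE events_flagged (
    event_date Date,
    user_id UInt64,
    event_type String,
    details String,
    details_present UInt8  -- 0 = no details recorded, 1 = details holds real data
) ENGINE = MergeTree()
ORDER BY (event_date, user_id);

-- ifNull supplies the placeholder value, isNotNull yields the 0/1 flag
INSERT INTO events_flagged
SELECT event_date, user_id, event_type, ifNull(details, ''), isNotNull(details)
FROM events;

-- "Has details" becomes a plain integer comparison
SELECT count() FROM events_flagged WHERE details_present = 1;
```

Unlike the empty-string sentinel alone, this keeps a genuinely empty details value distinguishable from a missing one.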

The ultimate fix is to refactor your schema. For the events table, if details being NULL simply means there are no details to log for a 'click' event, you could change the schema:

CREATE TABLE events (
    event_date Date,
    user_id UInt64,
    event_type String,
    details String DEFAULT '' -- Use empty string as default for no details
) ENGINE = MergeTree()
ORDER BY (event_date, user_id);

Then, when inserting, you’d ensure NULLs are translated:

INSERT INTO events VALUES ('2023-10-26', 456, 'click', ''); -- Explicitly insert empty string

Or, if you are inserting from another source that might produce NULL, use COALESCE (or the equivalent ifNull) in your ingestion query:

INSERT INTO events (event_date, user_id, event_type, details)
SELECT event_date, user_id, event_type, COALESCE(details, '')
FROM source_table;

This change ensures the details column is stored as plain String data with no null mask, eliminating the nullability checks and reducing the cost of queries that read this column.
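If the events table already exists with the Nullable column, one way to get to the new schema is to build a second table and swap it in. This is a sketch under assumptions: the names events_v2 and events_old are hypothetical, and writes to events are assumed to be paused while the copy runs.

```sql
-- New table with the non-nullable column
CREATE TABLE events_v2 (
    event_date Date,
    user_id UInt64,
    event_type String,
    details String DEFAULT ''
) ENGINE = MergeTree()
ORDER BY (event_date, user_id);

-- Copy the data, translating NULL to the empty-string sentinel
INSERT INTO events_v2
SELECT event_date, user_id, event_type, ifNull(details, '')
FROM events;

-- Sanity check: both counts should match before swapping
SELECT
    (SELECT count() FROM events) AS old_rows,
    (SELECT count() FROM events_v2) AS new_rows;

-- Swap the tables; keep the old one until the new one is verified
RENAME TABLE events TO events_old, events_v2 TO events;
```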

The next error you’ll likely encounter after fixing this is "Memory limit exceeded" during complex aggregations. It is unrelated to nullability: it appears when an aggregation’s intermediate state grows beyond the memory limit configured for the query.
