ClickHouse compression can reduce your storage footprint by up to 60%, but picking the wrong codec can actually increase CPU usage and slow down your queries.
Let’s see this in action. Imagine we have a table events storing user activity:
CREATE TABLE events (
event_timestamp DateTime,
user_id UInt64,
event_type String,
payload String
) ENGINE = MergeTree()
ORDER BY event_timestamp;
Without any explicit codec, ClickHouse defaults to LZ4. Let’s insert some data and check its size:
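-- Insert 1 million rows of example data.
-- Each rand(N) call gets a different argument so ClickHouse does not collapse
-- the calls into one shared value per row.
INSERT INTO events (event_timestamp, user_id, event_type, payload)
SELECT
now() - rand(1) % 1000000,
rand(2) % 10000000,
'event_' || toString(rand(3) % 100),
repeat('x', rand(4) % 1024)
FROM numbers(1000000);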
Now, let’s see the uncompressed and compressed sizes of the payload column. We can use system.columns for this:
SELECT
name AS column_name,
data_uncompressed_bytes,
data_compressed_bytes
FROM system.columns
WHERE database = currentDatabase() AND table = 'events' AND name = 'payload';
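To compare every column at once rather than just payload, a query along these lines may help. It is a sketch that relies only on the documented system.columns fields; the ratio alias is just an illustrative name.

```sql
-- Per-column sizes and compression ratio for the events table.
SELECT
    name,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    formatReadableSize(data_compressed_bytes) AS compressed,
    -- nullIf avoids a division by zero for columns that hold no data yet
    round(data_uncompressed_bytes / nullIf(data_compressed_bytes, 0), 2) AS ratio
FROM system.columns
WHERE database = currentDatabase() AND table = 'events'
ORDER BY data_compressed_bytes DESC;
```

Columns near the top of the result with a low ratio are usually the first candidates for a different codec.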
You’ll see data_compressed_bytes is significantly smaller than data_uncompressed_bytes. Now, let’s explicitly set a different codec. We’ll alter the table:
ALTER TABLE events MODIFY COLUMN payload String CODEC(ZSTD(3));
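The new codec applies to parts written from now on. Existing parts keep their old compression until they are merged, so the change reaches old data gradually through background merges. To rewrite everything right away you can run OPTIMIZE TABLE events FINAL, which is expensive on large tables. Once the parts have been rewritten, querying system.columns again will likely show a smaller data_compressed_bytes for payload.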
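If you want to see the effect of the codec change immediately on this small test table, one option is to force a merge and then re-check the column. This is a sketch for experimentation, not something to run casually on a large production table.

```sql
-- Force all parts of the table to be rewritten with the current codecs.
OPTIMIZE TABLE events FINAL;

-- Confirm which codec is declared and how large the column is now.
SELECT
    name,
    compression_codec,
    data_uncompressed_bytes,
    data_compressed_bytes
FROM system.columns
WHERE database = currentDatabase() AND table = 'events' AND name = 'payload';
```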
The core problem ClickHouse compression solves is the vast amount of disk space modern applications consume. Storing raw, uncompressed data for billions of events or logs is prohibitively expensive. Compression allows you to fit more data on disk, which in turn can mean:
- Lower infrastructure costs: Less storage hardware, fewer disks.
- Faster I/O: Reading less data from disk is inherently faster, even with decompression overhead.
- Reduced network traffic: When data is moved between nodes or to analytical tools.
ClickHouse uses codecs on a per-column basis. This is crucial because different data types and patterns compress differently. A UInt64 column with sequential IDs will compress very differently than a String column containing JSON payloads.
Here are the most common codecs and their general use cases:
- LZ4: The default. Fast compression and decompression, good for general-purpose use where CPU is a concern and moderate compression is acceptable. It’s a good balance.
- ZSTD(level): Generally provides better compression ratios than LZ4 at the cost of higher CPU usage during compression and decompression. The level (1-22) controls the trade-off. ZSTD(3) is often a sweet spot, offering significant compression with manageable CPU.
- Delta + (Codec): The Delta codec stores the difference between consecutive values. This is extremely effective for time-series data or any column with a strong sequential pattern. It’s often combined with another codec like LZ4 or ZSTD to compress the resulting deltas. For example, CODEC(Delta(2), LZ4), where the optional argument is the width in bytes of the values being differenced.
- DoubleDelta + (Codec): Similar to Delta, but stores differences of differences. Useful for data where the rate of change itself is relatively constant, such as timestamps arriving at a fixed interval.
- T64 + (Codec): A specialized codec for integer and date/time columns that drops the high bits left unused within each block of values. Good for columns whose values occupy a small range relative to the width of their type.
- Gorilla + (Codec): Optimized for floating-point time-series data. It stores the XOR of consecutive values, so it works best when the values change slowly.
The mental model is that each column is compressed independently. You can even have a mix of codecs within a single table. When data is written, ClickHouse applies the specified codec. When data is read, it’s decompressed. The key is to match the codec to the data’s characteristics.
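As an illustration of mixing codecs in one table, here is a minimal sketch. The table sensor_readings_example and its columns are hypothetical and exist only to show the syntax; the codec choices are reasonable starting guesses, not measured results.

```sql
-- Hypothetical table: each column gets a codec matched to its data shape.
CREATE TABLE sensor_readings_example
(
    reading_time DateTime CODEC(DoubleDelta, LZ4), -- regular intervals
    sensor_id    UInt32   CODEC(T64, LZ4),         -- small range of integers
    temperature  Float64  CODEC(Gorilla, LZ4),     -- slowly changing floats
    note         String   CODEC(ZSTD(3))           -- free-form text
)
ENGINE = MergeTree()
ORDER BY (sensor_id, reading_time);
```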
Consider a column storing user IDs (UInt64) which are often assigned sequentially or in small batches. Delta encoding would be very effective here.
ALTER TABLE events MODIFY COLUMN user_id UInt64 CODEC(Delta, LZ4);
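This tells ClickHouse to first calculate the difference between consecutive user_id values in storage order, and then compress those differences using LZ4. This can drastically reduce the size of the user_id column when neighboring rows hold nearby IDs. Because events is sorted by event_timestamp, the gain depends on how IDs are distributed over time; on the random IDs generated above, expect little or no improvement.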
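To see what Delta does on genuinely sequential data, a small side-by-side experiment can help. The two tables below are hypothetical scratch tables, named only for this sketch.

```sql
-- Same sequential data, with and without Delta in front of LZ4.
CREATE TABLE seq_lz4_example (id UInt64 CODEC(LZ4))
ENGINE = MergeTree() ORDER BY id;

CREATE TABLE seq_delta_example (id UInt64 CODEC(Delta, LZ4))
ENGINE = MergeTree() ORDER BY id;

INSERT INTO seq_lz4_example SELECT number FROM numbers(1000000);
INSERT INTO seq_delta_example SELECT number FROM numbers(1000000);

-- Compare the on-disk size of the id column in both tables.
SELECT table, name, data_uncompressed_bytes, data_compressed_bytes
FROM system.columns
WHERE database = currentDatabase()
  AND table IN ('seq_lz4_example', 'seq_delta_example');
```

Repeating the experiment with random values instead of number shows the opposite case, where Delta has no pattern to exploit.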
What if you have a column that is already compressed, like a JSON string that you’ve pre-compressed before inserting? Applying another compression codec might not yield much benefit and could even increase CPU. In such cases, you might use CODEC(NONE) or just rely on the default.
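A minimal sketch of that case, assuming a hypothetical column payload_precompressed that receives data your application has already compressed:

```sql
-- Hypothetical column for pre-compressed blobs: skip ClickHouse compression.
ALTER TABLE events ADD COLUMN payload_precompressed String CODEC(NONE);
```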
The real magic happens when you combine codecs. For a DateTime column that’s always increasing, Delta is usually fantastic.
ALTER TABLE events MODIFY COLUMN event_timestamp DateTime CODEC(Delta, ZSTD(1));
Here, we’re using Delta to capture the time differences and then ZSTD(1) for a quick, efficient compression of those deltas.
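The idea behind Delta and DoubleDelta can be illustrated with plain SQL. The sketch below uses arrayDifference on made-up timestamps spaced 60 seconds apart; it mimics the arithmetic only and is not how the codecs are implemented internally.

```sql
SELECT
    [1000, 1060, 1120, 1180] AS raw_values,
    -- first-order differences: [0, 60, 60, 60]
    arrayDifference(raw_values) AS deltas,
    -- differences of differences: [0, 60, 0, 0]
    arrayDifference(deltas) AS double_deltas;
```

Small, repetitive numbers like these are much easier for LZ4 or ZSTD to compress than the original values.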
The most surprising thing about ClickHouse codecs is how much they can impact query performance beyond just disk I/O. While better compression means reading less data, the decompression cost is real. A very high compression level for ZSTD might save disk space but could saturate your CPU cores during queries, making them slower than if you had used a less aggressive codec like LZ4 or ZSTD(3). The Delta and DoubleDelta codecs, by transforming the data, can sometimes make subsequent compression much more effective than the base codec alone could achieve.
To truly optimize, you need to analyze your data and test. Use system.columns to inspect data_compressed_bytes and data_uncompressed_bytes. Then use system.query_log to monitor query durations and per-query CPU time (for example, the UserTimeMicroseconds profile event), and system.metrics or system.asynchronous_metrics to observe overall server load after applying changes. Experiment with different codecs and levels on representative columns. For many string-heavy analytical tables, ZSTD(3) or ZSTD(5) on string columns and Delta + LZ4 or Delta + ZSTD(1) on numerical/timestamp columns is a very strong starting point.
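One way to compare query cost before and after a codec change is to pull recent queries from system.query_log. This sketch assumes query logging is enabled and a reasonably recent ClickHouse version where ProfileEvents is a Map column; the ILIKE filter is only a rough way to find queries that touch events.

```sql
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    formatReadableSize(read_bytes) AS read_size,
    -- CPU time spent in user mode by this query, in microseconds
    ProfileEvents['UserTimeMicroseconds'] AS cpu_user_us,
    substring(query, 1, 80) AS query_head
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_date = today()
  AND query ILIKE '%FROM events%'
ORDER BY event_time DESC
LIMIT 20;
```

Run the same workload under each codec and compare query_duration_ms alongside cpu_user_us, since a smaller column on disk does not guarantee a faster query. Note that this monitoring query matches its own filter, so it will show up in later results.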
Once you’ve mastered column-level compression, you’ll want to explore data skipping techniques.