ClickHouse doesn’t actually store "high-cardinality columns" in a way that fundamentally differs from low-cardinality ones; the problem is how you query them and how their data types interact with memory during those queries.

Let’s watch a table with a high-cardinality user_id column ingest data and then try to do a simple COUNT(DISTINCT user_id) – a query that will OOM if not handled.

-- Create a table with a variety of data types, including a high-cardinality string
CREATE TABLE user_activity (
    event_time DateTime,
    user_id String,
    activity String,
    value UInt64
) ENGINE = MergeTree()
ORDER BY event_time;

-- Insert some data. user_id will have millions of unique values.
INSERT INTO user_activity (event_time, user_id, activity, value)
SELECT
    now() - rand() % 86400,
    'user_' || toString(rand() % 10000000), -- 10 million potential unique user_ids
    if(rand() % 2, 'click', 'view'),
    rand() % 1000
FROM numbers(10000000); -- 10 million rows

-- Now, let's try the problematic query. This is likely to OOM on a moderately sized server.
-- SELECT count(DISTINCT user_id) FROM user_activity;

The COUNT(DISTINCT ...) operation, especially on String types, is the primary culprit for OOM errors when dealing with high cardinality. ClickHouse, by default, tries to build a hash set in memory to track unique values. For a String column with millions of unique entries, this hash set can quickly consume gigabytes of RAM. Each string’s actual data also needs to be stored or referenced within the set.

Here are the common causes and their fixes:

1. Naive COUNT(DISTINCT ...) on String Type

  • Diagnosis: The query itself. Running SELECT count(DISTINCT user_id) FROM user_activity; directly on a large table with many unique user_ids.
  • Cause: COUNT(DISTINCT) on a String column requires building a hash set in memory where each element is the actual string value. For 10 million unique strings, this is a huge memory footprint.
  • Fix: Use count(DISTINCT if you must have the exact count, but prefer count(DISTINCT cityHash64(user_id)) or other hash functions for approximate counts or if you only need to count unique hashes. For exact counts, consider aggregation methods that spill to disk or use specialized data structures.
    -- Approximate count (much faster, much less memory)
    SELECT count(DISTINCT cityHash64(user_id)) FROM user_activity;
    
    This works because cityHash64 produces a UInt64 hash value for each user_id. The hash set for UInt64 is vastly more memory-efficient than for arbitrary strings. The trade-off is that it’s an approximation – collisions are possible but rare for this many distinct values.
  • Fix (Exact Count with Memory Control): If you need an exact count and cannot tolerate approximation, you can use SETTINGS to control memory usage, but this can lead to slower queries or even failures if memory limits are too strict.
    SELECT count(DISTINCT user_id)
    FROM user_activity
    SETTINGS max_block_size = 100000, max_memory_usage = 10000000000; -- Example: 10GB max memory
    
    This tells ClickHouse to use up to 10GB of RAM. The max_block_size can also influence how data is processed in chunks.

2. Using String for IDs Instead of UInt64 or UUID

  • Diagnosis: Inspecting the table schema with DESCRIBE TABLE user_activity; and seeing user_id String.
  • Cause: Storing identifiers as String incurs overhead. Each string needs to be stored and compared, which is less efficient than fixed-size integer types or UUIDs for hashing and comparisons.
  • Fix: If possible, change the data type to UInt64, UUID, or an Enum type during table creation or by creating a new table and migrating data.
    -- If you can rebuild the table:
    CREATE TABLE user_activity_optimized (
        event_time DateTime,
        user_id UInt64, -- Changed from String
        activity String,
        value UInt64
    ) ENGINE = MergeTree()
    ORDER BY event_time;
    
    -- Then insert data, converting user_id if necessary:
    INSERT INTO user_activity_optimized (event_time, user_id, activity, value)
    SELECT
        event_time,
        (splitByChar('_', user_id)[2]).toUInt64(), -- Convert string ID to UInt64
        activity,
        value
    FROM user_activity;
    
    -- Now COUNT(DISTINCT user_id) is much more efficient
    SELECT count(DISTINCT user_id) FROM user_activity_optimized;
    
    This works because UInt64 is a fixed-size, highly optimizable type for hashing and set operations, consuming significantly less memory and CPU.

3. Inefficient Joins with High-Cardinality Keys

  • Diagnosis: Queries involving JOIN operations where one of the join keys is a high-cardinality String column.
  • Cause: When joining on a high-cardinality String, ClickHouse might materialize a large hash table for one of the tables in memory. If the String column is the join key, the memory required for the hash table can explode.
  • Fix: Ensure that join keys are of efficient types (UInt64, UUID). If joining on String is unavoidable, use GLOBAL JOIN judiciously, or consider pre-aggregating or hashing the key if the exact string value isn’t needed for the join logic.
    -- Example: Suppose you have a user_details table and join on user_id
    -- CREATE TABLE user_details (user_id String, name String) ENGINE = MergeTree() ORDER BY user_id;
    -- SELECT t1.activity, t2.name
    -- FROM user_activity t1
    -- JOIN user_details t2 ON t1.user_id = t2.user_id;
    
    -- If user_details.user_id is String, this can OOM.
    -- Fix: If user_details.user_id can be UInt64, change it.
    -- If not, consider hashing for the join (if name isn't strictly needed):
    -- SELECT t1.activity, t2.name
    -- FROM user_activity t1
    -- JOIN user_details t2 ON cityHash64(t1.user_id) = cityHash64(t2.user_id);
    
    Hashing the join key allows ClickHouse to build a hash table on UInt64 values, which is far more memory-efficient.

4. Large Data Blocks Being Processed

  • Diagnosis: Observing high memory usage during query execution, especially for aggregations or distinct operations, even with optimized data types.
  • Cause: ClickHouse processes data in blocks. If max_block_size is very large or if intermediate results in a complex query generate massive blocks, memory can be exhausted before data can be flushed or processed further.
  • Fix: Adjust max_block_size and max_memory_usage settings for the query or session.
    SELECT count(DISTINCT user_id)
    FROM user_activity
    SETTINGS max_block_size = 65536, max_memory_usage = 8000000000; -- Example: 8GB max memory
    
    This explicitly limits the memory ClickHouse can use for this query and controls the size of data chunks it processes, allowing for more granular memory management.

5. Insufficient max_memory_usage Setting for Server/User

  • Diagnosis: Seeing OOM errors even on queries that should be manageable with optimized types, or seeing the system.processes table show queries with memory_usage close to system limits.
  • Cause: The global or user-level max_memory_usage setting in ClickHouse is too low to accommodate the operation, even if the data types are optimal. This is common on servers with limited RAM or when multiple heavy queries run concurrently.
  • Fix: Increase the max_memory_usage setting in your users.xml configuration or through SET commands for your session.
    <!-- In users.xml -->
    <user>
        <name>my_user</name>
        <networks>
            <ip>::/0</ip>
        </networks>
        <profile>default</profile>
        <quota>default</quota>
        <max_memory_usage>15000000000</max_memory_usage> <!-- 15GB -->
    </user>
    
    Or in a session:
    SET max_memory_usage = 15000000000;
    SELECT count(DISTINCT user_id) FROM user_activity;
    
    This allows ClickHouse to allocate more RAM for its operations, preventing OOMs for legitimate, but memory-intensive, tasks.

After fixing the String user_id to UInt64 and using count(DISTINCT user_id) (which is now efficient), the next error you might hit is related to query timeouts if the data volume is still too large for the query to complete within acceptable limits, or perhaps running into max_execution_time.

Want structured learning?

Take the full Clickhouse course →