Profile ClickHouse Memory Usage to Prevent OOM Kills (2026)

ClickHouse can appear to consume an exorbitant amount of RAM, often leading to OOM kills, but its memory management is more nuanced than a simple leak.

Let’s see it in action. Imagine we’re running a fairly standard ClickHouse setup and we start seeing OOMs. We can poke around using clickhouse-client.

SELECT
    event,
    value
FROM system.events
WHERE event = 'MemoryException' OR event LIKE '%MemoryLimit%' OR event LIKE '%OOM%'
ORDER BY event;

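This query tells us whether the server is actively complaining about memory. By default system.events only lists counters that have fired at least once since startup, so any row returned here (for example QueryMemoryLimitExceeded) is a strong hint. Now, let’s dive into the memory consumers themselves.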

SELECT
    query_id,
    memory_usage,
    peak_memory_usage
FROM system.processes
ORDER BY memory_usage DESC
LIMIT 10;

This gives us a snapshot of what’s using memory right now within active queries. But the real culprit is often not in system.processes, because a row disappears as soon as its query finishes or is killed. We need to look at ClickHouse’s internal memory accounting.
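Because finished queries vanish from system.processes, one way to catch a short-lived memory hog is the query log. The sketch below assumes query logging is enabled and system.query_log is populated on your server; the one-day window and the row limit are arbitrary choices.

```sql
-- Heaviest completed or failed queries over the last day, by memory used.
-- Assumes system.query_log is populated (query logging enabled).
SELECT
    event_time,
    type,
    query_id,
    formatReadableSize(memory_usage) AS memory,
    query_duration_ms,
    substring(query, 1, 120) AS query_head
FROM system.query_log
WHERE type IN ('QueryFinish', 'ExceptionWhileProcessing')
  AND event_date >= today() - 1
ORDER BY memory_usage DESC
LIMIT 10;
```

Rows of type ExceptionWhileProcessing are worth a close look, since queries stopped by a memory limit end up there.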

SELECT
    metric,
    value
FROM system.metrics
WHERE metric LIKE '%Memory%'
ORDER BY value DESC
LIMIT 20;

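The system.metrics table is our primary tool. It shows a more persistent view of memory usage across different ClickHouse components. Metric names change between ClickHouse versions, so treat the names below as categories of memory consumers and read the description column of system.metrics to confirm what your server actually reports. Here are the most common things to look for that can cause OOMs: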

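GlobalThread: Memory allocated by threads that aren’t tied to a specific query. It’s often dominated by background merges, mutations, and dictionary loading. In system.metrics, GlobalThread itself counts threads in the global thread pool rather than bytes, so use it as a pointer to background activity, not as a size.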
- Diagnosis: Check system.merges and system.mutations for excessive activity. Look at system.dictionaries for large loaded dictionaries.
- Fix:
  - For merges: Lower background_pool_size in config.xml (older releases read it from the default profile in users.xml). It caps how many background merges and mutations run at once, and each one holds memory while it runs. Lowering it trades merge throughput for a smaller background footprint, so watch the part count afterwards.
  - For mutations: Batch or postpone mutations that are not critical. ALTER ... UPDATE/DELETE operations rewrite the affected parts and can be memory-intensive.
  - For dictionaries: Optimize dictionary loading. If a dictionary is too large, consider reducing its scope, pre-processing it, or switching to a more memory-frugal layout (e.g., SPARSE_HASHED, or CACHE / DIRECT so the full dictionary is not held in RAM). Use bytes_allocated in system.dictionaries to see what each dictionary costs.
- Why it works: Reduces the number of concurrent memory-allocating background tasks or limits the memory available to specific resource-intensive features.
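The diagnosis step above can be sketched as three queries against the system tables it names. Column names are those found in recent ClickHouse releases; if one is missing on your version, DESCRIBE TABLE on the system table will show what is available.

```sql
-- Running merges (and mutations executed as merges), heaviest first.
SELECT
    database,
    table,
    is_mutation,
    round(elapsed) AS elapsed_s,
    round(progress, 2) AS progress,
    formatReadableSize(memory_usage) AS memory
FROM system.merges
ORDER BY memory_usage DESC;

-- Mutations that have not finished yet.
SELECT database, table, mutation_id, command, parts_to_do, latest_fail_reason
FROM system.mutations
WHERE is_done = 0;

-- Loaded dictionaries by RAM footprint.
SELECT
    database,
    name,
    type,
    status,
    element_count,
    formatReadableSize(bytes_allocated) AS memory
FROM system.dictionaries
ORDER BY bytes_allocated DESC;
```

A long list of unfinished mutations, or one dictionary dwarfing the rest, usually tells you which of the three fixes to try first.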

QueryThread: Memory used by query execution threads. This is often the most visible consumer during active queries. The QueryThread metric itself counts threads; the per-query byte counts are in system.processes.
- Diagnosis: Examine system.processes for queries with a large memory_usage. Look at the query text to understand what it’s doing (e.g., large GROUP BY, ARRAY JOIN, ORDER BY on columns outside the table’s sorting key).
- Fix:
  - Set max_memory_usage in users.xml for specific users or globally. This is a hard limit per query. Example: max_memory_usage = 10000000000 (10 GB).
  - Optimize queries. Avoid large aggregations without pre-aggregation, use LIMIT where appropriate, and ensure data is sorted for ORDER BY clauses.
  - Tune max_threads. More threads let CPU-bound queries finish faster and release memory sooner, but each thread holds its own buffers and intermediate state, so peak memory per query rises. Raise it only when memory is available; lower it when a query is close to its limit.
- Why it works: Enforces a hard limit on how much memory a single query can consume, preventing runaway queries from taking down the server.
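For quick experiments, the same limit can be applied from clickhouse-client without editing users.xml. This is a minimal sketch: the 10 GB value mirrors the example above, while max_threads = 4 and the numbers() test query are arbitrary choices for illustration.

```sql
-- Cap every query in the current session at 10 GB.
SET max_memory_usage = 10000000000;

-- Or cap a single statement without touching the session.
SELECT number % 1000 AS k, count() AS c
FROM numbers(100000000)
GROUP BY k
SETTINGS max_memory_usage = 10000000000, max_threads = 4;

-- Confirm what is in effect for this session.
SELECT name, value, changed
FROM system.settings
WHERE name IN ('max_memory_usage', 'max_threads');
```

Session-level SET is handy for finding a workable value; once found, it belongs in a settings profile so it survives reconnects.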

MergeTreeData: Memory used by the MergeTree engine for data parts: primary key indexes held in memory, plus the mark cache and the uncompressed cache for column data.
- Diagnosis: High MergeTreeData usage can indicate that many large data parts are being actively read. Check system.parts for tables with many parts or large part sizes.
- Fix:
  - Cap the caches with mark_cache_size and uncompressed_cache_size in config.xml. These bound the memory used by MergeTree caches across the server. Example: mark_cache_size = 5368709120 (5 GiB).
  - Review table partitioning and MergeTree settings. Frequent small inserts lead to many small parts, which increases cache pressure. Batch inserts where possible, and make sure background merges are keeping up so the part count does not keep growing.
- Why it works: Limits the total amount of RAM ClickHouse can use for caching on-disk data, forcing older, less-used data out of cache.
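To see which tables drive this, a sketch like the following groups active parts per table and then reads the current cache sizes. The metric names MarkCacheBytes and UncompressedCacheBytes are the ones exposed by recent releases in system.asynchronous_metrics; treat them as an assumption on older versions.

```sql
-- Tables with the most active parts, and the primary key data they hold in RAM.
SELECT
    database,
    table,
    count() AS active_parts,
    formatReadableSize(sum(bytes_on_disk)) AS on_disk,
    formatReadableSize(sum(primary_key_bytes_in_memory)) AS pk_in_memory
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY active_parts DESC
LIMIT 20;

-- Current size of the mark cache and the uncompressed cache.
SELECT metric, formatReadableSize(value) AS size
FROM system.asynchronous_metrics
WHERE metric IN ('MarkCacheBytes', 'UncompressedCacheBytes');
```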

ZooKeeper: If ClickHouse uses ZooKeeper (or ClickHouse Keeper) for replication or distributed DDL (ON CLUSTER) coordination, the ZooKeeper* metrics (ZooKeeperSession, ZooKeeperWatch, ZooKeeperRequest) show how busy the client is. They count sessions, watches, and in-flight requests rather than bytes, so treat them as a proxy for the client’s footprint.
- Diagnosis: Check whether the ZooKeeper* metrics in system.metrics stay consistently high. Note that system.zookeeper is for browsing znodes and requires a path filter in WHERE, so it will not show connection state.
- Fix: This is rarely the primary cause of OOMs unless the ZooKeeper connection is unhealthy or experiencing extreme churn. Ensure ZooKeeper is healthy. If it’s a consistently large consumer, replication metadata or distributed DDL metadata may be generated excessively. Check replication settings and distributed table usage.
- Why it works: Addresses potential issues in how ClickHouse interacts with ZooKeeper, though direct tuning here is less common than for other metrics.
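A rough health check, assuming replicated tables exist on the server, is to read the ZooKeeper client counters and then look for replicas with expired sessions or long queues:

```sql
-- Client-side ZooKeeper counters (sessions, watches, in-flight requests).
SELECT metric, value
FROM system.metrics
WHERE metric LIKE 'ZooKeeper%';

-- Replicas that are read-only, have lost their session, or are falling behind.
SELECT
    database,
    table,
    is_readonly,
    is_session_expired,
    queue_size,
    absolute_delay
FROM system.replicas
ORDER BY queue_size DESC
LIMIT 20;
```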

SystemMemory: This is a fallback metric and often reflects ClickHouse’s overall memory footprint that isn’t categorized elsewhere.
- Diagnosis: If SystemMemory is high and other specific metrics are not clearly dominating, it suggests memory is being used by less obvious components. This could include internal buffers, caches for query plans, or memory used by UDFs.
- Fix:
  - Review max_server_memory_usage (and max_server_memory_usage_to_ram_ratio) in config.xml. This is the server-wide ceiling; max_memory_usage in users.xml only limits a single query.
  - Check for custom dictionaries, UDFs, or heavy queries against system tables that might be allocating significant memory.
  - Ensure ClickHouse is running on an adequately sized instance. Sometimes, the server simply doesn’t have enough RAM for its intended workload.
- Why it works: Provides an overarching safety net by limiting the total memory ClickHouse can request from the OS.
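One way to judge the overall footprint is to compare what ClickHouse tracks internally with what the OS reports for the process. This sketch assumes the MemoryTracking metric and the MemoryResident asynchronous metric are both present, as they are in recent releases.

```sql
-- What ClickHouse believes it has allocated.
SELECT 'MemoryTracking' AS source, formatReadableSize(value) AS size
FROM system.metrics
WHERE metric = 'MemoryTracking'
UNION ALL
-- What the OS reports as resident for the server process.
SELECT 'MemoryResident' AS source, formatReadableSize(value) AS size
FROM system.asynchronous_metrics
WHERE metric = 'MemoryResident';
```

If the resident figure sits well above the tracked one and the gap keeps widening, the growth may be in memory the tracker does not account for, which per-query limits will not catch.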

Ephemeral: Memory used for temporary data structures during query execution, often related to sorting, hash tables, and intermediate results.
- Diagnosis: This metric is closely tied to QueryThread and specific query operations. Large GROUP BY operations, ORDER BY on unsorted data, or complex joins can spike this.
- Fix:
  - Optimize queries to reduce the need for large temporary structures. For example, pre-aggregate data or ensure ORDER BY clauses match data sorting.
  - Increase max_memory_usage if the query is legitimately complex and needs more memory, but be cautious.
  - Consider tuning max_block_size (though this is more about I/O efficiency) and max_insert_block_size as they can indirectly affect how much data is processed in memory at once.
- Why it works: By optimizing the query or providing more memory headroom, you allow these temporary structures to be built without exceeding system limits.
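When a query legitimately needs large temporary structures, an alternative to raising the limit is letting aggregation and sorting spill to disk. The sketch below assumes a 10 GB max_memory_usage and sets the spill thresholds to about half of it; the events table and user_id column are hypothetical and only illustrate the shape of a heavy GROUP BY.

```sql
-- Let GROUP BY and ORDER BY spill to disk past ~5 GB instead of failing.
SET max_bytes_before_external_group_by = 5000000000;
SET max_bytes_before_external_sort = 5000000000;

-- Hypothetical table and column, for illustration only.
SELECT user_id, count() AS hits
FROM events
GROUP BY user_id
ORDER BY hits DESC
LIMIT 100;
```

Spilling makes the query slower and needs free space in the server’s temporary path, so it suits occasional heavy reports better than hot-path queries.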

After applying these fixes, re-run the queries above. On a stable instance, the memory-limit counters in system.events stop growing and the kernel log shows no new OOM kills of the clickhouse-server process.