CockroachDB can get OOM-killed because it’s aggressively caching data and indexes in memory, and the OS decides it needs that RAM back.
Common Causes of CockroachDB Memory Pressure and OOM Kills
- Excessive `max_sql_memory_percent`: This is the most common culprit. CockroachDB’s SQL layer aggressively caches query results, execution plans, and intermediate data structures. If this is set too high, it can consume more RAM than available, leading to OOM kills.
  - Diagnosis: Check the current setting with `SHOW CLUSTER SETTING max_sql_memory_percent;`. Also, monitor memory usage per node using `top` or `htop` on the host, and look for the `cockroach` process. Compare this to the `max_sql_memory_percent` setting.
  - Fix: Reduce `max_sql_memory_percent` with `SET CLUSTER SETTING max_sql_memory_percent = 25;`. A common starting point is `25` (25% of total node memory). If you have a lot of RAM and a well-tuned workload, you might go higher, but start conservative.
  - Why it works: This directly limits the amount of memory the SQL server process can use for its caches and execution, preventing it from over-allocating and triggering the OS OOM killer.
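The comparison described in the diagnosis step can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and it assumes you have already read the node's total RAM and the `cockroach` process's resident memory (for example from `top` or `htop`) and converted both to bytes.

```python
GIB = 1024 ** 3


def sql_memory_budget_bytes(total_ram_bytes: int, percent: int) -> int:
    """Bytes the SQL layer may use when capped at `percent` of node RAM."""
    return total_ram_bytes * percent // 100


def usage_ratio(rss_bytes: int, total_ram_bytes: int, percent: int) -> float:
    """Observed resident memory divided by the SQL memory budget.

    A ratio well above 1.0 suggests the process holds far more memory than
    the SQL budget alone accounts for.
    """
    return rss_bytes / sql_memory_budget_bytes(total_ram_bytes, percent)


if __name__ == "__main__":
    # Hypothetical node: 16 GiB of RAM, 25% SQL budget, 6 GiB resident.
    budget = sql_memory_budget_bytes(16 * GIB, 25)
    print(f"SQL budget: {budget / GIB:.1f} GiB")
    print(f"RSS / budget: {usage_ratio(6 * GIB, 16 * GIB, 25):.2f}")
```

The ratio is a rough signal only, since the process also uses memory outside the SQL layer.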
- Large `max_சைன்_in_bytes` and `max_சைன்_out_bytes`: These settings control the maximum size of in-memory data structures for query planning and execution. If you have very complex queries or large result sets being processed in memory, these can balloon.
  - Diagnosis: Monitor your slow query logs for queries that are taking a long time and potentially processing large amounts of data in memory. Check `SHOW CLUSTER SETTING max_சைன்_in_bytes;` and `SHOW CLUSTER SETTING max_சைன்_out_bytes;`.
  - Fix: Reduce `max_சைன்_in_bytes` and `max_சைன்_out_bytes`. Default values are often `512MB` and `1GB` respectively. A common reduction is to `256MB` and `512MB`: `SET CLUSTER SETTING max_சைன்_in_bytes = '256MB';` and `SET CLUSTER SETTING max_சைன்_out_bytes = '512MB';`.
  - Why it works: Limits the memory used for query optimization and intermediate data buffering, preventing single queries from monopolizing memory.
- Insufficient Node RAM / Over-provisioning for Node Count: You might simply not have enough RAM on your nodes for the workload and the configured CockroachDB memory settings.
  - Diagnosis: Check your node’s total RAM (`free -h` on Linux). Compare this to the sum of `max_sql_memory_percent`, `max_சைன்_in_bytes`, and the memory used by the OS and other processes. Also, look at the CockroachDB metrics for memory usage.
  - Fix: Increase the RAM on your nodes, or reduce the number of nodes if you’re over-provisioned and can consolidate. For example, if nodes have 16GB RAM and you’re using 80% for CRDB, that’s 12.8GB. If `max_sql_memory_percent` is 50% (6.4GB) and other components use significant RAM, you’ll hit limits.
  - Why it works: Provides more physical memory for the `cockroach` process to operate within, preventing the OS from needing to reclaim memory via OOM kills.
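The 16GB example can be worked through as a small calculation. The sketch below is an assumption-laden illustration, not a sizing tool: the function name is hypothetical, and the 7GB figure for "other components" is invented purely to show what a negative headroom looks like.

```python
def memory_breakdown_gb(node_ram_gb: float, crdb_fraction: float,
                        sql_fraction: float, other_gb: float) -> dict:
    """Split a node's RAM into the pieces discussed in the text.

    crdb_fraction: share of node RAM given to CockroachDB (0.8 = 80%).
    sql_fraction:  share of that allowance used by the SQL layer.
    other_gb:      hypothetical memory used by everything else in the process.
    """
    crdb_allowance = node_ram_gb * crdb_fraction
    sql_budget = crdb_allowance * sql_fraction
    headroom = crdb_allowance - sql_budget - other_gb
    return {
        "crdb_allowance_gb": crdb_allowance,
        "sql_budget_gb": sql_budget,
        "headroom_gb": headroom,
    }


if __name__ == "__main__":
    # 16GB node, 80% for CRDB, 50% of that for SQL, 7GB (assumed) for the rest.
    for name, value in memory_breakdown_gb(16, 0.8, 0.5, 7.0).items():
        print(f"{name}: {value:.1f}")
```

A negative `headroom_gb` means the configured budgets plus other usage exceed the allowance, which is the situation in which the node hits its limits.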
- High `max_சைன்_cache_size`: This setting controls the size of the row cache. While beneficial for performance, an overly large row cache can consume significant memory, especially on nodes with large tables or high read traffic.
  - Diagnosis: Check `SHOW CLUSTER SETTING max_சைன்_cache_size;`. Monitor the "sys.exec_counters.row_cache_hit_count" and "sys.exec_counters.row_cache_miss_count" metrics to see if the cache is being effectively used.
  - Fix: Reduce `max_சைன்_cache_size`. A common starting point is `128MB` or `256MB`: `SET CLUSTER SETTING max_சைன்_cache_size = '128MB';`.
  - Why it works: Limits the memory dedicated to caching individual rows, reducing the overall memory footprint of the SQL cache.
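One way to judge whether the cache is "being effectively used" is to turn the hit and miss counters into a hit ratio. The sketch below assumes you have already read the two counter values from your metrics system; the function name is hypothetical.

```python
from typing import Optional


def row_cache_hit_ratio(hit_count: int, miss_count: int) -> Optional[float]:
    """Fraction of cache lookups that were hits, or None if there were none."""
    total = hit_count + miss_count
    if total == 0:
        return None  # no lookups yet, so the ratio is undefined
    return hit_count / total


if __name__ == "__main__":
    # Hypothetical counter values sampled from the two metrics.
    print(row_cache_hit_ratio(900, 100))
```

A low ratio on a large cache suggests the memory is not buying much, which supports shrinking it; a high ratio suggests reducing the cache more cautiously.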
- Workload Saturation / Unoptimized Queries: Complex queries that perform full table scans, large joins, or extensive aggregations can generate massive intermediate results that strain memory, even with conservative settings.
  - Diagnosis: Use `EXPLAIN` on slow queries to identify inefficient execution plans. Look for scans, large sorts, or hash aggregations. Monitor the "sys.exec_counters.total_sql_rows_read" and "sys.exec_counters.total_sql_rows_written" metrics.
  - Fix: Optimize queries by adding appropriate indexes, rewriting queries to be more efficient, or breaking down complex operations. For example, to add an index that speeds up a common filter: `CREATE INDEX IF NOT EXISTS my_table_idx ON my_table (column_name);`.
  - Why it works: Efficient queries produce smaller intermediate results and require fewer memory-intensive operations, thus reducing overall memory pressure.
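When reviewing many plans, a small helper can flag the ones that contain full scans. The sketch below assumes you have captured `EXPLAIN` output as text and that full table scans appear with a `FULL SCAN` marker; the exact output format depends on the CockroachDB version, and the sample plan is made up for illustration.

```python
from typing import List


def find_full_scans(explain_text: str) -> List[int]:
    """Return 1-based line numbers of plan lines that mention a full scan."""
    return [
        line_no
        for line_no, line in enumerate(explain_text.splitlines(), start=1)
        if "FULL SCAN" in line
    ]


if __name__ == "__main__":
    # Hypothetical plan text; real EXPLAIN output will look different.
    sample_plan = "scan\n  table: my_table\n  spans: FULL SCAN\n"
    print(find_full_scans(sample_plan))
```

Queries flagged this way are candidates for an index on the filtered column.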
- Background Compactions and Merges: While CockroachDB tries to manage disk I/O and memory during background operations, a very high rate of writes or data churn can lead to significant memory usage for in-memory SSTables and merge buffers.
  - Diagnosis: Monitor disk I/O metrics and look for sustained high write rates. Check the "sys.store.gc.bytes_age" metric to understand how much data is being garbage collected.
  - Fix: This is less about a direct setting and more about workload management and tuning. If write volume is consistently too high for your hardware, you may need to scale up your nodes or consider strategies to reduce write amplification. There is no direct command for this; monitor it via the CockroachDB Admin UI metrics.
  - Why it works: By reducing the overall write load or ensuring sufficient I/O capacity, the background processes have more breathing room and consume less memory for their operations.
- Large `max_சைன்_concurrency`: If you have a very high number of concurrent SQL connections and complex queries running simultaneously, the cumulative memory usage for query execution contexts can become substantial.
  - Diagnosis: Check `SHOW CLUSTER SETTING max_சைன்_concurrency;`. Monitor the number of active SQL connections.
  - Fix: Reduce `max_சைன்_concurrency` if it’s set very high and you’re not actively using that many concurrent complex queries. A typical value might be `1000`: `SET CLUSTER SETTING max_சைன்_concurrency = 1000;`.
  - Why it works: Limits the total number of concurrent query executions, thereby capping the memory used by the execution engine across all active queries.
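The cumulative effect of concurrency is simple multiplication, sketched below. The per-query memory figure is a hypothetical input you would estimate from your own workload; CockroachDB does not report it under this name, and the function names are illustrative.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def worst_case_query_memory_bytes(concurrency: int, per_query_bytes: int) -> int:
    """Memory needed if every allowed query ran at once at the given size."""
    return concurrency * per_query_bytes


def max_concurrency_for_budget(budget_bytes: int, per_query_bytes: int) -> int:
    """Largest concurrency whose worst case still fits in the budget."""
    return budget_bytes // per_query_bytes


if __name__ == "__main__":
    # Hypothetical: 1000 concurrent queries at 8 MiB each, 4 GiB budget.
    needed = worst_case_query_memory_bytes(1000, 8 * MIB)
    print(f"worst case: {needed / GIB:.2f} GiB")
    print(f"fits in 4 GiB up to: {max_concurrency_for_budget(4 * GIB, 8 * MIB)}")
```

Under these assumed numbers, 1000 concurrent queries would need more than the 4 GiB budget, which is the kind of gap that lowering the concurrency limit closes.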
The next error you’ll hit after fixing memory pressure is likely a `connection refused` if the `cockroach` process is still restarting too frequently due to system instability, or a `context deadline exceeded` if the network or other services are struggling to keep up with the restarted node.