CockroachDB gets OOM-killed when the cockroach process holds more memory for its caches and query execution than the host can spare, and the kernel's OOM killer terminates it to reclaim RAM.

Common Causes of CockroachDB Memory Pressure and OOM Kills

  1. Excessive SQL memory budget (--max-sql-memory): This is the most common culprit. The budget caps the memory the SQL layer can use for temporary query data, such as prepared statements and intermediate rows during execution. If it is set too high for the node, the cockroach process can claim more RAM than the host can spare, leading to OOM kills.

    • Diagnosis: The budget is set per node with the --max-sql-memory flag of cockroach start, not with a cluster setting, so inspect each node's start command or service unit:
      ps -o args= -C cockroach
      
      Also, monitor memory usage per node using top or htop on the host, and look for the cockroach process. Compare its resident memory to the configured budget.
    • Fix: Lower --max-sql-memory and restart the node. A common starting point is 25% of total node memory, which is also the default. If you have a lot of RAM and a well-tuned workload, you might go higher, but start conservative.
      cockroach start --max-sql-memory=.25 ...
      
    • Why it works: This directly limits the amount of memory the SQL layer can use for query execution, preventing it from over-allocating and triggering the OS OOM killer.
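To make the comparison in the diagnosis step concrete, here is a minimal sketch that converts a percentage budget into bytes and checks an observed resident set size (RSS) against it. The function names, the 16 GiB node, and the 6 GiB RSS reading are illustrative assumptions, not values reported by CockroachDB.

```python
GIB = 1024 ** 3

def sql_budget_bytes(total_ram_bytes: int, percent: int) -> int:
    """Return the SQL memory budget in bytes for a percentage of total RAM."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return total_ram_bytes * percent // 100

def rss_over_budget(rss_bytes: int, total_ram_bytes: int, percent: int) -> bool:
    """True if the process's resident memory exceeds the configured budget."""
    return rss_bytes > sql_budget_bytes(total_ram_bytes, percent)

if __name__ == "__main__":
    total = 16 * GIB              # hypothetical node size
    budget = sql_budget_bytes(total, 25)
    print(f"SQL budget: {budget / GIB:.1f} GiB")
    # Hypothetical RSS reading taken from top or htop
    print("RSS over budget:", rss_over_budget(6 * GIB, total, 25))
```

The process's RSS also includes caches and runtime overhead, so treat a reading above the SQL budget as a prompt to look closer rather than proof of a misconfiguration.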
  2. Large max_சைன்_in_bytes and max_சைன்_out_bytes: These settings control the maximum size of in-memory data structures for query planning and execution. If you have very complex queries or large result sets being processed in memory, these can balloon.

    • Diagnosis: Monitor your slow query logs for queries that are taking a long time and potentially processing large amounts of data in memory. Check SHOW CLUSTER SETTING max_சைன்_in_bytes; and SHOW CLUSTER SETTING max_சைன்_out_bytes;.
    • Fix: Reduce max_சைன்_in_bytes and max_சைன்_out_bytes. Default values are often 512MB and 1GB respectively. A common reduction is to 256MB and 512MB.
      SET CLUSTER SETTING max_சைன்_in_bytes = '256MB';
      SET CLUSTER SETTING max_சைன்_out_bytes = '512MB';
      
    • Why it works: Limits the memory used for query optimization and intermediate data buffering, preventing single queries from monopolizing memory.
  3. Insufficient Node RAM / Over-provisioning for Node Count: You might simply not have enough RAM on your nodes for the workload and the configured CockroachDB memory settings.

    • Diagnosis: Check your node’s total RAM (free -h on Linux). Compare this to the sum of the SQL memory budget, max_சைன்_in_bytes, and the memory used by the OS and other processes. Also, look at the CockroachDB metrics for memory usage.
    • Fix: Increase the RAM on your nodes, or, if the cluster is spread across many small machines, consolidate onto fewer nodes with more RAM each.
      # Example: If nodes have 16GB RAM and you target 80% for CRDB, that's 12.8GB.
      # If the SQL memory budget is 50% of total RAM (8GB) and other components
      # use significant RAM, you'll hit limits.
      
    • Why it works: Provides more physical memory for the cockroach process to operate within, preventing the OS from needing to reclaim memory via OOM kills.
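The arithmetic in the example above can be sketched as a small headroom check. This is a rough model under stated assumptions: the 80% target, the helper name fits_in_ram, and the other_bytes figure (caches, OS, and other processes) are all hypothetical inputs you would replace with your own measurements.

```python
GB = 1000 ** 3

def fits_in_ram(total_ram_bytes: int, sql_percent: int, other_bytes: int,
                target_percent: int = 80) -> tuple[bool, int]:
    """Check whether the SQL budget plus other consumers fit under the
    target share of RAM. Returns (fits, headroom_bytes); headroom is
    negative when the node is over the limit."""
    limit = total_ram_bytes * target_percent // 100
    sql_budget = total_ram_bytes * sql_percent // 100
    headroom = limit - (sql_budget + other_bytes)
    return headroom >= 0, headroom

if __name__ == "__main__":
    # 16GB node, 50% SQL budget, 6GB assumed for everything else
    ok, headroom = fits_in_ram(16 * GB, 50, 6 * GB)
    print(ok, headroom / GB)
    # Same node with a 25% SQL budget
    ok, headroom = fits_in_ram(16 * GB, 25, 6 * GB)
    print(ok, headroom / GB)
```

A negative headroom means the configured budgets cannot all be satisfied at once, so either the budgets or the hardware has to change.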
  4. High max_சைன்_cache_size: This setting controls the size of the row cache. While beneficial for performance, an overly large row cache can consume significant memory, especially on nodes with large tables or high read traffic.

    • Diagnosis: Check SHOW CLUSTER SETTING max_சைன்_cache_size;. Monitor the "sys.exec_counters.row_cache_hit_count" and "sys.exec_counters.row_cache_miss_count" metrics to see if the cache is being effectively used.
    • Fix: Reduce max_சைன்_cache_size. A common starting point is 128MB or 256MB.
      SET CLUSTER SETTING max_சைன்_cache_size = '128MB';
      
    • Why it works: Limits the memory dedicated to caching individual rows, reducing the overall memory footprint of the SQL cache.
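To judge whether the cache is being used effectively, the hit and miss counters from the diagnosis step can be turned into a hit ratio. The sketch below assumes the counters are cumulative, so it compares two snapshots. The function names and sample numbers are hypothetical.

```python
def hit_ratio(hits: int, misses: int) -> float:
    """Fraction of lookups served from the cache; 0.0 when there were none."""
    total = hits + misses
    if total == 0:
        return 0.0
    return hits / total

def interval_hit_ratio(before: tuple[int, int], after: tuple[int, int]) -> float:
    """Hit ratio between two (hits, misses) snapshots of cumulative counters."""
    delta_hits = after[0] - before[0]
    delta_misses = after[1] - before[1]
    return hit_ratio(delta_hits, delta_misses)

if __name__ == "__main__":
    # Hypothetical snapshots taken a few minutes apart
    print(interval_hit_ratio((1000, 200), (1900, 300)))
```

If the ratio stays high after the cache is shrunk, the extra memory was not buying much. If it drops sharply, the smaller cache is trading OOM risk for slower reads.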
  5. Workload Saturation / Unoptimized Queries: Complex queries that perform full table scans, large joins, or extensive aggregations can generate massive intermediate results that strain memory, even with conservative settings.

    • Diagnosis: Use EXPLAIN on slow queries to identify inefficient execution plans. Look for scans, large sorts, or hash aggregations. Monitor the "sys.exec_counters.total_sql_rows_read" and "sys.exec_counters.total_sql_rows_written" metrics.
    • Fix: Optimize queries by adding appropriate indexes, rewriting queries to be more efficient, or breaking down complex operations.
      -- Example: Add an index to speed up a common filter
      CREATE INDEX IF NOT EXISTS my_table_idx ON my_table (column_name);
      
    • Why it works: Efficient queries produce smaller intermediate results and require fewer memory-intensive operations, thus reducing overall memory pressure.
  6. Background Compactions and Merges: CockroachDB tries to limit disk I/O and memory during background operations, but a sustained high rate of writes or data churn can still drive significant memory usage for in-memory write buffers (memtables) and compaction merge buffers.

    • Diagnosis: Monitor disk I/O metrics and look for sustained high write rates. Check the "sys.store.gc.bytes_age" metric to understand how much data is being garbage collected.
    • Fix: This is less about a direct setting and more about workload management and tuning. If write volume is consistently too high for your hardware, you may need to scale up your nodes or consider strategies to reduce write amplification.
      # No direct command, but monitor via CockroachDB Admin UI metrics
      
    • Why it works: By reducing the overall write load or ensuring sufficient I/O capacity, the background processes have more breathing room and consume less memory for their operations.
  7. Large max_சைன்_concurrency: If you have a very high number of concurrent SQL connections and complex queries running simultaneously, the cumulative memory usage for query execution contexts can become substantial.

    • Diagnosis: Check SHOW CLUSTER SETTING max_சைன்_concurrency;. Monitor the number of active SQL connections.
    • Fix: Reduce max_சைன்_concurrency if it’s set very high and you’re not actively using that many concurrent complex queries. A typical value might be 1000.
      SET CLUSTER SETTING max_சைன்_concurrency = 1000;
      
    • Why it works: Limits the total number of concurrent query executions, thereby capping the memory used by the execution engine across all active queries.
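The relationship between concurrency and memory described above can be estimated with simple arithmetic. The sketch below assumes an average per-query memory figure, which is a hypothetical input you would have to measure for your own workload. The function names are illustrative as well.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def worst_case_bytes(concurrency: int, per_query_bytes: int) -> int:
    """Memory needed if every concurrent query uses per_query_bytes at once."""
    return concurrency * per_query_bytes

def max_safe_concurrency(budget_bytes: int, per_query_bytes: int) -> int:
    """Largest number of concurrent queries that fits inside the budget."""
    if per_query_bytes <= 0:
        raise ValueError("per_query_bytes must be positive")
    return budget_bytes // per_query_bytes

if __name__ == "__main__":
    per_query = 16 * MIB  # assumed average; measure this on your workload
    print(worst_case_bytes(1000, per_query) / GIB)     # GiB needed at 1000 queries
    print(max_safe_concurrency(4 * GIB, per_query))    # queries that fit in 4 GiB
```

Real queries vary widely in memory use, so this gives an upper-bound estimate for capacity planning rather than a precise limit.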

Once memory pressure is fixed, the next error you are likely to hit is connection refused, if the cockroach process is still restarting frequently because of system instability, or context deadline exceeded, if the network or other services are struggling to keep up with the restarted node.
