Elasticsearch nodes are crashing with OutOfMemoryError: Java heap space or experiencing severe performance degradation due to excessive garbage collection (GC).
Here’s why this happens and how to fix it:
Common Causes and Fixes
1. Heap Size Too Small for Workload
The most common reason is simply not giving the JVM enough heap for Elasticsearch to manage its indices, caches, and in-flight operations. Elasticsearch is memory-hungry.
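- Diagnosis: Check the current JVM heap settings in your `jvm.options` file (usually `/etc/elasticsearch/jvm.options` for package installs, or the `config` directory of an archive install) and in any custom override files under `jvm.options.d/`. Look for lines starting with `-Xms` and `-Xmx`:
```sh
grep -rE '^-Xm[sx]' /etc/elasticsearch/jvm.options /etc/elasticsearch/jvm.options.d/
```
If neither setting is present, Elasticsearch 7.11 and later size the heap automatically based on the node’s roles and total memory.
- Fix: Increase both `-Xms` (initial heap size) and `-Xmx` (maximum heap size) to the same value. A common recommendation is no more than 50% of system RAM, and below the threshold for compressed ordinary object pointers ("compressed oops"): 26GB is safe on most systems, and some allow up to about 30GB. On Elasticsearch 7.7 and later, put the values in a file ending in `.options` under `jvm.options.d/` rather than editing `jvm.options` itself. For example, if you have 16GB of RAM, set both to `8g`:
```text
-Xms8g
-Xmx8g
```
Restart Elasticsearch for the change to take effect, then check the startup log for a line like `heap size [...], compressed ordinary object pointers [true]`.
- Why it works: A larger heap gives Elasticsearch more room for its in-memory data structures and caches, so the JVM collects garbage less often and is less likely to run out of heap. Setting `-Xms` and `-Xmx` to the same value prevents the JVM from resizing the heap at runtime, which can cause performance hiccups.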
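The sizing rule above can be turned into a small helper. This is a minimal Python sketch, not an Elastic tool: it reads `MemTotal` from `/proc/meminfo` text (so it assumes Linux), takes half of it, and caps the result at 26GB as a conservative stand-in for the compressed oops threshold. The cap, the function names, and the rounding are assumptions for illustration.

```python
# Sketch: derive matching -Xms/-Xmx lines from total RAM (Linux).
# Assumptions: the "no more than 50% of RAM" rule, and a 26 GB cap as a
# conservative stand-in for the compressed-oops threshold on most systems.

HEAP_CAP_MB = 26 * 1024  # assumed cap; the real threshold varies by JVM/OS


def memtotal_mb(meminfo_text: str) -> int:
    """Return MemTotal in MiB from /proc/meminfo contents (reported in kB)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) // 1024
    raise ValueError("MemTotal not found in meminfo text")


def heap_options(total_ram_mb: int) -> list[str]:
    """Half of RAM, capped at HEAP_CAP_MB, as identical -Xms/-Xmx lines."""
    heap_mb = min(total_ram_mb // 2, HEAP_CAP_MB)
    # Use whole gigabytes when the value divides evenly, otherwise megabytes.
    size = f"{heap_mb // 1024}g" if heap_mb % 1024 == 0 else f"{heap_mb}m"
    return [f"-Xms{size}", f"-Xmx{size}"]
```

On a Linux node, `heap_options(memtotal_mb(open("/proc/meminfo").read()))` returns lines you could place in a `jvm.options.d/` file. Because `MemTotal` is slightly below the nominal RAM size, a "16GB" machine will usually come out just under `8g` (expressed in `m` units), which errs on the safe side.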
2. Heap Size Too Large, Causing Swapping
Conversely, allocating too much heap can be just as bad. If you allocate more than 50% of system RAM, or more than the OS can comfortably manage alongside its own needs, the system might start swapping memory to disk. Swapping is catastrophic for Elasticsearch performance.
- Diagnosis: Monitor system memory usage with `free -m` or `top` and look for significant swap usage; `vmstat 1` shows active swapping in its `si` and `so` columns. On systems with systemd, `systemctl status elasticsearch` and `journalctl -u elasticsearch` might show related errors or warnings if the service is struggling due to resource contention.
- Fix: Reduce the heap size in `jvm.options` (or your `jvm.options.d/` override) to no more than 50% of total system RAM, and keep it below the compressed oops threshold. For instance, on a 32GB machine, a 12g heap is a conservative starting point that leaves extra headroom for other processes:
```text
-Xms12g
-Xmx12g
```
Restart Elasticsearch.
- Why it works: By staying within reasonable memory limits, you prevent the OS from resorting to slow disk swapping, ensuring Elasticsearch’s data and operations reside in fast RAM.
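To check the swap and sizing conditions from a script rather than by eye, the sketch below parses `/proc/meminfo` text. It assumes a Linux host, and the function names are hypothetical. It only reports how much swap is in use, so pair it with `vmstat` if you need to know whether pages are actively moving.

```python
# Sketch: read swap usage and total RAM from /proc/meminfo text.
# Assumes Linux, where every /proc/meminfo value is an integer (mostly in kB).


def meminfo_kb(meminfo_text: str) -> dict[str, int]:
    """Parse lines like 'SwapTotal:  2097148 kB' into {'SwapTotal': 2097148}."""
    values: dict[str, int] = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])
    return values


def swap_used_mb(meminfo_text: str) -> int:
    """Swap currently in use, in MiB (0 if the host has no swap configured)."""
    info = meminfo_kb(meminfo_text)
    return (info.get("SwapTotal", 0) - info.get("SwapFree", 0)) // 1024


def heap_exceeds_half_ram(heap_mb: int, meminfo_text: str) -> bool:
    """True if the configured heap is more than 50% of physical RAM."""
    total_mb = meminfo_kb(meminfo_text)["MemTotal"] // 1024
    return heap_mb > total_mb // 2
```

For example, `swap_used_mb(open("/proc/meminfo").read())` on a node gives the swap in use; any sustained non-zero value on an Elasticsearch node is worth investigating.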
3. Inadequate System Memory (Overall)
Even with a correctly sized JVM heap, if the overall system doesn’t have enough RAM for the OS, file system cache, and Elasticsearch’s JVM heap combined, you’ll still face issues. Elasticsearch relies heavily on the OS’s file system cache for performance.
- Diagnosis: Use `free -m` to check total RAM and available memory. On a system with 32GB of RAM and a 16GB Elasticsearch heap, only 16GB remains to be shared by the OS, the file system cache, the JVM’s own off-heap memory, and any other processes, so memory-hungry neighbors can leave little room for caching index files.
- Fix: Add more RAM to the server or reduce the overall memory footprint of other processes running on the same node. If possible, dedicate nodes solely to Elasticsearch.
- Why it works: Sufficient system RAM allows the OS to effectively cache index files, and provides ample space for the JVM heap without contention.
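The memory budget behind this diagnosis is simple subtraction, shown here as a small worked sketch. The off-heap and "other processes" figures are assumptions you would estimate for your own node (for example from `top` or `ps`); what remains is roughly what the file system cache can use.

```python
# Worked arithmetic for the RAM budget of an Elasticsearch node.
# other_gb and jvm_offheap_gb are your own estimates (assumptions), e.g. from top/ps.


def fs_cache_budget_gb(
    total_ram_gb: float,
    heap_gb: float,
    other_gb: float,
    jvm_offheap_gb: float = 0.0,
) -> float:
    """RAM left for the OS file system cache after the heap, JVM off-heap
    memory (metaspace, thread stacks, direct buffers) and other processes."""
    remaining = total_ram_gb - heap_gb - jvm_offheap_gb - other_gb
    if remaining < 0:
        raise ValueError("heap plus other memory users exceed physical RAM")
    return remaining


# 32GB node, 16GB heap, ~1GB JVM off-heap, ~3GB for the OS and other agents
print(fs_cache_budget_gb(32, 16, other_gb=3, jvm_offheap_gb=1))  # -> 12
```

If the result is small relative to the size of the indices the node serves, expect more disk reads and slower searches even when the heap itself is healthy.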
4. Incorrect bootstrap.memory_lock Setting
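If `bootstrap.memory_lock` is not enabled, the operating system is free to page parts of the JVM heap out to swap, even if you’ve allocated a reasonable heap size. (On dedicated nodes, disabling swap entirely, temporarily with `sudo swapoff -a` and permanently by removing swap entries from `/etc/fstab`, is an alternative that avoids the problem altogether.)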
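- Diagnosis: Check your `elasticsearch.yml` file for the `bootstrap.memory_lock` setting, then ask the running cluster whether each node actually managed to lock its memory. Elasticsearch also logs a warning at startup (and production-mode nodes fail their bootstrap checks) when the lock cannot be acquired.
```sh
grep bootstrap.memory_lock /etc/elasticsearch/elasticsearch.yml
curl -X GET "localhost:9200/_nodes?filter_path=**.mlockall&pretty"
```
- Fix: Ensure `bootstrap.memory_lock: true` is set in `elasticsearch.yml`. You also need to configure the OS to allow the `elasticsearch` user to lock memory. On Linux, when Elasticsearch is started from a shell, this typically means adding `memlock` limits for that user in `/etc/security/limits.d/elasticsearch.conf` (or `limits.conf`). When Elasticsearch runs as a systemd service, `limits.conf` is not applied; instead, set `LimitMEMLOCK=infinity` in the `[Service]` section of a unit override (for example with `sudo systemctl edit elasticsearch`) and run `sudo systemctl daemon-reload`. There is no separate `mlockall` option in `jvm.options`: Elasticsearch calls `mlockall` itself when `bootstrap.memory_lock` is enabled.
```yaml
# elasticsearch.yml
bootstrap.memory_lock: true
```
```text
# /etc/security/limits.d/elasticsearch.conf (or limits.conf)
elasticsearch soft memlock unlimited
elasticsearch hard memlock unlimited
```
Restart Elasticsearch and potentially the OS or the user session for limit changes to apply. The JVM options need nothing lock-specific. A minimal override file might look like the following; comments in these files go on their own lines, starting with `#`. The CMS collector flags (`-XX:+UseConcMarkSweepGC` and related options) found in older examples are obsolete on JDK 14 and later, which current Elasticsearch releases bundle and run with G1GC, so leave them out.
```text
# /etc/elasticsearch/jvm.options.d/heap.options
-Xms4g
-Xmx4g
-XX:+HeapDumpOnOutOfMemoryError
# The heap dump directory must exist and be writable by the elasticsearch user
-XX:HeapDumpPath=/var/lib/elasticsearch/heapdumps
-XX:+ExitOnOutOfMemoryError
# Optional: commit every heap page at startup (slower startup, no page faults later)
-XX:+AlwaysPreTouch
```
- Why it works: With `bootstrap.memory_lock: true`, Elasticsearch calls `mlockall` at startup to lock the JVM’s memory into RAM, which prevents the operating system from swapping the heap to disk. `-XX:+AlwaysPreTouch` is not required for memory locking; it makes the JVM touch every heap page at startup so the whole heap is committed to physical memory up front instead of on first use.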
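Whether each node actually locked its memory can be checked with `GET _nodes?filter_path=**.mlockall`, and on a large cluster it helps to read that response programmatically. A minimal sketch, assuming the response shape in which each node ID maps to `{"process": {"mlockall": ...}}`; the function name is hypothetical.

```python
import json


def nodes_without_mlockall(response_body: str) -> list[str]:
    """Return node IDs whose process.mlockall is not true.

    Expects the JSON returned by GET _nodes?filter_path=**.mlockall, shaped like
    {"nodes": {"<node-id>": {"process": {"mlockall": true}}}}.
    """
    nodes = json.loads(response_body).get("nodes", {})
    return sorted(
        node_id
        for node_id, info in nodes.items()
        # A missing flag is treated the same as a failed lock.
        if not info.get("process", {}).get("mlockall", False)
    )
```

Feed it the body returned by the `curl` call (for example via `sys.stdin.read()` in a small wrapper script); any node ID it returns did not lock its memory and should have its OS limits checked.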
5. Too Many Shards Per Node
Each shard is a Lucene index with its own overhead: heap for segment and mapping metadata, plus open file handles. A high number of shards, especially on nodes with limited RAM, can lead to excessive memory pressure.
- Diagnosis: Use the cat allocation API to see how many shards each node holds, and the Nodes Stats API to see per-node segment statistics.
```sh
curl -X GET "localhost:9200/_cat/allocation?v"
curl -X GET "localhost:9200/_nodes/stats/indices/segments?pretty"
```
- Fix: Reduce the number of shards per node. This can be achieved by:
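- Consolidating indices, for example by reindexing many small indices into fewer larger ones, using the Shrink API (or ILM’s shrink action) to reduce the primary shard count of read-only indices, and using ILM rollover with size-based conditions so new indices don’t stay tiny.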
- Increasing the number of nodes in your cluster.
- Re-evaluating your indexing strategy to use fewer shards per index.
- Why it works: Fewer shards per node means less memory overhead per node, distributing the load more evenly and reducing the chance of any single node running out of memory.
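To spot which nodes carry too many shards, the sketch below reads `GET _cat/allocation?format=json` output and compares each node against a shards-per-GB-of-heap budget. The default of 20 shards per GB of heap comes from older Elastic sizing guidance (newer guidance uses a different, mapping-based estimate), so treat it as an adjustable assumption rather than a hard limit. It also assumes unassigned shards appear as a row whose node is `UNASSIGNED`; the function names are hypothetical.

```python
import json


def shards_by_node(allocation_json: str) -> dict[str, int]:
    """Map node name -> shard count from GET _cat/allocation?format=json.

    The cat API returns numbers as strings; the row for unassigned shards
    (node "UNASSIGNED", an assumption about the output) is skipped.
    """
    rows = json.loads(allocation_json)
    return {
        row["node"]: int(row.get("shards") or 0)
        for row in rows
        if row.get("node") != "UNASSIGNED"
    }


def nodes_over_budget(
    shards: dict[str, int], heap_gb: float, shards_per_gb_heap: float = 20.0
) -> list[str]:
    """Nodes holding more shards than heap_gb * shards_per_gb_heap.

    The 20-per-GB default mirrors older Elastic guidance and is an assumption.
    """
    limit = heap_gb * shards_per_gb_heap
    return sorted(node for node, count in shards.items() if count > limit)
```

With an 8GB heap on every node, for instance, the budget is 160 shards per node, so a node reporting 450 shards would be flagged as a candidate for consolidation or rebalancing.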
6. Inefficient Indexing or Querying
Certain indexing patterns or complex queries can temporarily consume large amounts of heap memory. Examples include large bulk requests, complex or high-cardinality aggregations, and queries that load large data structures into the heap, such as fielddata on `text` fields.
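- Diagnosis: Use the Nodes Hot Threads API to identify the busiest threads (it reports CPU activity rather than heap, but expensive aggregations and bulk indexing usually show up there), check circuit breaker statistics for request-driven heap pressure, and analyze slow logs for problematic queries.
```sh
curl -X GET "localhost:9200/_nodes/hot_threads"
curl -X GET "localhost:9200/_nodes/stats/breaker?pretty"
```
- Fix:
- Break down large bulk requests into smaller batches.
- Optimize complex aggregations or queries.
- Use `search_after` (with a point in time on 7.10 and later) for deep pagination instead of large `from`/`size` values; Elastic no longer recommends the `scroll` API for deep pagination.
- Ensure your mapping is efficient and avoid mapping explosions from uncontrolled dynamic mapping (for example, by setting `dynamic` to `false` or `strict` on indices that receive arbitrary fields).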
- Why it works: By making indexing and querying operations more efficient, you reduce the peak memory demands placed on the JVM heap.
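Splitting a large bulk payload into smaller requests has one subtlety: each action line must stay with its source line (`index`, `create`, and `update` have one; `delete` does not). The sketch below splits an NDJSON body by an approximate byte budget while respecting that pairing. `max_bytes` is a placeholder you would tune for your cluster, and the function is illustrative rather than part of any Elasticsearch client; the official Python client's `helpers.streaming_bulk` offers similar chunking through its `chunk_size` and `max_chunk_bytes` parameters.

```python
import json
from collections.abc import Iterator

# Bulk actions whose metadata line is followed by a source/doc line;
# "delete" has no second line.
ACTIONS_WITH_SOURCE = {"index", "create", "update"}


def bulk_batches(ndjson_lines: list[str], max_bytes: int) -> Iterator[str]:
    """Yield NDJSON _bulk bodies of roughly max_bytes each.

    Keeps each action line together with its source line and ends every body
    with the trailing newline the _bulk API requires. A single action larger
    than max_bytes is still emitted, in a batch of its own.
    """
    batch: list[str] = []
    batch_bytes = 0
    i = 0
    while i < len(ndjson_lines):
        action = next(iter(json.loads(ndjson_lines[i])))  # e.g. "index"
        width = 2 if action in ACTIONS_WITH_SOURCE else 1
        item = ndjson_lines[i:i + width]
        item_bytes = sum(len(line.encode("utf-8")) + 1 for line in item)
        if batch and batch_bytes + item_bytes > max_bytes:
            yield "\n".join(batch) + "\n"
            batch, batch_bytes = [], 0
        batch.extend(item)
        batch_bytes += item_bytes
        i += width
    if batch:
        yield "\n".join(batch) + "\n"
```

Each yielded body can be sent as its own `POST _bulk` request with `Content-Type: application/x-ndjson`. Check the `errors` flag in every response, since a bulk request can succeed overall while individual items fail.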
After ensuring your JVM heap is correctly set and system resources are adequate, the next common issue you’ll encounter is `CircuitBreakingException: [parent] Data too large...`, indicating that Elasticsearch rejected a request because serving it would have pushed heap usage past a circuit breaker limit.