An Elasticsearch JVM GC pause spike means the garbage collector stopped the entire JVM for a significant amount of time, preventing Elasticsearch from processing requests.
Common Causes and Fixes:
1. Insufficient Heap Size: The heap is too small for the workload, so the JVM is constantly trying to free up memory, leading to frequent and long garbage collection cycles.
   - Diagnosis: Check your Elasticsearch JVM heap settings in `jvm.options`. Look for `Xms` and `Xmx`. Also, monitor heap usage in Kibana’s Stack Monitoring or via the `GET _nodes/stats/jvm` API. If heap usage is consistently above 75-80%, the heap is likely too small.
   - Fix: Increase heap size. A common recommendation is to set `Xms` and `Xmx` to the same value, at most 50% of system RAM, and never above roughly 30.5GB (the limit for compressed ordinary object pointers; the exact threshold varies by JVM and platform). For example, to set a 16GB heap, use `-Xms16g` and `-Xmx16g`. Restart the Elasticsearch node after changing `jvm.options`.
   - Why it works: More heap gives the JVM more room to store objects, reducing how often it needs to clean up. The 30.5GB limit is a JVM optimization detail: beyond it, the JVM can no longer use compressed object pointers, so every object reference takes more memory.
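The heap check from the diagnosis step can be scripted. The following is a minimal sketch, assuming the `GET _nodes/stats/jvm` response has already been fetched and parsed into a dict. The function name `nodes_over_heap_threshold` and the sample data are hypothetical; `jvm.mem.heap_used_percent` is the field reported by the node stats API.

```python
def nodes_over_heap_threshold(stats, threshold_percent=75):
    """Return (node_name, heap_used_percent) pairs at or above the threshold."""
    flagged = []
    for node_id, node in stats.get("nodes", {}).items():
        heap_pct = node["jvm"]["mem"]["heap_used_percent"]
        if heap_pct >= threshold_percent:
            flagged.append((node.get("name", node_id), heap_pct))
    # Highest heap usage first
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical, trimmed-down response used only for illustration
sample_stats = {
    "nodes": {
        "abc123": {"name": "es-data-1", "jvm": {"mem": {"heap_used_percent": 82}}},
        "def456": {"name": "es-data-2", "jvm": {"mem": {"heap_used_percent": 41}}},
    }
}

print(nodes_over_heap_threshold(sample_stats))
```

A single reading says little, because heap usage naturally rises and falls between collections. Sampling repeatedly and looking for nodes that stay above the threshold is closer to the "consistently above 75-80%" signal.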
2. Excessive Indexing Load: High indexing rates, especially with large documents or complex mappings, can quickly fill the heap with new objects and temporary data structures.
   - Diagnosis: Monitor indexing throughput using `GET _cat/indices?v&h=index,docs.count,docs.deleted,store.size,pri.store.size,health,status,uuid` and `GET _nodes/stats/indices/indexing,refresh`, or Kibana’s Stack Monitoring. Look for a high rate of indexing requests per second and a rapid increase in `indices.refresh.total` and `indices.indexing.index_total`.
   - Fix:
     - Bulk API Optimization: Ensure you’re using the Bulk API effectively. Tune your client’s bulk settings, such as `bulk_size` (typically 5-15MB per request) and `bulk_actions` (typically 1000-5000 documents per request). These are client-side settings, and their exact names depend on the client.
     - Refresh Interval: Increase the `index.refresh_interval` for indices under heavy write load. For example, to set it to 30 seconds, send `PUT my-heavy-index/_settings` with the body `{ "index": { "refresh_interval": "30s" } }`. This reduces the frequency of Lucene segment creation, which is resource-intensive.
     - Shard Count: Too many shards per node can increase overhead. Aim for a reasonable number of shards per GB of heap (an older Elastic rule of thumb was fewer than 20).
   - Why it works: Efficient bulk indexing reduces per-request overhead. A longer refresh interval delays segment creation, giving the JVM more time between intensive indexing operations. Fewer shards mean less memory and CPU per node dedicated to shard management.
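To make the bulk sizing advice concrete, here is a minimal sketch of client-side batching that caps each Bulk API request by both action count and approximate payload size. The function name `bulk_batches`, the default caps, and the sample documents are assumptions for illustration; the output is the newline-delimited JSON body that the `_bulk` endpoint expects, and sending it is left to whichever client you use.

```python
import json


def bulk_batches(index_name, docs, max_actions=1000, max_bytes=5 * 1024 * 1024):
    """Yield NDJSON bulk bodies capped by action count and approximate size."""
    lines, actions, size = [], 0, 0
    for doc in docs:
        action = json.dumps({"index": {"_index": index_name}})
        source = json.dumps(doc)
        # +2 accounts for the newline after the action line and the source line
        entry_size = len(action.encode("utf-8")) + len(source.encode("utf-8")) + 2
        # Flush the current batch before it would exceed either cap
        if actions and (actions + 1 > max_actions or size + entry_size > max_bytes):
            yield "\n".join(lines) + "\n"
            lines, actions, size = [], 0, 0
        lines.extend([action, source])
        actions += 1
        size += entry_size
    if lines:
        yield "\n".join(lines) + "\n"


# Hypothetical documents used only for illustration
docs = [{"message": f"event {i}"} for i in range(2500)]
batches = list(bulk_batches("my-heavy-index", docs, max_actions=1000))
print(len(batches))
```

A document larger than `max_bytes` is still sent, in a batch of its own. The right caps depend on document size and cluster capacity, so treat the defaults here as a starting point for experiments rather than a recommendation.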
3. Large or Frequent Searches: Complex, broad, or aggregation-heavy searches can consume significant heap for result set processing, aggregations, and internal data structures.
   - Diagnosis: Use the `GET _nodes/stats/indices/search` API and Kibana’s Stack Monitoring to spot high search load and latency. Look for high values in `search.query_total`, `search.query_time_in_millis`, `search.fetch_total`, and `search.fetch_time_in_millis`.
   - Fix:
     - Query Optimization: Avoid `script_fields` and `wildcard` queries on high-cardinality fields. Use `_source` filtering to retrieve only necessary fields.
     - Pagination: Use `search_after` for deep pagination instead of `from`/`size`.
     - Profile API: Use the Profile API (`GET my-index/_search` with `"profile": true` in the request body) to pinpoint slow parts of a query.
     - Index Design: Consider using appropriate data types (e.g., `long` for counts, `date` for time series) and avoid aggregating on `text` fields where possible, since that requires heap-resident fielddata; use `keyword` fields instead.
   - Why it works: Optimized queries require less temporary memory and CPU to execute. `search_after` avoids collecting and sorting all preceding hits on every shard for each deep page. Profiling reveals specific bottlenecks. Proper data types ensure efficient storage and retrieval.
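The `search_after` pattern mentioned above can be sketched as two helpers that build request bodies. This is an illustrative sketch, not a full client: the function names and the `@timestamp` and `event_id` sort fields are hypothetical, and it assumes `event_id` is unique so it can act as a tiebreaker. The only Elasticsearch behavior relied on is that each hit of a sorted search carries its sort values in a `sort` array.

```python
import copy


def first_page_body(query, page_size=100):
    """Body for the first page; the sort needs a tiebreaker for stable paging."""
    return {
        "size": page_size,
        "query": query,
        # "@timestamp" and "event_id" are hypothetical field names
        "sort": [{"@timestamp": "asc"}, {"event_id": "asc"}],
    }


def next_page_body(previous_body, hits):
    """Body for the next page, or None when the previous page was empty."""
    if not hits:
        return None
    body = copy.deepcopy(previous_body)
    # Continue after the sort values of the last hit on the previous page
    body["search_after"] = hits[-1]["sort"]
    return body


# Hypothetical hits, shaped like response["hits"]["hits"] of a sorted search
page_one = first_page_body({"match_all": {}}, page_size=2)
sample_hits = [
    {"_id": "a", "sort": [1700000000000, "evt-001"]},
    {"_id": "b", "sort": [1700000000500, "evt-002"]},
]
page_two = next_page_body(page_one, sample_hits)
print(page_two["search_after"])
```

Each page costs roughly the same regardless of depth, because the request never asks the shards to skip over earlier hits the way a large `from` value does.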
4. High-Frequency Data Streams / Ingestion Pipelines: Complex ingest pipelines or frequent writes to data streams can generate a lot of temporary objects and state that the GC must clean up.
   - Diagnosis: Monitor `GET _nodes/stats/ingest` and review your ingest pipeline definitions for complex processors or loops. Check the frequency of document ingestion.
   - Fix: Simplify ingest pipelines, reduce the number of processors, and optimize them for performance. If possible, perform some transformations before sending data to Elasticsearch.
   - Why it works: Simpler pipelines create fewer temporary objects and less complex state, reducing the GC’s workload.
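To decide which pipelines to simplify first, the ingest stats can be ranked by average time per document. The sketch below assumes a parsed `GET _nodes/stats/ingest` response, whose per-pipeline entries include `count` and `time_in_millis`. The function name `slowest_pipelines`, the pipeline names, and the sample numbers are hypothetical.

```python
def slowest_pipelines(stats, top_n=5):
    """Rank ingest pipelines by average processing time per document (ms)."""
    totals = {}
    for node in stats.get("nodes", {}).values():
        pipelines = node.get("ingest", {}).get("pipelines", {})
        for name, p in pipelines.items():
            count, millis = totals.get(name, (0, 0))
            # Sum across nodes so the average reflects the whole cluster
            totals[name] = (count + p["count"], millis + p["time_in_millis"])
    ranked = [
        (name, millis / count)
        for name, (count, millis) in totals.items()
        if count > 0  # skip pipelines that have not processed anything
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical, trimmed-down response used only for illustration
sample_ingest = {
    "nodes": {
        "n1": {"ingest": {"pipelines": {
            "enrich-logs": {"count": 1000, "time_in_millis": 4000},
            "set-timestamp": {"count": 4000, "time_in_millis": 2000},
        }}},
        "n2": {"ingest": {"pipelines": {
            "enrich-logs": {"count": 1000, "time_in_millis": 6000},
        }}},
    }
}

print(slowest_pipelines(sample_ingest))
```

These counters are cumulative since node start, so comparing two snapshots taken some minutes apart gives a better picture of current behavior than a single reading.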
5. Too Many Open File Descriptors: While not directly a GC issue, this can lead to system instability that indirectly impacts JVM performance and can be mistaken for GC problems. Elasticsearch needs file descriptors for index files, network sockets, and so on.
   - Diagnosis: Check the `nofile` limit for the Elasticsearch user with `ulimit -n`. Compare this to the number of indices and shards. You can also check `GET _nodes/stats/process` for `open_file_descriptors` and compare it with `max_file_descriptors`.
   - Fix: Increase the `nofile` limit in `/etc/security/limits.conf` (or equivalent, such as `LimitNOFILE` in the unit file for systemd-managed installs) for the Elasticsearch user. For example, add the lines `elasticsearch soft nofile 65536` and `elasticsearch hard nofile 65536`. Restart Elasticsearch after changing limits.
   - Why it works: Sufficient file descriptors prevent the OS from failing to open necessary files for indices or network connections, ensuring smooth operation.
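The descriptor check can also be done from the node stats rather than on each host. This is a minimal sketch assuming a parsed `GET _nodes/stats/process` response, which reports `open_file_descriptors` and `max_file_descriptors` per node. The function name `fd_usage`, the 80% warning ratio, and the sample data are assumptions for illustration.

```python
def fd_usage(stats, warn_ratio=0.8):
    """Return (node_name, open, max, ratio) for nodes at or above warn_ratio."""
    flagged = []
    for node_id, node in stats.get("nodes", {}).items():
        proc = node["process"]
        open_fds = proc["open_file_descriptors"]
        max_fds = proc["max_file_descriptors"]
        # Skip nodes where the values are not available (reported as -1)
        if open_fds < 0 or max_fds <= 0:
            continue
        ratio = open_fds / max_fds
        if ratio >= warn_ratio:
            flagged.append((node.get("name", node_id), open_fds, max_fds, ratio))
    return flagged


# Hypothetical, trimmed-down response used only for illustration
sample_process = {
    "nodes": {
        "n1": {"name": "es-data-1",
               "process": {"open_file_descriptors": 60000, "max_file_descriptors": 65536}},
        "n2": {"name": "es-data-2",
               "process": {"open_file_descriptors": 1200, "max_file_descriptors": 65536}},
    }
}

for name, open_fds, max_fds, ratio in fd_usage(sample_process):
    print(f"{name}: {open_fds}/{max_fds} descriptors in use ({ratio:.0%})")
```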
6. Old Generation Full GC: If the old generation of the heap is constantly filling up, the JVM will perform more aggressive "Stop-the-World" Full GCs, which are very time-consuming. This often happens when young objects are being promoted to the old generation too quickly or when old objects are not being reclaimed efficiently.
   - Diagnosis: Use GC logging (enabled via `jvm.options` with `-Xlog:gc*`) or monitoring tools like Prometheus/Grafana with metrics such as `elasticsearch_jvm_gc_old_count` and `elasticsearch_jvm_gc_old_time_millis` (exact metric names depend on the exporter). Look for a high frequency and duration of `Full GC` events.
   - Fix:
     - Tuning GC Algorithm: Elasticsearch 7.x and later default to G1GC on recent JDKs (older setups use CMS). If you have specific tuning needs, experiment with G1GC parameters in `jvm.options` like `-XX:MaxGCPauseMillis` (e.g., `300`) or `-XX:G1HeapRegionSize`.
     - Object Allocation: The underlying cause is often excessive object allocation, pointing back to indexing load, search complexity, or inefficient data structures. Review points 2 and 3.
     - Heap Size: Ensure adequate heap size (point 1).
   - Why it works: Tuning `MaxGCPauseMillis` tells the GC to aim for shorter pauses, potentially at the cost of slightly more frequent collections. Optimizing object allocation reduces the pressure on the old generation. A larger heap provides more buffer.
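If GC logs or a metrics stack are not available, the node stats API offers a rough view of old-generation activity. The sketch below compares two parsed `GET _nodes/stats/jvm` snapshots using the `collection_count` and `collection_time_in_millis` fields under `jvm.gc.collectors.old`. The function name `old_gc_delta`, the node ID, and the sample numbers are hypothetical, and what the `old` collector counts depends on the garbage collector in use.

```python
def old_gc_delta(before, after, node_id):
    """Old-generation GC count and average pause (ms) between two snapshots."""
    def old(stats):
        return stats["nodes"][node_id]["jvm"]["gc"]["collectors"]["old"]

    count = old(after)["collection_count"] - old(before)["collection_count"]
    millis = (old(after)["collection_time_in_millis"]
              - old(before)["collection_time_in_millis"])
    # Avoid dividing by zero when no old-generation collection happened
    avg_pause = millis / count if count > 0 else 0.0
    return count, avg_pause


def snapshot(count, millis):
    """Build a hypothetical, trimmed-down stats response for illustration."""
    return {"nodes": {"n1": {"jvm": {"gc": {"collectors": {"old": {
        "collection_count": count,
        "collection_time_in_millis": millis,
    }}}}}}}


collections, avg_ms = old_gc_delta(snapshot(10, 5000), snapshot(14, 11000), "n1")
print(f"{collections} old-generation collections, {avg_ms:.0f} ms average")
```

An average hides outliers, so a single multi-second pause among many short ones will not stand out here. GC logs remain the more reliable source for individual pause durations.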
Once everything above is fixed, the next error you’ll likely encounter is `CircuitBreakingException`.