An Elasticsearch JVM GC pause spike means the garbage collector stopped the entire JVM for a significant amount of time, preventing Elasticsearch from processing requests.

Common Causes and Fixes:

  1. Insufficient Heap Size: The JVM is constantly trying to free up memory, leading to frequent and long garbage collection cycles.

    • Diagnosis: Check the heap settings, -Xms and -Xmx, in jvm.options (or in a file under jvm.options.d/ on recent versions). Also monitor heap usage in Kibana’s Stack Monitoring or via the GET _nodes/stats/jvm API (jvm.mem.heap_used_percent). If heap usage stays above 75-80% even after garbage collections, the heap is likely too small.
    • Fix: Increase heap size. A common recommendation is to set -Xms and -Xmx to the same value, at most 50% of system RAM, and no larger than about 30.5GB so the JVM can still use compressed ordinary object pointers (the exact cutoff varies by JVM and platform). For example, to set a 16GB heap:
      -Xms16g
      -Xmx16g
      
      Restart the Elasticsearch node after changing jvm.options.
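    • Why it works: More heap gives the JVM more room to store objects, so it needs to collect less often. The other half of RAM is left for the operating system’s filesystem cache, which Lucene relies on. Above the compressed-oops cutoff, the JVM switches from 32-bit to 64-bit object references, so a slightly larger heap can actually hold fewer objects.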
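
The sizing rule above can be sketched as a small helper. This is a minimal sketch, not an official formula: the function names are hypothetical, and the 30.5GB cap and 75% threshold are simply the figures quoted in this section.

```python
import math
from typing import List

# Cap taken from the guidance above; the real compressed-oops cutoff
# varies by JVM and platform, so treat this value as an assumption.
COMPRESSED_OOPS_CAP_GB = 30.5


def recommended_heap_gb(system_ram_gb: float) -> int:
    """Return a heap size in whole GB: half of RAM, capped, rounded down."""
    half_of_ram = system_ram_gb / 2
    return max(1, math.floor(min(half_of_ram, COMPRESSED_OOPS_CAP_GB)))


def jvm_options_lines(heap_gb: int) -> List[str]:
    """Render matching -Xms/-Xmx lines for jvm.options."""
    return [f"-Xms{heap_gb}g", f"-Xmx{heap_gb}g"]


def heap_is_under_pressure(heap_used_percent: float, threshold: float = 75.0) -> bool:
    """Flag heap usage at or above the threshold mentioned in the text."""
    return heap_used_percent >= threshold
```

The value passed to heap_is_under_pressure would typically be jvm.mem.heap_used_percent from GET _nodes/stats/jvm, sampled over time rather than read once.
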
  2. Excessive Indexing Load: High indexing rates, especially with large documents or complex mappings, can quickly consume heap memory with new objects and temporary data structures.

    • Diagnosis: Monitor index growth with GET _cat/indices?v&h=index,health,status,uuid,docs.count,docs.deleted,store.size,pri.store.size and indexing activity with GET _nodes/stats/indices/indexing,refresh or Kibana’s Stack Monitoring. Look for a high indexing rate and rapidly increasing indices.indexing.index_total and indices.refresh.total counters.
    • Fix:
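      • Bulk API Optimization: Ensure you’re using the Bulk API effectively. Tune the client-side batch size: requests of roughly 5-15MB or 1000-5000 actions are a common starting point (the exact setting names, such as bulk_size or bulk_actions, depend on your client or shipper).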
      • Refresh Interval: Increase the index.refresh_interval for indices under heavy write load. For example, to set it to 30 seconds:
        PUT my-heavy-index/_settings
        {
          "index": {
            "refresh_interval": "30s"
          }
        }
        
        This reduces the frequency of Lucene segment creation, which is resource-intensive.
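      • Shard Count: Too many shards per node increase overhead, because every shard keeps segment and metadata structures on the heap. Keep the number of shards per node proportionate to heap size; Elastic’s older rule of thumb was to stay under about 20 shards per GB of heap.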
    • Why it works: Efficient bulk indexing reduces overhead. A longer refresh interval delays segment creation, giving the JVM more time between intensive indexing operations. Fewer shards mean less memory and CPU per node dedicated to shard management.
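
To illustrate the batching advice, here is a minimal sketch that groups documents into Bulk API request bodies bounded by both action count and byte size. The function name bulk_batches and its default limits are assumptions for illustration; official clients ship their own bulk helpers, which are usually the better choice.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def bulk_batches(
    index: str,
    docs: Iterable[Dict[str, Any]],
    max_actions: int = 1000,
    max_bytes: int = 5 * 1024 * 1024,
) -> Iterator[str]:
    """Yield NDJSON bodies for POST _bulk, each within both limits.

    A single document larger than max_bytes still gets its own batch.
    """
    lines: List[str] = []
    actions = 0
    size = 0
    for doc in docs:
        action = json.dumps({"index": {"_index": index}})
        source = json.dumps(doc)
        # Each action/source pair contributes two lines and two newlines.
        pair_size = len(action.encode()) + len(source.encode()) + 2
        if actions and (actions + 1 > max_actions or size + pair_size > max_bytes):
            yield "\n".join(lines) + "\n"
            lines, actions, size = [], 0, 0
        lines += [action, source]
        actions += 1
        size += pair_size
    if lines:
        # The Bulk API requires the body to end with a newline.
        yield "\n".join(lines) + "\n"
```

Each yielded string can be sent as the body of one POST _bulk request with the Content-Type application/x-ndjson.
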
  3. Large or Frequent Searches: Complex, broad, or aggregations-heavy searches can consume significant heap for result set processing, aggregations, and internal data structures.

    • Diagnosis: Use the GET _nodes/stats/indices/search API and Kibana’s Stack Monitoring to spot high search load and latency. Look at search.query_total and search.query_time_in_millis, and at search.fetch_total and search.fetch_time_in_millis; dividing time by count gives the average latency of each phase. To identify the individual slow queries, enable the search slow log.
    • Fix:
      • Query Optimization: Avoid script_fields and wildcard queries on high-cardinality fields. Use _source filtering to retrieve only necessary fields.
      • Pagination: Use search_after for deep pagination instead of from/size.
      • Profile API: Set "profile": true in the body of a GET my-index/_search request to pinpoint slow parts of a query.
      • Index Design: Use appropriate data types (e.g., long for counts, date for timestamps), and aggregate on keyword fields rather than enabling fielddata on text fields, which loads the field’s terms onto the heap.
    • Why it works: Optimized queries require less temporary memory and CPU to execute. With from/size, every shard has to collect and sort from + size hits for a deep page; search_after continues from the last hit’s sort values, so each request only collects one page. Profiling reveals specific bottlenecks. Proper data types ensure efficient storage and retrieval.
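
The search_after pattern can be sketched as a helper that builds each page’s request body from the previous page’s last hit. This is a minimal sketch: the function name next_page_body and the field names in the usage note are hypothetical, and the sort should end in a unique tiebreaker field so pages do not skip or repeat hits.

```python
from typing import Any, Dict, List, Optional


def next_page_body(
    query: Dict[str, Any],
    sort: List[Dict[str, str]],
    size: int,
    source_fields: List[str],
    last_hit: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a _search body; pass the previous page's last hit to continue."""
    body: Dict[str, Any] = {
        "size": size,
        "query": query,
        "sort": sort,
        "_source": source_fields,  # return only the fields that are needed
    }
    if last_hit is not None:
        # Each hit of a sorted search carries its sort values in "sort".
        body["search_after"] = last_hit["sort"]
    return body
```

For example, with sort set to [{"@timestamp": "asc"}, {"event_id": "asc"}], the first call omits last_hit, and every later call passes hits[-1] from the previous response. For a consistent view across pages, this is usually combined with a point in time (PIT).
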
  4. High-Frequency Data Streams / Ingestion Pipelines: Complex ingest pipelines or frequent writes to data streams can generate a lot of temporary objects and state that the GC must clean up.

    • Diagnosis: Monitor GET _nodes/stats/ingest and review your ingest pipeline definitions for complex processors or loops. Check the frequency of document ingestion.
    • Fix: Simplify ingest pipelines, reduce the number of processors, and optimize them for performance. If possible, perform some transformations before sending data to Elasticsearch.
    • Why it works: Simpler pipelines create fewer temporary objects and less complex state, reducing the GC’s workload.
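
To find which pipelines deserve simplification first, the response of GET _nodes/stats/ingest can be reduced to an average processing time per document. The following is a minimal sketch, assuming the response has already been parsed into a dict; the function name slowest_pipelines is hypothetical.

```python
from typing import Dict, List, Tuple


def slowest_pipelines(node_stats: dict, top: int = 5) -> List[Tuple[str, float]]:
    """Rank pipelines by average milliseconds per document across all nodes."""
    totals: Dict[str, Tuple[int, int]] = {}
    for node in node_stats.get("nodes", {}).values():
        pipelines = node.get("ingest", {}).get("pipelines", {})
        for name, stats in pipelines.items():
            count, millis = totals.get(name, (0, 0))
            totals[name] = (
                count + stats.get("count", 0),
                millis + stats.get("time_in_millis", 0),
            )
    # Skip pipelines that have not processed any documents yet.
    averages = [
        (name, millis / count) for name, (count, millis) in totals.items() if count
    ]
    return sorted(averages, key=lambda item: item[1], reverse=True)[:top]
```

Because the counters are cumulative since node start, comparing two samples taken a few minutes apart gives a better picture of current load than a single reading.
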
  5. Too Many Open File Descriptors: While not directly a GC issue, this can lead to system instability that indirectly impacts JVM performance and can be mistaken for GC problems. Elasticsearch needs file descriptors for indices, network sockets, etc.

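    • Diagnosis: Check the nofile limit as the Elasticsearch user with ulimit -n, or read what the running process actually received from GET _nodes/stats/process by comparing open_file_descriptors with max_file_descriptors. Keep in mind that descriptor usage grows with the number of shards and segments on the node.
    • Fix: Increase the nofile limit for the Elasticsearch user. On systemd-managed installs, limits.conf is ignored and the limit comes from LimitNOFILE in the service unit. Otherwise, set it in /etc/security/limits.conf (or equivalent), for example: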
      elasticsearch soft nofile 65536
      elasticsearch hard nofile 65536
      
      Restart Elasticsearch after changing limits.
    • Why it works: Sufficient file descriptors prevent the OS from failing to open necessary files for indices or network connections, ensuring smooth operation.
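
The descriptor check can be automated from the GET _nodes/stats/process response. This is a minimal sketch, assuming the response has been parsed into a dict; the function name fd_usage is hypothetical.

```python
from typing import Dict


def fd_usage(node_stats: dict) -> Dict[str, float]:
    """Return open/max file descriptor ratio per node (0.0 to 1.0)."""
    usage: Dict[str, float] = {}
    for node_id, node in node_stats.get("nodes", {}).items():
        process = node.get("process", {})
        open_fds = process.get("open_file_descriptors")
        max_fds = process.get("max_file_descriptors")
        # Some platforms report -1 when the value is unavailable.
        if open_fds is None or max_fds is None or max_fds <= 0:
            continue
        usage[node.get("name", node_id)] = open_fds / max_fds
    return usage
```

A ratio that keeps climbing toward 1.0 suggests raising the limit or reducing the shard count before the node starts failing to open files.
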
  6. Old Generation Full GC: If the old generation of the heap is constantly filling up, the JVM will perform more aggressive "Stop-the-World" Full GCs, which are very time-consuming. This often happens when young objects are being promoted to the old generation too quickly or when old objects are not being reclaimed efficiently.

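    • Diagnosis: Use GC logging (configured in jvm.options via -Xlog:gc*; recent versions enable it by default and write to gc.log) or monitoring tools like Prometheus/Grafana that track old-generation collection count and time. Exact metric names depend on the exporter; the underlying fields in GET _nodes/stats/jvm are jvm.gc.collectors.old.collection_count and jvm.gc.collectors.old.collection_time_in_millis. Look for a high frequency and duration of Full GC events.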
    • Fix:
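      • Tuning GC Algorithm: Recent Elasticsearch versions running on the bundled JDK default to G1GC, while older setups use CMS, where G1 flags do not apply. If you have specific tuning needs on G1GC, experiment in jvm.options with parameters like -XX:MaxGCPauseMillis (a soft pause-time goal; the JVM default is 200 ms) or -XX:G1HeapRegionSize, changing one setting at a time and comparing GC logs.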
      • Object Allocation: The underlying cause is often excessive object allocation, pointing back to indexing load, search complexity, or inefficient data structures. Review points 2 and 3.
      • Heap Size: Ensure adequate heap size (point 1).
    • Why it works: MaxGCPauseMillis is a soft goal, not a guarantee: a lower value makes G1 aim for shorter pauses, usually at the cost of more frequent collections and some throughput. Optimizing object allocation reduces the pressure on the old generation. A larger heap provides more buffer.
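
Because the GC counters in node stats are cumulative, the frequency and cost of old-generation collections are easiest to see as the difference between two samples. The following is a minimal sketch, assuming two parsed responses of GET _nodes/stats/jvm taken some time apart; the function name old_gc_delta is hypothetical.

```python
from typing import Dict, Tuple


def old_gc_delta(before: dict, after: dict) -> Dict[str, Tuple[int, float]]:
    """Per node: (old-gen collections between samples, average ms per collection)."""
    result: Dict[str, Tuple[int, float]] = {}
    for node_id, node in after.get("nodes", {}).items():
        previous = before.get("nodes", {}).get(node_id)
        if previous is None:
            continue  # node joined between the two samples
        new_old = node["jvm"]["gc"]["collectors"]["old"]
        prev_old = previous["jvm"]["gc"]["collectors"]["old"]
        count = new_old["collection_count"] - prev_old["collection_count"]
        millis = (
            new_old["collection_time_in_millis"]
            - prev_old["collection_time_in_millis"]
        )
        if count < 0:
            continue  # counters went backwards, so the node restarted
        result[node_id] = (count, millis / count if count else 0.0)
    return result
```

Several old-generation collections within a short sampling window, each taking on the order of seconds, is the pattern this section describes.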

The next error you’ll likely encounter if everything else is fixed is CircuitBreakingException.
