Elasticsearch aggregations can feel like magic, but when they grind to a halt on large datasets, that magic turns into a frustrating bottleneck.

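Here’s a concrete example of a slow aggregation and how we’ll speed it up. Imagine we have a dataset of millions of web server logs, and we want to count the unique IP addresses that accessed a specific page during the previous calendar day (the range below rounds to day boundaries, so it covers all of yesterday rather than a rolling 24 hours).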

GET /logs-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "response_code": 200 } },
        { "range": { "@timestamp": { "gte": "now-1d/d", "lt": "now/d" } } },
        { "term": { "request_path": "/api/users" } }
      ]
    }
  },
  "aggs": {
    "unique_ips": {
      "cardinality": {
        "field": "client_ip.keyword",
        "precision_threshold": 10000
      }
    }
  },
  "size": 0
}

This query uses a cardinality aggregation to count unique values. On a small dataset, it’s lightning fast. On terabytes of logs, it can take minutes, or even time out. The problem isn’t usually the query itself, but how Elasticsearch handles the sheer volume of data and the computational cost of unique counting across many shards.

The primary culprit for slow aggregations on large datasets is the cardinality aggregation’s computational intensity. Counting unique items (cardinality) across billions of documents, especially when distributed across many shards, requires significant memory and CPU. Per-shard counts cannot simply be added together, because the same IP address can appear on many shards. Instead, each shard builds a compact sketch of the values it has seen, and the coordinating node merges one sketch per shard into the final estimate. The larger the sketches (that is, the higher the precision_threshold) and the more shards involved, the more work both steps take.
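To see why shard results have to be merged rather than summed, here is a minimal Python sketch. It uses exact sets as a stand-in for the probabilistic sketches Elasticsearch actually uses, and the shard contents are made-up illustration data.

```python
# Hypothetical data: the client IPs seen by each of three shards.
shards = [
    {"10.0.0.1", "10.0.0.2", "10.0.0.3"},
    {"10.0.0.2", "10.0.0.3", "10.0.0.4"},
    {"10.0.0.1", "10.0.0.4", "10.0.0.5"},
]


def naive_sum(shard_sets):
    """Add up per-shard distinct counts. Wrong: IPs seen on several shards are counted repeatedly."""
    return sum(len(s) for s in shard_sets)


def merged_count(shard_sets):
    """Merge the per-shard state first, then count once."""
    merged = set()
    for s in shard_sets:
        merged |= s
    return len(merged)


print(naive_sum(shards))     # 9
print(merged_count(shards))  # 5
```

Exact sets grow with the data, which is why a real cluster ships fixed-size approximate sketches between nodes instead.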

Cause 1: Inaccurate precision_threshold

The precision_threshold in a cardinality aggregation trades memory for accuracy. Below the threshold, counts are expected to be close to exact; above it, Elasticsearch relies on a HyperLogLog++ sketch and the counts become approximate. The default is 3000 and the maximum supported value is 40000; anything higher behaves exactly like 40000. Each sketch needs roughly precision_threshold * 8 bytes, and that cost is paid per shard and per bucket when the cardinality aggregation is nested under a bucket aggregation.

  • Diagnosis: Examine your aggregation’s precision_threshold. If it’s set at or near the 40000 maximum (or higher, which has no extra effect) and you don’t need near-exact counts at that scale, this is a prime suspect.
  • Fix: Lower the precision_threshold to a value that balances accuracy with performance needs. A value between 1000 and 10000 is sufficient for many use cases.
    "cardinality": {
      "field": "client_ip.keyword",
      "precision_threshold": 1000
    }
    
  • Why it works: A lower precision_threshold gives each HyperLogLog++ sketch a smaller memory footprint, which reduces the load on each shard and the amount of data merged on the coordinating node.
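As a rough way to compare settings, the following Python sketch applies the documented rule of thumb of about precision_threshold * 8 bytes per sketch, with the 40000 cap. The function name and the buckets parameter are illustrative, and the result is an estimate, not a measurement.

```python
MAX_PRECISION_THRESHOLD = 40_000  # values above this behave like 40000
BYTES_PER_UNIT = 8                # rule of thumb: about threshold * 8 bytes per sketch


def estimated_sketch_bytes(precision_threshold, buckets=1):
    """Approximate memory for the cardinality sketches on one shard.

    `buckets` is the number of parent buckets (for example, 24 for an
    hourly date_histogram over one day); each bucket gets its own sketch.
    """
    effective = min(precision_threshold, MAX_PRECISION_THRESHOLD)
    return effective * BYTES_PER_UNIT * buckets


for threshold in (1_000, 10_000, 100_000):
    print(threshold, estimated_sketch_bytes(threshold))
```

A single top-level sketch is small either way; the setting matters most when the cardinality aggregation sits under a bucket aggregation with thousands of buckets.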

Cause 2: Insufficient Heap Size for Elasticsearch Nodes

Cardinality aggregations, particularly on large datasets, can consume a significant amount of heap memory. If your Elasticsearch nodes don’t have enough JVM heap allocated, they will struggle to process these aggregations, leading to slow performance or even OutOfMemory errors.

  • Diagnosis: Monitor your Elasticsearch nodes’ JVM heap usage. Tools like Kibana’s Stack Monitoring or external monitoring solutions (Prometheus/Grafana, Datadog) are essential. Look for sustained high heap usage (above 80-90%) during aggregation queries.
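  • Fix: Increase the JVM heap size for your Elasticsearch nodes. On recent versions, put the settings in a custom file under jvm.options.d/ (e.g., /etc/elasticsearch/jvm.options.d/heap.options); on older versions, edit the jvm.options file directly (e.g., /etc/elasticsearch/jvm.options or /usr/share/elasticsearch/config/jvm.options). Set Xms and Xmx to the same value, no more than 50% of system RAM, and keep it below the compressed object pointer threshold (around 30-32GB, lower on some systems). For example, to set it to 16GB: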
    -Xms16g
    -Xmx16g
    
    Restart your Elasticsearch nodes after making this change.
  • Why it works: More heap gives aggregations room for their intermediate data structures. That lowers garbage collection pressure, and long garbage collection pauses are what stall aggregation processing on a node that is short of memory.
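The sizing rule above can be written as a small Python helper. This is only a sketch: the 31GB cap is an assumed stand-in for the compressed object pointer threshold, which varies by system, and the function name is made up for illustration.

```python
from typing import List


def heap_options(system_ram_gb: int, cap_gb: int = 31) -> List[str]:
    """Return -Xms/-Xmx lines using half of RAM, capped at `cap_gb`.

    `cap_gb` is an assumption; check the actual compressed-oops limit
    on your own JVM before relying on it.
    """
    heap_gb = max(1, min(system_ram_gb // 2, cap_gb))
    return [f"-Xms{heap_gb}g", f"-Xmx{heap_gb}g"]


print(heap_options(32))   # ['-Xms16g', '-Xmx16g']
print(heap_options(128))  # ['-Xms31g', '-Xmx31g']
```

The remaining RAM is not wasted: the operating system uses it for the filesystem cache, which Elasticsearch depends on heavily.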

Cause 3: High Cardinality Fields in terms Aggregations

While the example uses cardinality, terms aggregations on fields with a very high number of unique values (high cardinality) can also be slow. Each shard has to build a bucket for every unique term it encounters, pick its top candidates, and send them to the coordinating node, which merges and sorts them again.

  • Diagnosis: If you’re using a terms aggregation and it’s slow, inspect the field you’re aggregating on. Use a cardinality aggregation first to estimate the number of unique values. If it’s in the millions, this is likely the problem.
  • Fix: Keep the size parameter of the terms aggregation small (the default is 10) so that only the top N terms are returned. If you need every term rather than the top N, page through them with a composite aggregation or work from a sample instead of raising size.
    "aggs": {
      "top_ips": {
        "terms": {
          "field": "client_ip.keyword",
          "size": 100
        }
      }
    }
    
  • Why it works: A smaller size means each shard returns fewer candidate buckets, so there is less data to sort, transfer, and merge on the coordinating node.
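For the case where every term is needed, here is a hedged Python sketch that builds request bodies for paging with a composite aggregation. The aggregation name all_ips, the source name ip, and the page size are illustrative choices; the after value would come from the after_key of the previous response.

```python
import json


def composite_page_request(field, page_size=1000, after_key=None):
    """Build a search body that fetches one page of buckets for `field`."""
    composite = {
        "size": page_size,
        "sources": [{"ip": {"terms": {"field": field}}}],
    }
    if after_key is not None:
        # Resume after the last bucket of the previous page.
        composite["after"] = after_key
    return {"size": 0, "aggs": {"all_ips": {"composite": composite}}}


first_page = composite_page_request("client_ip.keyword")
# Hypothetical after_key, as it would appear under aggregations.all_ips.after_key.
next_page = composite_page_request("client_ip.keyword", after_key={"ip": "10.0.0.42"})
print(json.dumps(next_page, indent=2))
```

Send each body to the _search endpoint with whatever client you use, and stop when a response comes back with no buckets.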

Cause 4: Aggregating on a text Field Instead of keyword

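Aggregations, especially cardinality and terms, are designed to work on exact values. By default, Elasticsearch rejects aggregations on a text field outright. If fielddata has been enabled on the field to get around that, the aggregation runs on the individual analyzed tokens rather than the original string and loads those tokens into heap, which produces misleading results and is often very slow.

  • Diagnosis: Check your index mapping for the field you’re aggregating on (GET /<your_index_name>/_mapping). If it’s mapped as text, this is the problem: either the request fails, or, with fielddata enabled, it aggregates on tokens.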
  • Fix: Ensure you are aggregating on the .keyword sub-field (if using dynamic mapping) or a field explicitly mapped as keyword.
    "aggs": {
      "unique_ips": {
        "cardinality": {
          "field": "client_ip.keyword"
        }
      }
    }
    
  • Why it works: The keyword field stores the entire string value as a single token, allowing for accurate and efficient aggregations on exact values.
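The mapping check can be automated. The Python sketch below takes the properties section of a mapping response and returns the field name to aggregate on. The sample mapping is hypothetical, and the helper only covers the text/keyword case discussed here, not other aggregatable types such as ip or numeric fields.

```python
def aggregatable_field(properties, field):
    """Return `field` or its keyword sub-field; raise if neither is a keyword."""
    mapping = properties[field]
    if mapping.get("type") == "keyword":
        return field
    # Dynamic mapping usually adds a `keyword` multi-field under text fields.
    for sub_name, sub_mapping in mapping.get("fields", {}).items():
        if sub_mapping.get("type") == "keyword":
            return f"{field}.{sub_name}"
    raise ValueError(f"{field!r} has no keyword mapping to aggregate on")


# Hypothetical mapping, shaped like the "properties" object of a mapping response.
properties = {
    "client_ip": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    },
    "request_path": {"type": "keyword"},
    "message": {"type": "text"},
}

print(aggregatable_field(properties, "client_ip"))     # client_ip.keyword
print(aggregatable_field(properties, "request_path"))  # request_path
```

A field like message in this sample raises a ValueError, which is the cue to add a keyword sub-field to the mapping and reindex.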

Cause 5: Too Many Shards for the Dataset Size

While more shards can help with indexing and search throughput, an excessive number of shards for a given dataset size can hurt aggregation performance. Each shard needs to do its own aggregation work, and the overhead of coordinating and merging results from too many small shards can outweigh the benefits.

  • Diagnosis: Check number_of_shards in your index settings (GET /<your_index_name>/_settings?pretty) and list the actual shard sizes (GET /_cat/shards/<your_index_name>?v). If you have hundreds or thousands of shards that each hold only a small amount of data (e.g., about 1GB per shard), this might be an issue.
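  • Fix: Consolidate your shards. This is a more involved process that usually involves reindexing your data into a new index with fewer, larger shards. For example, if you have 100 shards of 1GB each, consider reindexing into an index with 10 shards of 10GB each. The _reindex API does not accept index settings in dest, so create the destination index first, and give it a name the source pattern does not match (reindex refuses to write into an index it is also reading from). Afterwards, point your queries at the new index, for example through an alias.
    PUT /consolidated-logs
    {
      "settings": {
        "index": {
          "number_of_shards": 10,
          "number_of_replicas": 1
        }
      }
    }

    POST /_reindex
    {
      "source": { "index": "logs-*" },
      "dest": { "index": "consolidated-logs" }
    }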
    
  • Why it works: Fewer, larger shards reduce the coordination overhead and the number of merge operations Elasticsearch needs to perform, leading to faster aggregations.
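To pick number_of_shards for the new index, a back-of-the-envelope calculation is usually enough. This Python sketch divides the total primary data size by a target shard size; the 10GB default simply mirrors the example above and is an assumption to adjust for your own cluster.

```python
import math


def target_shard_count(total_primary_gb, target_shard_gb=10.0):
    """Return how many primary shards keep each shard near `target_shard_gb`."""
    if total_primary_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(total_primary_gb / target_shard_gb))


print(target_shard_count(100))  # 10
print(target_shard_count(95))   # 10
print(target_shard_count(3))    # 1
```

Size for the data you expect the index to hold, not only what it holds today, since number_of_shards cannot be changed in place afterwards.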

Cause 6: Aggregation Filter Pushed Down Inefficiently

Filters placed inside the aggregation tree (for example, in a filter aggregation) rather than in the main query can be inefficient. The aggregation still has to visit every document the query matches and test each one against the filter, even when the filter keeps only a small fraction of them.

  • Diagnosis: Examine the structure of your aggregations. If a filter or filters aggregation wraps your metrics, and its clause matches only a small share of the documents the main query returns, the aggregation is doing work that the query could have skipped.
  • Fix: Push down as much filtering as possible into the main query part of the search request. Elasticsearch is highly optimized for query-time filtering.
    GET /logs-*/_search
    {
      "query": {
        "bool": {
          "filter": [
            { "term": { "response_code": 200 } },
            { "range": { "@timestamp": { "gte": "now-1d/d", "lt": "now/d" } } },
            { "term": { "request_path": "/api/users" } }
          ]
        }
      },
      "aggs": {
        "unique_ips": {
          "cardinality": {
            "field": "client_ip.keyword",
            "precision_threshold": 1000
          }
        }
      },
      "size": 0
    }
    
    (Note: This is the same as the original example, emphasizing that the main query filter is the most efficient place.)
  • Why it works: Clauses in the query’s filter context skip scoring, can be cached, and shrink the document set before any aggregation runs, so the aggregation only touches documents that actually matter.
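The rewrite can be expressed mechanically. The Python sketch below moves the clause of a single top-level filter aggregation into query.bool.filter and promotes its sub-aggregations. It is deliberately narrow: it assumes the filter aggregation is the only top-level aggregation (otherwise hoisting would change the results of its siblings) and that the query is either absent or a bool query. The function and aggregation names are illustrative.

```python
import copy


def hoist_filter_agg(body, agg_name):
    """Return a new search body with the filter aggregation folded into the query."""
    new_body = copy.deepcopy(body)
    aggs = new_body.get("aggs", {})
    if list(aggs) != [agg_name] or "filter" not in aggs[agg_name]:
        raise ValueError("expected exactly one top-level filter aggregation")
    if "query" in new_body and set(new_body["query"]) != {"bool"}:
        raise ValueError("only an absent query or a bool query is supported")

    wrapper = aggs[agg_name]
    bool_query = new_body.setdefault("query", {}).setdefault("bool", {})
    existing = bool_query.get("filter", [])
    if isinstance(existing, dict):  # a single clause may be given without a list
        existing = [existing]
    bool_query["filter"] = existing + [wrapper["filter"]]

    # Promote the sub-aggregations to the top level.
    sub_aggs = wrapper.get("aggs", {})
    if sub_aggs:
        new_body["aggs"] = sub_aggs
    else:
        del new_body["aggs"]
    return new_body


before = {
    "size": 0,
    "query": {"bool": {"filter": [{"term": {"response_code": 200}}]}},
    "aggs": {
        "users_page": {
            "filter": {"term": {"request_path": "/api/users"}},
            "aggs": {"unique_ips": {"cardinality": {"field": "client_ip.keyword"}}},
        }
    },
}
after = hoist_filter_agg(before, "users_page")
print(after["query"]["bool"]["filter"])
```

Note that the response shape changes with this rewrite: unique_ips is returned at the top level of aggregations instead of under users_page.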

After addressing these common causes, your aggregation performance should improve dramatically. The next hurdle you’ll likely encounter is dealing with scenarios where even optimized aggregations are too slow for real-time dashboards, pushing you towards solutions like Elasticsearch’s Transforms or pre-aggregating data.
