Elasticsearch doesn’t actually have "slow query logs" in the traditional sense; instead, it logs individual search phases that take too long on a shard, giving you granular insight into where your search latency is hiding.
Let’s see this in action. Imagine you’ve got a cluster and you’re seeing some sluggishness. You want to know why. You’re not just looking for a slow query, but which part of the search is slow.
Here’s a typical request hitting your Elasticsearch cluster:
GET /my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "elasticsearch" } },
        { "term": { "status": "published" } }
      ],
      "filter": [
        { "range": { "publish_date": { "gte": "2023-01-01" } } }
      ]
    }
  },
  "aggs": {
    "articles_by_category": {
      "terms": { "field": "category.keyword" }
    }
  }
}
This query does a few things:
- Querying: It finds documents matching "elasticsearch" in the title and having status as "published."
- Filtering: It further restricts results to those published on or after January 1, 2023.
- Aggregating: It then groups the results by category.keyword and counts them.
Elasticsearch breaks down the execution of such a request into several steps on every shard. Only two of them, query and fetch, are timed and written to the search slow log; the other steps listed here do not get slow log entries of their own:
- query: This is the core search phase, where each shard traverses its inverted index to find and score matching documents. Aggregations are computed here too, so slow aggregations show up as a slow query phase.
- fetch: Once the top documents are identified, this phase retrieves the actual _source content for those documents, which can be expensive if you’re fetching many large documents.
- rewrite: Before executing the query, Elasticsearch rewrites multi-term queries (like wildcard or regexp) into simpler queries over the terms they match.
- Caching: The node query cache stores the results of frequently used filter clauses, and the shard request cache can return a whole shard-level response (by default only for requests with size 0). A cache hit is very fast.
- collapse: If you’re using the collapse feature to group hits by a certain field, that collapsing logic runs as part of the search.
- search_throttled: Searches against frozen (search-throttled) indices run on a dedicated search_throttled thread pool instead of the regular search thread pool.
Enabling Slow Search Logging
To capture these slow phases, you need to configure thresholds. In current versions they are dynamic index-level settings, so you set them with the update index settings API (or in an index template) rather than in elasticsearch.yml or through _cluster/settings. Each phase has one threshold per log level, for example index.search.slowlog.threshold.query.warn, and every threshold defaults to -1 (disabled).
Here’s how you’d log, at WARN level, any query phase taking longer than 2 seconds (2000ms) and any fetch phase taking longer than 5 seconds (5000ms):
PUT /my_index/_settings
{
  "index.search.slowlog.threshold.query.warn": "2s",
  "index.search.slowlog.threshold.fetch.warn": "5s"
}
- index.search.slowlog.threshold.query.<level>: This is the most common one. It logs when the index traversal and scoring part of a search takes longer than the specified time on a shard.
- index.search.slowlog.threshold.fetch.<level>: Logs when retrieving the actual document source (_source) for the matching hits takes too long.
Here <level> is one of warn, info, debug, or trace. There are no separate aggregation or suggest thresholds; that work is counted in the query phase.
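There is no separate switch for turning the slow log on: it is active for an index as soon as one of its thresholds is set to a non-negative value. To apply a threshold to every existing index, target all indices instead of a single one:
PUT /_all/_settings
{
  "index.search.slowlog.threshold.query.warn": "2s"
}
Indices created later will not pick this up, so put the same settings in an index template if you want them by default. To turn logging off again for a specific index, set the threshold back to -1:
PUT /my_index/_settings
{
  "index.search.slowlog.threshold.query.warn": "-1"
}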
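Because the search slow log thresholds are defined per log level on the index, one index can carry a graded set of thresholds. The request below is a minimal sketch of that idea; my_index and all of the durations are illustrative assumptions, not recommendations.

```console
PUT /my_index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.query.debug": "2s",
  "index.search.slowlog.threshold.query.trace": "500ms",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.search.slowlog.threshold.fetch.info": "800ms"
}
```

Starting with only the warn thresholds and adding lower levels temporarily while you investigate keeps the log volume manageable.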
Analyzing the Logs
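Once configured, slow search phases are written to a dedicated slow log file in the Elasticsearch log directory (with the default logging configuration, <cluster_name>_index_search_slowlog.log or its .json counterpart) rather than to the main server log. The exact fields and layout depend on the version and log format, but a plain-text entry looks roughly like this:
[2023-10-27T10:30:00,123][WARN ][i.s.s.query ] [node-1] [my_index][0] took[3.1s], took_millis[3150], total_hits[...], stats[], search_type[QUERY_THEN_FETCH], total_shards[...], source[{ "query": { ... }}], id[abcdef123456],
This tells you:
- The phase that was slow (query, taken from the logger name index.search.slowlog.query, abbreviated to i.s.s.query).
- The node where it occurred (node-1).
- The index and shard (my_index, 0).
- The time taken on that shard (3150ms).
- The search type and the number of shards the request targeted.
- The actual request body (source).
- The request ID (abcdef123456), which is the X-Opaque-Id header value if the client sent one.
Recent 8.x releases can also record the user making the request if you set index.search.slowlog.include.user to true on the index.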
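The slow log tells you which phase was slow on which shard, but not which clause inside the request was responsible. One way to dig further, offered here as a suggested workflow rather than something the slow log does for you, is to copy the logged request body and re-run it with the Profile API enabled. The sketch reuses the example query from earlier in this article.

```console
GET /my_index/_search
{
  "profile": true,
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "elasticsearch" } },
        { "term": { "status": "published" } }
      ],
      "filter": [
        { "range": { "publish_date": { "gte": "2023-01-01" } } }
      ]
    }
  },
  "aggs": {
    "articles_by_category": {
      "terms": { "field": "category.keyword" }
    }
  }
}
```

The response gains a profile section with per-shard timings for each query component and each aggregation. Profiling adds overhead of its own, so treat the numbers as relative rather than absolute.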
Common Causes and Fixes
Inefficient Query Structure:
- Diagnosis: Analyze the request body in the slow log entry. Look for wildcard queries, especially ones with a leading wildcard, or overly broad regexp queries.
- Fix: Where an exact match is what you actually need, rewrite wildcard queries to use term on keyword fields, or match on text fields. For regexp, try to make the patterns more specific or consider alternative indexing strategies. For example, change {"wildcard": {"user.name": "joh*"}} to {"term": {"user.name.keyword": "john"}} if appropriate.
- Why it works: wildcard and regexp queries often have to scan a large portion of the field’s terms to find the ones that match, leading to high CPU and I/O. A term query looks up a single term in the inverted index directly.
Large Number of Shards:
- Diagnosis: The slow log might show the slowness occurring across many shards, or you might notice high CPU/I/O on multiple nodes for a single query.
- Fix: Reduce the number of shards per index. For example, if you have an index with 100 shards and a query is slow, consider reindexing into a new index with fewer shards, say 10. The _reindex API does not accept index settings in dest, so create the destination index first with PUT /new_index { "settings": { "index.number_of_shards": 10 } } and then run POST /_reindex { "source": { "index": "old_index" }, "dest": { "index": "new_index" } }.
- Why it works: Each shard requires overhead for query execution. Spreading a query across too many shards amplifies this overhead and network latency.
Too Many Documents per Shard:
- Diagnosis: Slow logs consistently point to query or fetch phases on specific shards. Monitoring tools show high disk I/O or CPU on nodes hosting these shards.
- Fix: Increase the number of shards for new indices or reindex into an index with more shards. For instance, reindex to {"index.number_of_shards": 20}.
- Why it works: If a shard contains too many documents, the inverted index for that shard becomes very large, making traversal and document retrieval slower. More shards distribute the data and the workload.
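To check whether a few shards really are oversized, or whether an index simply has too many small ones, the cat shards API gives a quick per-shard view. The sketch below assumes the index is called my_index, as in the earlier examples; the column selection and sort order are just one reasonable choice.

```console
GET /_cat/shards/my_index?v=true&h=index,shard,prirep,docs,store,node&s=store:desc
```

The docs and store columns show the document count and on-disk size of each shard, and node shows where it lives, which you can line up with the shard numbers in the slow log entries.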
Fetching Large _source Fields:
- Diagnosis: The slow log shows high latency specifically in the fetch phase. The request might use a large size, or the documents themselves might be large.
- Fix: Use _source filtering to retrieve only necessary fields: "_source": ["field1", "field2"]. If you don’t need the _source at all, turn it off for the request or use stored_fields. For instance, "_source": false.
- Why it works: Retrieving and serializing the entire _source for many documents is I/O and network intensive. Fetching only specific fields reduces this burden.
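As a concrete illustration of the _source filtering fix, the request below returns only two fields per hit. The field names come from the example query earlier in this article; the size of 20 is an arbitrary assumption.

```console
GET /my_index/_search
{
  "size": 20,
  "_source": ["title", "publish_date"],
  "query": {
    "match": { "title": "elasticsearch" }
  }
}
```

If the hits are only needed for their IDs or for counting, "_source": false skips loading the source entirely.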
Complex Aggregations:
- Diagnosis: Slow logs consistently point to the query phase for requests that carry aggregations, which is where aggregation time is counted. Aggregations might involve many terms, deep nesting, or complex pipeline aggregations.
- Fix: Optimize aggregations. For terms aggregations on keyword fields the default is execution_hint: global_ordinals; execution_hint: map is only worth trying when the query matches very few documents. Reduce the size parameter for terms aggregations if you don’t need all buckets. Use composite aggregations to page through all buckets instead of requesting one very large terms response.
- Why it works: terms aggregations on high-cardinality fields require significant memory and CPU to sort and count. With map execution, values are read per matching document instead of building global ordinals for the whole shard, which can be cheaper when only a handful of documents match.
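For the composite aggregation suggestion, a minimal sketch of paging through every category bucket might look like this. The aggregation and field names are reused from the example query earlier in this article; the page size of 1000 is an assumption.

```console
GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "articles_by_category": {
      "composite": {
        "size": 1000,
        "sources": [
          { "category": { "terms": { "field": "category.keyword" } } }
        ]
      }
    }
  }
}
```

Each response includes an after_key object. Passing it back as "after" inside the composite block of the next request returns the following page, and you repeat until a response comes back with no buckets.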
Insufficient Hardware Resources:
- Diagnosis: Slow logs appear across many queries and phases, accompanied by high CPU, memory, or I/O utilization metrics on Elasticsearch nodes.
- Fix: Scale your Elasticsearch cluster up or out. This might mean adding more nodes, increasing RAM, or upgrading CPU.
- Why it works: The cluster simply doesn’t have enough processing power or memory bandwidth to handle the workload within acceptable timeframes.
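Before adding hardware, it can help to confirm that the nodes really are saturated. The requests below are one possible starting point, not a complete monitoring setup; the column lists are a selection chosen for this example.

```console
GET /_cat/nodes?v=true&h=name,cpu,load_1m,heap.percent,ram.percent

GET /_cat/thread_pool/search?v=true&h=node_name,name,active,queue,rejected

GET /_nodes/hot_threads
```

A growing queue or a non-zero rejected count on the search thread pool suggests searches are arriving faster than the nodes can serve them, and hot_threads shows what the busiest threads are doing at that moment.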
Mapping Issues (e.g., text fields for filtering/aggregations):
- Diagnosis: Queries involving match on text fields are slow, especially when combined with filters or aggregations. Slow logs might highlight the query phase.
- Fix: Ensure that fields used for exact matching, filtering, or aggregations are mapped as keyword. For example, if you have a tags field of type text, and you want to filter by exact tag, you should also have a tags.keyword field of type keyword in your mapping and query that. Change your query to {"term": {"tags.keyword": "important"}}.
- Why it works: text fields are analyzed (tokenized, lowercased, etc.), creating an inverted index optimized for full-text search. keyword fields are not analyzed and store exact values, making them efficient for exact matching and aggregations.
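For the mapping fix, one way to add a keyword sub-field to an existing text field and then filter on it is sketched below. It assumes an index named my_index with a tags field of type text, as in the example in the text.

```console
PUT /my_index/_mapping
{
  "properties": {
    "tags": {
      "type": "text",
      "fields": {
        "keyword": { "type": "keyword" }
      }
    }
  }
}

GET /my_index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "tags.keyword": "important" } }
      ]
    }
  }
}
```

Documents indexed before the mapping change do not have the new sub-field populated until they are reindexed, for example with POST /my_index/_update_by_query.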
After addressing these, the next error you’ll likely encounter is a circuit_breaking_exception if a request tries to use more memory than the circuit breakers allow, or a too_many_buckets_exception if your aggregations still produce too many buckets.