Your Elasticsearch cluster is showing yellow or red health, meaning some data might be unavailable or the cluster is unstable.
Common Causes and Fixes for Elasticsearch Yellow/Red Cluster Health
1. Disk Space Full on a Node
- Diagnosis: Check disk usage on each Elasticsearch node.

      curl -X GET "localhost:9200/_cat/allocation?v"

  Look for nodes with a high disk.percent value, especially those near 100%.
- Fix: Free up disk space. This could involve deleting old indices, snapshots, or log files, or adding more disk capacity.
  - Delete Old Indices:

        curl -X DELETE "localhost:9200/my-old-index-2023.01.01"

  - Why it works: Elasticsearch requires free disk space for shard operations, indexing, and flushing. By default it stops allocating new shards to a node above 85% disk usage, moves shards off the node above 90%, and marks every index with a shard on the node read-only above 95%. Shards that cannot be placed on any node stay unassigned, leading to yellow (unassigned replica) or red (unassigned primary) status.
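The disk check can be scripted so that only the nodes needing attention are listed. The following is a minimal sketch, not a definitive tool: the function name nodes_over_threshold, the ES_URL variable, and the 85% default are assumptions made for illustration, and it assumes node names contain no spaces.

```shell
#!/usr/bin/env bash
# Assumption: the cluster is reachable here (the article uses localhost:9200).
ES_URL="${ES_URL:-http://localhost:9200}"

# Reads "<node> <disk.percent>" rows on stdin and prints the nodes whose disk
# usage is at or above the threshold (first argument, default 85).
# Rows without a numeric percentage (such as the UNASSIGNED row) are skipped.
nodes_over_threshold() {
  awk -v limit="${1:-85}" '$2 ~ /^[0-9]+$/ && $2 + 0 >= limit { print $1 " " $2 "%" }'
}

# Hypothetical usage against a live cluster:
#   curl -s "$ES_URL/_cat/allocation?h=node,disk.percent" | nodes_over_threshold 85
```

Keeping the filter separate from the curl call makes it easy to try on saved output before pointing it at a production cluster.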
2. Unassigned Shards (Too Many)
- Diagnosis: Identify which shards are unassigned and why.

      curl -X GET "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED

  The unassigned.reason field is crucial. Common reasons include CLUSTER_RECOVERED, NODE_LEFT, ALLOCATION_FAILED, and DANGLING_INDEX_IMPORTED. The related status NO_VALID_SHARD_COPY is reported by the allocation explain API (GET _cluster/allocation/explain) rather than in this column.
- Fix:
  - If NODE_LEFT: A node has recently left the cluster. Elasticsearch waits for the delayed allocation timeout (index.unassigned.node_left.delayed_timeout, 1m by default) before rebuilding that node's replica shards on other nodes. Bring the node back online, or, if it is permanently gone, make sure you have properly removed it from the cluster configuration and let Elasticsearch reallocate its shards from the remaining copies.
  - If ALLOCATION_FAILED: This often points to a disk issue on the node where the shard was supposed to be allocated. Check disk space (as above) and the node logs for specific errors. Elasticsearch retries on other nodes but gives up after index.allocation.max_retries attempts (5 by default); once the cause is fixed, retry with POST _cluster/reroute?retry_failed=true.
  - If NO_VALID_SHARD_COPY: A primary shard is lost and no replica can be promoted. This is a more severe scenario. You might need to force a shard allocation if you are certain the data is on disk somewhere, but this is risky and can lose data.
  - Why it works: Elasticsearch's primary goal is to maintain data availability and integrity. Unassigned shards indicate a failure in this process, and the fixes either resolve the underlying issue preventing shard allocation or tell Elasticsearch how to proceed.
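When many shards are unassigned, a tally per reason shows which of the cases above to handle first. This is a rough sketch under stated assumptions: count_unassigned_reasons is a hypothetical helper, and it expects the same five columns requested in the diagnosis command.

```shell
#!/usr/bin/env bash
# Reads _cat/shards rows (index shard prirep state unassigned.reason) on stdin
# and prints "<count> <reason>" for each reason, most frequent first.
count_unassigned_reasons() {
  awk '$4 == "UNASSIGNED" { print $5 }' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1 " " $2 }'
}

# Hypothetical usage against a live cluster:
#   curl -s "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" \
#     | count_unassigned_reasons
#
# For the detailed explanation of one unassigned shard, the allocation explain
# API can be called without a body:
#   curl -s "localhost:9200/_cluster/allocation/explain?pretty"
```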
3. Insufficient Master Nodes (Red Health)
- Diagnosis: Check the master node count and status.

      curl -X GET "localhost:9200/_cat/master?v"

  This shows the currently elected master. Ensure you have an odd number of master-eligible nodes (usually 3 for production), that they are all healthy, and that a master has been elected.
- Fix: Ensure you have an odd number of master-eligible nodes configured (discovery.seed_hosts, cluster.initial_master_nodes). If master-eligible nodes are down and the remaining ones cannot form a quorum, the cluster can become red.
  - Why it works: Elasticsearch uses a quorum-based system for master election to prevent split-brain scenarios. A master can be elected only when a majority (more than half) of the master-eligible nodes is available. An odd count avoids ties and gives the best fault tolerance per node. If a majority cannot be formed (e.g., 2 out of 3 nodes are down), no master can be elected, leading to red health.
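The quorum arithmetic is easy to check by hand. The sketch below uses two hypothetical helpers, quorum_size and tolerated_failures, and only illustrates the majority rule; it does not query the cluster.

```shell
#!/usr/bin/env bash
# Smallest number of master-eligible nodes that forms a majority.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Number of master-eligible nodes that can fail while a majority remains.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Example: compare cluster sizes from 1 to 5 master-eligible nodes.
for n in 1 2 3 4 5; do
  echo "nodes=$n quorum=$(quorum_size "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

Going from 3 to 4 master-eligible nodes raises the quorum from 2 to 3 but still tolerates only one failure, which is why odd counts are recommended.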
4. Network Issues Between Nodes
- Diagnosis: Check connectivity and firewall rules between Elasticsearch nodes. Ensure nodes can reach each other on the transport port (default 9300).

      # From node A, try to ping node B's transport port
      telnet <node_b_ip> 9300

  Also, check Elasticsearch logs on the affected nodes for messages related to node discovery or communication timeouts.
- Fix: Resolve network connectivity issues. This might involve adjusting firewall rules, ensuring DNS resolution is correct, or fixing routing problems.
  - Why it works: Elasticsearch nodes constantly communicate with each other to share cluster state, replicate data, and elect masters. Network partitions or blocked ports prevent this essential communication, leading to nodes thinking others have left the cluster or failing to join, causing instability and red/yellow health.
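Where telnet is not installed, the same reachability test can be scripted. This is a sketch that assumes bash (for its /dev/tcp redirection) and the coreutils timeout command are available; can_reach_port and the example IP addresses are hypothetical.

```shell
#!/usr/bin/env bash
# Returns 0 if a TCP connection to host ($1) on port ($2) succeeds within the
# timeout in seconds ($3, default 3), and non-zero otherwise.
can_reach_port() {
  timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Hypothetical usage: run on one node, listing the other nodes' addresses.
#   for host in 10.0.0.11 10.0.0.12; do
#     if can_reach_port "$host" 9300; then
#       echo "$host:9300 reachable"
#     else
#       echo "$host:9300 NOT reachable"
#     fi
#   done
```

A successful connection only shows that the port is open. It does not prove that the peer is a healthy Elasticsearch node, so the logs still need checking.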
5. Indexing Backlog and Slow Writes
- Diagnosis: Monitor indexing rates and latency.

      curl -X GET "localhost:9200/_cat/indices?v&h=index,health,status,docs.count,docs.deleted,store.size,pri.store.size"
      curl -X GET "localhost:9200/_nodes/stats/indices/indexing?pretty"

  Look for high indexing rates that are not keeping up with the cluster's capacity, or increasing latency in indexing requests.
- Fix:
  - Scale Up: Add more nodes to the cluster or increase the resources (CPU, RAM, faster disks) of existing nodes.
  - Optimize Sharding: Ensure your indices are sharded appropriately. Too many small shards can overwhelm the cluster. Too few large shards can limit parallelism.
  - Bulk API: Ensure you are using the Bulk API for indexing and that your bulk requests are appropriately sized (e.g., 5-15MB per request).
  - Refresh Interval: For write-heavy workloads where near real-time search isn't critical, consider increasing the index.refresh_interval (e.g., to 30s or 60s instead of the default 1s).

        curl -X PUT "localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d'{"index": {"refresh_interval": "30s"}}'

  - Why it works: A cluster struggling to keep up with writes will experience increased load on its nodes. This can lead to slow shard recovery, indexing failures, and eventually unassigned shards as nodes become unresponsive or run out of resources. Optimizing indexing and cluster capacity directly addresses this bottleneck.
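The node stats response reports cumulative index_total and index_time_in_millis counters for each node, from which an average indexing time per document can be derived. The helper below, avg_index_ms, is a hypothetical sketch of that division; the counter values are passed in by hand rather than parsed from the JSON.

```shell
#!/usr/bin/env bash
# Average indexing time per document in milliseconds.
#   $1: number of index operations (index_total)
#   $2: time spent indexing in milliseconds (index_time_in_millis)
# Prints "n/a" when no documents were indexed.
avg_index_ms() {
  awk -v total="$1" -v millis="$2" 'BEGIN {
    if (total + 0 == 0) { print "n/a"; exit }
    printf "%.2f\n", millis / total
  }'
}

# Example with made-up counter values:
avg_index_ms 2000 5000
```

Because the counters accumulate from node start, taking two readings a few minutes apart and passing the differences gives a figure for that interval instead of a lifetime average.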
6. Corrupted Shard Data
- Diagnosis: Check Elasticsearch logs for messages indicating shard corruption or I/O errors. The unassigned.reason might also indicate ALLOCATION_FAILED with underlying I/O issues.

      # Example log message: "failed to read from index file [index/.......] on shard [X]"

- Fix: If a primary shard is corrupted and no replica exists, you might need to force the allocation of a shard copy if you know the data is partially intact on disk. However, the safest approach is often to restore from a snapshot. If replicas are also corrupted, you might need to recreate the index.
  - Why it works: Corrupted shard data means Elasticsearch cannot reliably read or write to that shard, making it unavailable. Restoring from a known good state (snapshot) or recreating the index is necessary to bring the affected data back online.
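A snapshot restore for a single index can be sketched as follows. The repository, snapshot, and index names are placeholders, restore_body and restore_index are hypothetical helpers, and the sketch assumes the damaged index has already been closed or deleted, since a restore into an open index of the same name fails.

```shell
#!/usr/bin/env bash
# Assumption: the cluster is reachable here (the article uses localhost:9200).
ES_URL="${ES_URL:-http://localhost:9200}"

# Builds the JSON request body that restores only the named index ($1) and
# leaves the cluster-wide state untouched.
restore_body() {
  printf '{"indices": "%s", "include_global_state": false}\n' "$1"
}

# Restores index $3 from snapshot $2 in repository $1 and waits for completion.
restore_index() {
  curl -s -X POST "$ES_URL/_snapshot/$1/$2/_restore?wait_for_completion=true" \
    -H 'Content-Type: application/json' \
    -d "$(restore_body "$3")"
}

# Hypothetical usage:
#   restore_index my-repo my-snapshot my-index
```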
After addressing these issues, you might encounter circuit breaker errors (circuit_breaking_exception) if your cluster was under heavy load and has now recovered enough to start processing pending requests that were previously blocked.