The Elasticsearch cluster failed to elect a master node because nodes couldn’t agree on who should lead, leading to a complete service outage.
The most common culprit is a network partition or a misconfiguration that prevents nodes from communicating with each other, especially during startup or after a node failure. Elasticsearch relies on nodes being able to reach each other on the transport port (default 9300 for inter-node communication) to form a quorum and elect a master. If nodes can’t talk, they can’t coordinate.
Cause 1: Network Connectivity Issues
- Diagnosis: On each node, run `curl -X GET "http://<node_ip>:9200/_cluster/health?pretty"` to check whether the HTTP API is reachable. Then, from one node, try to telnet to another node on the inter-node port: `telnet <other_node_ip> 9300`.
- Fix: Ensure firewalls (e.g., `ufw`, `firewalld`, security groups in cloud environments) allow traffic on port 9300 between all nodes in the cluster. For example, on Ubuntu with `ufw`: `sudo ufw allow 9300/tcp`.
- Why it works: This opens the communication channel necessary for nodes to discover and communicate with each other, enabling master election.
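The manual telnet check can be repeated for every node in one pass. The following is a minimal sketch, assuming `nc` (netcat) is installed on the node you run it from; the function name `check_port` and the script name used in the usage note are hypothetical.

```shell
# Report whether a TCP connection to host:port can be opened.
# Assumes `nc` (netcat) is installed: -z opens the connection without
# sending data, and -w 3 gives up after 3 seconds.
# Note: if nc is missing, every host is reported as UNREACHABLE.
check_port() {
  if nc -z -w 3 "$1" "$2" >/dev/null 2>&1; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 UNREACHABLE"
  fi
}

# Probe the default transport port (9300) on every host passed as an argument.
for node in "$@"; do
  check_port "$node" 9300
done
```

Saved as, say, `check_transport.sh`, it could be run from each node as `sh check_transport.sh 192.168.1.10 192.168.1.11 192.168.1.12`. A successful connection only shows that the port is open; it does not prove that the nodes share a cluster name or security configuration.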
Cause 2: Incorrect discovery.seed_hosts Configuration
- Diagnosis: Examine `elasticsearch.yml` on all nodes. Look for the `discovery.seed_hosts` setting. Ensure it lists the IP addresses or hostnames of at least two other stable nodes in the cluster.
- Fix: Update `elasticsearch.yml` to include a comma-separated list of seed hosts. For example: `discovery.seed_hosts: ["192.168.1.10", "192.168.1.11", "192.168.1.12"]`.
- Why it works: Seed hosts are the initial contact points for nodes joining the cluster. Correctly configured seed hosts allow new nodes to find existing ones and begin the discovery process.
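Comparing the discovery-related settings node by node is easier when they are pulled out of the config file. The helper below is a rough sketch: the function name `show_discovery_settings` is hypothetical, and it assumes the settings are written as flat, single-line `key: value` entries.

```shell
# Print the uncommented discovery-related settings from one config file.
# Only matches flat, single-line "key: value" entries; nested YAML or
# multi-line lists would need a real YAML parser.
show_discovery_settings() {
  grep -E '^(cluster\.name|node\.name|node\.roles|discovery\.seed_hosts|cluster\.initial_master_nodes)[[:space:]]*:' "$1"
}
```

On a package-based install the call would typically be `show_discovery_settings /etc/elasticsearch/elasticsearch.yml`; that path is an assumption and may differ on your system. Run it on each node and check that `cluster.name` matches everywhere and that the seed host lists point at reachable nodes.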
Cause 3: Incorrect cluster.initial_master_nodes Configuration
- Diagnosis: Check `elasticsearch.yml` on all nodes for the `cluster.initial_master_nodes` setting. This setting is crucial for the very first master election in a new cluster. If it’s missing or incorrect, a master might never be elected.
- Fix: For a new cluster, set `cluster.initial_master_nodes: ["node1", "node2", "node3"]` (using the `node.name` values) in `elasticsearch.yml` on all master-eligible nodes. After the cluster has formed and elected its first master, remove this setting or comment it out so it is not applied on subsequent restarts.
- Why it works: This tells the cluster which master-eligible nodes take part in the very first election. It prevents split-brain scenarios during the very first bootstrap by ensuring a majority of these designated nodes must be available.
Cause 4: Insufficient Number of Master-Eligible Nodes or Nodes in Quorum
- Diagnosis: Check the `node.roles` setting in `elasticsearch.yml`. Ensure you have at least 3 nodes with `node.roles: [ master, data, ingest ]` (or similar combinations that include `master`). Also, verify that a majority of master-eligible nodes are running and reachable. The required quorum is `floor(N / 2) + 1`, where N is the number of master-eligible nodes.
- Fix: Add more nodes with the `master` role, or ensure enough existing master-eligible nodes are online. If you have 3 master-eligible nodes, at least 2 must be available to form a quorum. If you have 5, at least 3 must be available.
- Why it works: Elasticsearch requires a quorum of master-eligible nodes to make decisions, including electing a master. This prevents a minority partition from electing a master, thus avoiding data inconsistency.
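The quorum arithmetic above can be checked with a few lines of shell. This is only an illustration of the formula; the function name `quorum` is hypothetical.

```shell
# Quorum for N master-eligible nodes: floor(N / 2) + 1.
# Shell arithmetic truncates integer division, which gives the floor here.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Show the quorum and how many master-eligible nodes can be lost
# while a master can still be elected.
for n in 1 2 3 4 5; do
  q=$(quorum "$n")
  echo "master-eligible=$n quorum=$q tolerated_failures=$(( n - q ))"
done
```

For 3 nodes this reports a quorum of 2 and for 5 nodes a quorum of 3, matching the figures above. It also shows that 4 nodes tolerate only one failure, the same as 3, which is why odd counts are the usual choice.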
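Cause 5: Election Timeout Settings Too Low
- Diagnosis: Review `elasticsearch.yml` for overrides of the election timing settings. Elasticsearch 7.x and later do not have a single `cluster.election.timeout` setting; the related settings are `cluster.election.initial_timeout`, `cluster.election.duration`, and `cluster.election.max_timeout`. If any of these is set too low, nodes might time out waiting for a master election to complete, especially on slower networks or under heavy load.
- Fix: Remove the override so the default applies, or raise the value in small steps. Example: `cluster.election.duration: 1s`. Elastic's documentation warns that changing these settings can make the cluster unstable, so treat this as a last resort.
- Why it works: A longer election timeout gives nodes more time to communicate, discover each other, and reach a consensus for master election, especially in environments with higher latency or startup delays.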
Cause 6: Corrupted nodes/0/node.lock File
- Diagnosis: In the Elasticsearch data directory on each node (usually `/var/lib/elasticsearch/`), navigate to `nodes/0/`. Check for a file named `node.lock`. If Elasticsearch is not running and this file exists, it might be stale.
- Fix: Stop all Elasticsearch instances on the affected node and confirm that no Elasticsearch process is still running. Delete the `nodes/0/node.lock` file, then restart Elasticsearch. Example: `sudo rm /var/lib/elasticsearch/nodes/0/node.lock`.
- Why it works: This lock file prevents multiple Elasticsearch instances from running on the same data directory. A stale lock file can prevent Elasticsearch from starting correctly, hindering its ability to participate in cluster operations.
After resolving these, you might next encounter "IndexNotFoundException" or unassigned shards if shards were assigned to nodes that are still offline and cannot be recovered.