The Elasticsearch cluster failed to elect a master node because the nodes couldn’t agree on which one should lead, resulting in a complete service outage.

The most common culprit is network partition or misconfiguration preventing nodes from communicating with each other, especially during startup or after a node failure. Elasticsearch relies on nodes being able to reach each other via specific ports (default 9300 for inter-node communication) to form a quorum and elect a master. If nodes can’t talk, they can’t coordinate.

Cause 1: Network Connectivity Issues

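  • Diagnosis: On each node, run curl -X GET "http://<node_ip>:9200/_cluster/health?pretty" to check whether the HTTP API responds (quote the URL so the shell does not interpret the ?). While no master is elected, this request typically returns a master_not_discovered_exception, which confirms the symptom but not the cause. Then, from one node, test the inter-node (transport) port on another node: telnet <other_node_ip> 9300. Repeat for each pair of nodes, since firewall rules are often asymmetric.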
  • Fix: Ensure firewalls (e.g., ufw, firewalld, security groups in cloud environments) allow traffic on port 9300 between all nodes in the cluster. For example, on Ubuntu with ufw: sudo ufw allow 9300/tcp.
  • Why it works: This opens the communication channel necessary for nodes to discover and communicate with each other, enabling master election.
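
The telnet check above becomes tedious with more than a few nodes, so it can be scripted. The following is a minimal Python sketch, assuming the default transport port 9300; the helper names can_connect and unreachable_peers and the addresses in the usage comment are illustrative and not part of Elasticsearch.

```python
import socket


def can_connect(host, port=9300, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def unreachable_peers(hosts, port=9300, timeout=3.0):
    """Return the subset of hosts whose transport port cannot be reached."""
    return [host for host in hosts if not can_connect(host, port, timeout)]


# Example usage (hypothetical addresses); run this from every node in turn:
# print(unreachable_peers(["192.168.1.10", "192.168.1.11", "192.168.1.12"]))
```

A successful TCP connection only shows that the port is open. It does not prove that the listener is an Elasticsearch node belonging to the same cluster.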

Cause 2: Incorrect discovery.seed_hosts Configuration

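  • Diagnosis: Examine elasticsearch.yml on all nodes. Look for the discovery.seed_hosts setting. Ensure it lists the IP addresses or hostnames of the master-eligible nodes in the cluster, and that each entry resolves and is reachable from the node being checked. An entry without a port uses the default transport port (9300).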
  • Fix: Update elasticsearch.yml to include a comma-separated list of seed hosts. For example: discovery.seed_hosts: ["192.168.1.10", "192.168.1.11", "192.168.1.12"].
  • Why it works: Seed hosts are the initial contact points for nodes joining the cluster. Correctly configured seed hosts allow new nodes to find existing ones and begin the discovery process.
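
To see which host and port each seed entry actually points at, the entries can be normalized in a few lines. This is a rough Python sketch, assuming entries are hostnames, IPv4 addresses, or bracketed IPv6 addresses, and that the transport port is the default 9300 when none is given; split_seed_host and seed_targets are hypothetical helper names.

```python
DEFAULT_TRANSPORT_PORT = 9300  # assumed default; adjust if transport.port is overridden


def split_seed_host(entry, default_port=DEFAULT_TRANSPORT_PORT):
    """Split a seed host entry into (host, port), applying the default port.

    Handles "host", "host:port", "[ipv6]" and "[ipv6]:port".
    Bare (unbracketed) IPv6 addresses are not supported by this sketch.
    """
    entry = entry.strip()
    host, sep, port = entry.rpartition(":")
    if sep and host and port.isdigit():
        return host, int(port)
    return entry, default_port


def seed_targets(seed_hosts):
    """Normalize a list of discovery.seed_hosts entries to (host, port) pairs."""
    return [split_seed_host(entry) for entry in seed_hosts]


# Example usage with the addresses from the fix above:
# seed_targets(["192.168.1.10", "192.168.1.11", "192.168.1.12"])
```

The resulting pairs are the addresses a joining node will try to contact, so each one should accept connections from every other node.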

Cause 3: Incorrect cluster.initial_master_nodes Configuration

  • Diagnosis: Check elasticsearch.yml on all nodes. This setting is crucial for the very first master election in a new cluster. If it’s missing or incorrect, a master might never be elected.
  • Fix: For a new cluster, set cluster.initial_master_nodes: ["node1", "node2", "node3"] in elasticsearch.yml on every master-eligible node, using the exact node.name values (a short hostname and a fully qualified name do not match) and the same list everywhere. After the cluster has formed and elected its first master, remove this setting or comment it out; it should not be present when restarting nodes or adding new nodes to an existing cluster.
  • Why it works: This setting defines the set of master-eligible nodes whose votes count in the very first election. It prevents split-brain scenarios during the initial bootstrap because a majority of these designated nodes must be available before a master is elected.
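
A quick consistency check can catch mismatched lists before bootstrapping. The following Python sketch assumes you have collected, by hand or with your own tooling, each node’s node.name and its cluster.initial_master_nodes list into a dictionary; the function name and input shape are hypothetical.

```python
def check_initial_master_nodes(configs):
    """Report inconsistencies in cluster.initial_master_nodes across nodes.

    configs maps each node.name to the list configured on that node.
    Returns a list of human-readable problem descriptions (empty if none).
    """
    problems = []
    known_names = set(configs)
    reference = None  # the first node's list is used as the baseline

    for name, initial_nodes in configs.items():
        configured = set(initial_nodes)
        if reference is None:
            reference = configured
        elif configured != reference:
            problems.append(f"{name}: list differs from the first node's list")

        # Every listed name must exactly match some node's node.name.
        unknown = configured - known_names
        if unknown:
            problems.append(f"{name}: unknown node names {sorted(unknown)}")

    return problems


# Example usage:
# check_initial_master_nodes({
#     "node1": ["node1", "node2", "node3"],
#     "node2": ["node1", "node2", "node3"],
#     "node3": ["node1", "node2", "node3"],
# })
```

An empty result means every node uses the same list and every listed name matches a known node.name.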

Cause 4: Insufficient Number of Master-Eligible Nodes or Nodes in Quorum

  • Diagnosis: Check the node.roles setting in elasticsearch.yml. Ensure you have at least 3 nodes with node.roles: [ master, data, ingest ] (or a similar combination that includes master). Also, verify that a majority of master-eligible nodes are running and reachable. The required quorum is floor(N / 2) + 1, where N is the number of master-eligible nodes.
  • Fix: Add more nodes with the master role, or ensure enough existing master-eligible nodes are online. If you have 3 master-eligible nodes, at least 2 must be available to form a quorum. If you have 5, at least 3 must be available.
  • Why it works: Elasticsearch requires a quorum of master-eligible nodes to make decisions, including electing a master. This prevents a minority partition from electing a master, thus avoiding data inconsistency.
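
The quorum arithmetic above can be written out as a small Python sketch. It is a simplification that treats every master-eligible node as a voter; the function names are illustrative.

```python
def quorum_size(master_eligible):
    """Smallest majority of master-eligible nodes: floor(N / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def can_elect_master(master_eligible, available):
    """True if enough master-eligible nodes are available to form a quorum."""
    return available >= quorum_size(master_eligible)


# Example usage:
# quorum_size(3) gives 2 and quorum_size(5) gives 3, matching the fix above.
# can_elect_master(3, 1) is False: a single surviving node cannot elect a master.
```

Note that two master-eligible nodes also need a quorum of 2, so a two-node setup tolerates no failures. This is why three is the usual minimum.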

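Cause 5: Election Timeout Settings Too Low

  • Diagnosis: Review elasticsearch.yml for overrides of the election timing settings. There is no single cluster.election.timeout setting (an unrecognized setting in elasticsearch.yml prevents the node from starting at all); election timing is controlled by cluster.election.initial_timeout, cluster.election.back_off_time, cluster.election.max_timeout, and cluster.election.duration. If any of these have been lowered aggressively, elections can repeatedly fail before nodes have exchanged votes, especially on slower networks or under heavy load.
  • Fix: Remove the overrides so the defaults apply, or raise the values. Example: cluster.election.max_timeout: 30s. These are expert-level settings, so change them only after ruling out the causes above.
  • Why it works: Longer election timeouts give nodes more time to communicate, discover each other, and reach a consensus for master election, especially in environments with higher latency or startup delays.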

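Cause 6: Stale nodes/0/node.lock File

  • Diagnosis: In the Elasticsearch data directory (usually /var/lib/elasticsearch/), navigate to nodes/0/ and look for a file named node.lock. This path applies to older releases; newer versions may keep node.lock directly in the data path. Also check the Elasticsearch log for a "failed to obtain node locks" error. The file existing is not a problem by itself; it matters only when the node refuses to start because it cannot acquire the lock.
  • Fix: Stop all Elasticsearch instances on the affected node and confirm that no Elasticsearch process is still running against that data directory. Check that the directory is owned and writable by the Elasticsearch user. If the node still cannot obtain the lock, delete the node.lock file and restart Elasticsearch. Example: sudo rm /var/lib/elasticsearch/nodes/0/node.lock.
  • Why it works: This lock file prevents multiple Elasticsearch instances from using the same data directory. A lock that cannot be acquired stops Elasticsearch from starting, which keeps the node from participating in discovery and master election.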

After resolving these issues, you might next encounter an "IndexNotFoundException", or unassigned shards, if shards were assigned to nodes that are still offline and cannot recover.
