A Galera cluster can split into two or more partitions that no longer synchronize with each other, most often after a network failure.

Here’s how to prevent and recover from that dreaded "split-brain" scenario.

Prevention is Key

The most common culprit for split-brain is a network interruption that isolates nodes from each other, preventing them from receiving or sending replication traffic. This can manifest as a complete network outage between data centers, a faulty switch, or even a misconfigured firewall blocking traffic on the Galera ports (usually 3306 for MySQL client connections, 4567 for Galera replication, 4568 for Incremental State Transfer, and 4444 for State Snapshot Transfer).
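To rule out the firewall case, you can probe each port from one node to another. The following is a minimal sketch, assuming bash with /dev/tcp support and the coreutils timeout command are available. check_port and GALERA_PORTS are hypothetical names chosen for illustration, and the IP in the usage comment is only an example address.

```shell
# Hypothetical helper: succeeds if a TCP connection to host ($1) on port ($2)
# can be opened within 3 seconds. Relies on bash's /dev/tcp pseudo-device.
check_port() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Commonly used ports: 3306 (MySQL clients), 4567 (Galera replication),
# 4568 (IST), 4444 (SST).
GALERA_PORTS="3306 4567 4568 4444"

# Usage (run from one node against a peer):
#   for p in $GALERA_PORTS; do
#       check_port 192.168.1.101 "$p" && echo "$p open" || echo "$p unreachable"
#   done
```

Ports 4568 and 4444 normally listen only while a state transfer is in progress, so a failed probe on those two is not conclusive by itself.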

1. Network Stability and Redundancy: This sounds obvious, but it’s the bedrock. Ensure your network infrastructure between Galera nodes is robust.

  • Diagnosis: Monitor network latency and packet loss between nodes. Tools like ping (with -c 100 for 100 packets) and mtr are your friends.
    ping -c 100 <node_ip>
    mtr <node_ip>
    
  • Fix: Implement redundant network paths. Use bonded interfaces on your servers. If you’re in a multi-datacenter setup, ensure your inter-DC links are highly available and have sufficient bandwidth. This prevents a single cable or switch failure from taking down the cluster.
  • Why it works: Redundant paths ensure that if one network link fails, traffic can still flow between nodes, maintaining quorum and preventing isolation.
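The ping check above can be turned into a small recurring test. This is a sketch only, assuming the Linux iputils ping summary format ("N% packet loss"). parse_loss and peer_loss are hypothetical helper names, and the IPs in the usage comment are example addresses.

```shell
# Hypothetical helper: extract the packet-loss percentage from ping's
# summary line, read on stdin.
parse_loss() {
    sed -n 's/.* \([0-9.][0-9.]*\)% packet loss.*/\1/p'
}

# Hypothetical helper: ping a peer 100 times quietly and print the loss percentage.
peer_loss() {
    ping -c 100 -q "$1" | parse_loss
}

# Usage: warn on any loss to the peers.
#   for ip in 192.168.1.101 192.168.1.102 192.168.1.103; do
#       loss=$(peer_loss "$ip")
#       [ "$loss" = "0" ] || echo "WARNING: ${loss}% packet loss to $ip"
#   done
```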

2. Quorum and wsrep_cluster_address: Galera relies on a quorum to make decisions. If a node can’t communicate with a majority of the cluster, it drops to a non-Primary state and stops accepting writes to prevent inconsistencies. The wsrep_cluster_address setting is critical here.

  • Diagnosis: Check your my.cnf (or galera.cnf) for the wsrep_cluster_address setting. It should list all nodes in the cluster.
    [galera]
    wsrep_cluster_address = "gcomm://192.168.1.101,192.168.1.102,192.168.1.103"
    
  • Fix: Ensure this setting accurately lists all nodes intended to be in the cluster. If a node is missing, it might not know about the others and could get isolated. Restarting the node after correction is necessary.
  • Why it works: This directive tells each node which other nodes it should attempt to connect to and form a cluster with. A complete list is vital for initial bootstrapping and ongoing communication.
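The majority rule is easy to reason about with a little arithmetic. The sketch below is a simplified model, assuming every node carries the default weight of 1 (Galera's real calculation is weighted and also accounts for nodes that left gracefully). has_quorum is a hypothetical helper name.

```shell
# Hypothetical helper: does a partition of $1 reachable nodes, out of $2
# total nodes, hold a strict majority? Assumes all node weights are equal.
has_quorum() {
    if [ $(( $1 * 2 )) -gt "$2" ]; then
        echo yes
    else
        echo no
    fi
}

has_quorum 2 3   # → yes (two of three nodes can still see each other)
has_quorum 1 3   # → no  (the isolated node)
has_quorum 2 4   # → no  (an even split leaves neither half with a majority)
```

The last case is why an odd number of nodes is usually preferred: with four nodes split two and two, neither side can continue on its own.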

3. wsrep_sst_donor and SST Failures: When a new node joins or a restarted node needs to catch up, it performs a State Snapshot Transfer (SST). If this process fails, it can leave a node in an inconsistent state or prevent it from joining.

  • Diagnosis: Check the MySQL error logs (mysqld.log or similar) on both the donor and the joining node for SST-related errors. Look for messages indicating failed connections, timeouts, or data transfer issues.
    grep "SST failed" /var/log/mysql/mysqld.log
    
  • Fix: Ensure the SST user has sufficient privileges (RELOAD, LOCK TABLES, PROCESS, REPLICATION CLIENT, SUPER) on the donor node. Verify that wsrep_sst_method (e.g., rsync, xtrabackup-v2) is configured correctly and that the necessary tools are installed and accessible on all nodes. If using xtrabackup, ensure it’s compatible with your MySQL version.
  • Why it works: A successful SST ensures the joining node receives a consistent, up-to-date copy of the data, allowing it to rejoin the cluster without causing divergence.
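As an illustration of the privilege fix, the sketch below prints the SQL for a dedicated SST user. It assumes a server version that supports CREATE USER IF NOT EXISTS and an SST method that authenticates with a database user. sst_user_sql, the user name, and the password are hypothetical placeholders, and wsrep_sst_auth is not used by every distribution or version, so check yours before relying on it.

```shell
# Hypothetical helper: print SQL that creates an SST user ($1) with
# password ($2) and grants the privileges listed in the fix above.
sst_user_sql() {
    cat <<EOF
CREATE USER IF NOT EXISTS '$1'@'localhost' IDENTIFIED BY '$2';
GRANT RELOAD, LOCK TABLES, PROCESS, REPLICATION CLIENT, SUPER ON *.* TO '$1'@'localhost';
EOF
}

# Usage on the donor (placeholder credentials):
#   sst_user_sql sst_user 'change_me' | sudo mysql
# Then reference the same credentials in my.cnf on every node:
#   wsrep_sst_auth = "sst_user:change_me"
```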

4. wsrep_cluster_name Consistency: A simple typo or omission in this parameter can prevent nodes from recognizing each other as part of the same cluster.

  • Diagnosis: Verify the wsrep_cluster_name setting in my.cnf on all nodes.
    [galera]
    wsrep_cluster_name = "my_galera_cluster"
    
  • Fix: Ensure the wsrep_cluster_name is identical across all nodes. A mismatch will cause them to form separate, non-communicating clusters. Restart all nodes after correcting.
  • Why it works: This parameter acts as a unique identifier for your cluster. Nodes will only attempt to join or communicate with other nodes that share the exact same cluster name.
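Rather than eyeballing each my.cnf, you can compare the value every running node actually loaded. This is a minimal sketch, assuming SSH access to each host and a working local mysql client there. names_consistent is a hypothetical helper, and the IPs are example addresses.

```shell
# Hypothetical helper: read one cluster name per line on stdin and
# succeed only if every line is identical (and at least one line exists).
names_consistent() {
    [ "$(sort -u | wc -l | tr -d ' ')" -eq 1 ]
}

# Usage: collect the loaded value from each node and compare.
#   for h in 192.168.1.101 192.168.1.102 192.168.1.103; do
#       ssh "$h" "mysql -N -B -e 'SELECT @@wsrep_cluster_name'"
#   done | names_consistent || echo "wsrep_cluster_name differs between nodes"
```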

5. gmcast.listen_addr and gmcast.mcast_addr: These settings relate to how Galera nodes discover each other, especially in multicast or specific network configurations. Incorrect settings here can lead to nodes being unable to find peers.

  • Diagnosis: Examine your my.cnf for gmcast.listen_addr and gmcast.mcast_addr. Both are Galera provider options, so they are set inside wsrep_provider_options rather than as standalone lines.
    [galera]
    wsrep_provider_options = "gmcast.listen_addr=tcp://0.0.0.0:4567; gmcast.mcast_addr=239.255.255.255" # include gmcast.mcast_addr only if using multicast
    
    If you’re not using multicast (unicast is generally recommended for stability), ensure gmcast.listen_addr is set correctly and that nodes are configured to use gcomm:// with IP addresses.
  • Fix: For unicast (recommended), ensure wsrep_cluster_address is correctly configured with IP addresses of peers. Disable multicast (gmcast.mcast_addr) if not explicitly needed and understood, as it can be unreliable on many modern networks. Ensure firewalls allow UDP traffic on the multicast address if it’s used.
  • Why it works: These settings dictate how nodes broadcast their presence and listen for other nodes. Correct configuration ensures nodes can discover and communicate with each other.

6. innodb_flush_log_at_trx_commit and Data Integrity: While not a direct cause of split-brain, an incorrect setting here can exacerbate data loss after a split occurs.

  • Diagnosis: Check innodb_flush_log_at_trx_commit. A value of 1 is ACID compliant but can be slower. Values of 0 or 2 are faster but risk data loss on crash.
    [mysqld]
    innodb_flush_log_at_trx_commit = 1
    
  • Fix: Set innodb_flush_log_at_trx_commit = 1 when you need full durability on every node. Many Galera deployments relax it to 0 or 2 for throughput and rely on replication to the other nodes for durability, but that is a trade-off: with a lower setting, if a node crashes or becomes isolated, transactions that were acknowledged but not yet flushed to disk could be lost when that node rejoins or is restarted.
  • Why it works: Setting this to 1 ensures that each committed transaction’s log entry is flushed to disk synchronously, guaranteeing durability even if the server crashes immediately after acknowledging the commit.

Recovering from Split-Brain

When split-brain happens, you’ll typically see nodes stop accepting writes, wsrep_cluster_status will report non-Primary on the affected nodes, and the MySQL error logs will show messages about failing to reach quorum or diverging states.

The General Strategy:

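  1. Identify the "True" Cluster: Determine which of the split partitions contains the most up-to-date and correct data. This often involves comparing the last committed sequence number on each node (wsrep_last_committed on a running node, or seqno in grastate.dat on a stopped one), checking application state, or simply seeing which partition has the majority of nodes.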
  2. Isolate the "Bad" Partition: Stop MySQL (or Galera) on all nodes in the partition you’ve deemed incorrect. This prevents them from accepting new writes or further diverging.
  3. Bootstrap the "Good" Partition: If your "true" partition is only a subset of the original nodes and has lost quorum, it will not accept writes until one of its nodes is bootstrapped as the new Primary Component. Once one node is in a good state, the others join it.
  4. Re-integrate and Re-sync: Start the nodes from the "bad" partition one by one, ensuring they perform an SST from a node in the "good" partition.
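For step 1, the seqno comparison on stopped nodes can be scripted. The sketch below assumes the usual grastate.dat layout with a "seqno:" line and the default data directory /var/lib/mysql. read_seqno and most_advanced are hypothetical helper names.

```shell
# Hypothetical helper: print the seqno recorded in a grastate.dat file ($1).
read_seqno() {
    awk '$1 == "seqno:" { print $2 }' "$1"
}

# Hypothetical helper: given "node seqno" lines on stdin, print the node
# with the highest seqno.
most_advanced() {
    sort -k2,2n | tail -n 1 | awk '{ print $1 }'
}

# Usage: gather seqno from each stopped node, then compare.
#   read_seqno /var/lib/mysql/grastate.dat
#   printf 'N1 1500\nN2 1498\n' | most_advanced
```

A seqno of -1 means the node is still running or did not shut down cleanly. In that case the position has to be recovered first (for example with mysqld --wsrep-recover) before the comparison is meaningful.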

Specific Recovery Steps (Common Scenario: Two Partitions, A and B)

Let’s say nodes N1, N2 are in partition A, and N3, N4 are in partition B. You’ve determined A is the correct partition.

  1. Stop Writes on Partition B:

    • On N3:
      sudo systemctl stop mysql
      # Or: sudo service mysql stop
      
    • On N4:
      sudo systemctl stop mysql
      # Or: sudo service mysql stop
      
    • Why it works: This prevents any further writes from occurring on the partition that is considered "wrong," stopping data divergence.
  2. Bootstrap Partition A (if necessary): If N1 and N2 are still running, still communicating, and report wsrep_cluster_status as Primary, you don’t need to do anything here. If they dropped to non-Primary or were stopped:

    • On N1 (assuming it holds the most recent data):
      # Case 1: N1 and N2 are running but non-Primary.
      # Promote this partition to Primary in place, without a restart.
      mysql -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';"
      # Case 2: all nodes in partition A are stopped.
      # Compare seqno in grastate.dat on N1 and N2 and bootstrap from the higher one.
      sudo cat /var/lib/mysql/grastate.dat
      # On that node only, set "safe_to_bootstrap: 1" in grastate.dat, then start it
      # as a new cluster (galera_new_cluster on MariaDB; other distributions have an
      # equivalent, e.g. starting mysqld with --wsrep-new-cluster).
      sudo galera_new_cluster
      # Do NOT delete grastate.dat or ibdata files here: this node holds the data you are keeping.
      
    • Why it works: pc.bootstrap=YES tells a running non-Primary node to form a new Primary Component from the nodes it can currently reach. Bootstrapping a stopped node starts a new cluster from its existing data, and the remaining nodes then join that cluster instead of forming their own.
  3. Re-integrate Partition B Nodes: Now, bring N3 and N4 back online, but have them join Partition A. This is done by starting them with an SST.

    • On N3 (assuming N1 is now a healthy node in the correct cluster):
      # Ensure N3 is clean. Stop MySQL if running.
      sudo systemctl stop mysql
      # Reset N3's Galera state to ensure it performs an SST.
      # Typically, this means ensuring grastate.dat indicates it needs SST.
      # The safest way is often to remove grastate.dat and let it start fresh.
      sudo rm /var/lib/mysql/grastate.dat
      # Restart MySQL. It will see no grastate.dat and attempt to join the cluster
      # specified in wsrep_cluster_address, performing an SST.
      sudo systemctl start mysql
      
    • Repeat for N4.
    • Why it works: By removing grastate.dat, you signal to Galera that this node doesn’t have a valid cluster state and needs to perform a full State Snapshot Transfer (SST) from a healthy node in the cluster it’s trying to join. This overwrites its local data with a consistent copy.
  4. Verify and Monitor: Once all nodes are back up, check SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'; on each node. It should show the correct number of nodes. Also confirm that wsrep_cluster_status is Primary and wsrep_local_state_comment is Synced. Monitor for any new errors.
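The verification step can be wrapped in a small check that you run on every node. This is a sketch, assuming the mysql client can connect without extra options and that a healthy node reports wsrep_cluster_status as Primary and wsrep_local_state_comment as Synced. status_value and node_healthy are hypothetical helper names.

```shell
# Hypothetical helper: pull one value out of tab-separated
# "name<TAB>value" status rows read on stdin.
status_value() {
    awk -F '\t' -v key="$1" '$1 == key { print $2 }'
}

# Hypothetical helper: given status rows on stdin and the expected cluster
# size as $1, succeed only if size, cluster status, and local state all look healthy.
node_healthy() {
    rows="$(cat)"
    [ "$(printf '%s\n' "$rows" | status_value wsrep_cluster_size)" = "$1" ] &&
    [ "$(printf '%s\n' "$rows" | status_value wsrep_cluster_status)" = "Primary" ] &&
    [ "$(printf '%s\n' "$rows" | status_value wsrep_local_state_comment)" = "Synced" ]
}

# Usage on each node (4 is the node count in the N1..N4 example):
#   mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'" | node_healthy 4 \
#       && echo "healthy" || echo "needs attention"
```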

The next error you’ll hit after fixing split-brain is usually related to application-level data consistency issues if the split-brain wasn’t handled perfectly, or perhaps a resource exhaustion problem if your cluster was already under heavy load.
