A Galera cluster can split into two or more independent clusters that no longer synchronize, most often when the network between nodes fails.
Here’s how to prevent and recover from that dreaded "split-brain" scenario.
Prevention is Key
The most common culprit for split-brain is a network interruption that isolates nodes from each other, preventing them from receiving or sending replication traffic. This can manifest as a complete network outage between data centers, a faulty switch, or even a misconfigured firewall blocking traffic on the Galera ports (usually 3306 for MySQL client connections, 4567 for Galera replication, 4568 for IST, and 4444 for SST).
1. Network Stability and Redundancy: This sounds obvious, but it’s the bedrock. Ensure your network infrastructure between Galera nodes is robust.
- Diagnosis: Monitor network latency and packet loss between nodes. Tools like ping (with -c 100 for 100 packets) and mtr are your friends.

      ping -c 100 <node_ip>
      mtr <node_ip>

- Fix: Implement redundant network paths. Use bonded interfaces on your servers. If you’re in a multi-datacenter setup, ensure your inter-DC links are highly available and have sufficient bandwidth. This prevents a single cable or switch failure from taking down the cluster.
- Why it works: Redundant paths ensure that if one network link fails, traffic can still flow between nodes, maintaining quorum and preventing isolation.
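The packet-loss check above can be scripted so it runs on a schedule. The following is a minimal sketch, assuming ping prints an iputils-style summary line ("... 0% packet loss ..."); the function name and the peer address in the usage comment are illustrative, not part of any Galera tooling.

```shell
#!/bin/sh
# Extract the packet-loss percentage from a ping summary read on stdin.
# Assumes a summary line of the form "... received, 3% packet loss, ...".
packet_loss_pct() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Hypothetical usage against a peer node (not executed here):
#   loss=$(ping -c 100 -q 192.168.1.102 | packet_loss_pct)
#   [ "$loss" = "0" ] || echo "WARNING: ${loss}% packet loss to peer"
```

Any sustained non-zero loss between Galera nodes is worth investigating before it turns into a partition.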
2. Quorum and wsrep_cluster_address:
Galera relies on a quorum to make decisions. If a node can’t communicate with a majority (more than half) of the cluster, it drops to a non-Primary state and stops accepting writes to prevent inconsistencies. The wsrep_cluster_address setting is critical here.
- Diagnosis: Check your my.cnf (or galera.cnf) for the wsrep_cluster_address setting. It should list all nodes in the cluster.

      [galera]
      wsrep_cluster_address = "gcomm://192.168.1.101,192.168.1.102,192.168.1.103"

- Fix: Ensure this setting accurately lists all nodes intended to be in the cluster. If a node is missing, it might not know about the others and could get isolated. Restarting the node after correction is necessary.
- Why it works: This directive tells each node which other nodes it should attempt to connect to and form a cluster with. A complete list is vital for initial bootstrapping and ongoing communication.
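The majority rule is easy to get wrong for even-sized clusters, so here is a small arithmetic sketch. It assumes every node carries the default weight of 1 and ignores nodes that left the cluster gracefully, both of which Galera's real quorum calculation takes into account; has_quorum is a name made up for this illustration.

```shell
#!/bin/sh
# Succeed if a partition that can see $1 of $2 nodes holds a strict majority.
# Assumption: all nodes have equal weight.
has_quorum() {
    reachable=$1
    total=$2
    [ $((reachable * 2)) -gt "$total" ]
}

for split in "2 3" "1 3" "2 4"; do
    set -- $split
    if has_quorum "$1" "$2"; then
        echo "$1 of $2 nodes: quorum kept"
    else
        echo "$1 of $2 nodes: quorum lost"
    fi
done
# Prints:
#   2 of 3 nodes: quorum kept
#   1 of 3 nodes: quorum lost
#   2 of 4 nodes: quorum lost
```

The last case is the important one: a four-node cluster split evenly leaves neither half with a majority, which is why odd node counts are generally preferred.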
3. wsrep_sst_donor and SST Failures:
When a new node joins or a restarted node needs to catch up, it performs a State Snapshot Transfer (SST). If this process fails, it can leave a node in an inconsistent state or prevent it from joining.
- Diagnosis: Check the MySQL error logs (mysqld.log or similar) on both the donor and the joining node for SST-related errors. Look for messages indicating failed connections, timeouts, or data transfer issues.

      grep "SST failed" /var/log/mysql/mysqld.log

- Fix: Ensure the SST user has sufficient privileges (RELOAD, LOCK TABLES, PROCESS, REPLICATION CLIENT, SUPER) on the donor node. Verify that wsrep_sst_method (e.g., rsync, xtrabackup-v2) is configured correctly and that the necessary tools are installed and accessible on all nodes. If using xtrabackup, ensure it’s compatible with your MySQL version.
- Why it works: A successful SST ensures the joining node receives a consistent, up-to-date copy of the data, allowing it to rejoin the cluster without causing divergence.
4. wsrep_cluster_name Consistency:
A simple typo or omission in this parameter can prevent nodes from recognizing each other as part of the same cluster.
- Diagnosis: Verify the wsrep_cluster_name setting in my.cnf on all nodes.

      [galera]
      wsrep_cluster_name = "my_galera_cluster"

- Fix: Ensure the wsrep_cluster_name is identical across all nodes. A mismatch will cause them to form separate, non-communicating clusters. Restart all nodes after correcting.
- Why it works: This parameter acts as a unique identifier for your cluster. Nodes will only attempt to join or communicate with other nodes that share the exact same cluster name.
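A name mismatch is hard to spot by eye, so comparing the configured values mechanically can help. This is a rough sketch that assumes the option is written with underscores and an optional pair of double quotes, as in the snippet above; both function names are hypothetical, and the config files are assumed to have been copied from each node to one place.

```shell
#!/bin/sh
# Print the wsrep_cluster_name value found in a my.cnf-style file ($1),
# with optional double quotes removed. If the option appears more than
# once, the last occurrence is printed.
cluster_name_from_cnf() {
    sed -n 's/^[[:space:]]*wsrep_cluster_name[[:space:]]*=[[:space:]]*"\{0,1\}\([^"[:space:]]*\)"\{0,1\}[[:space:]]*$/\1/p' "$1" | tail -n 1
}

# Succeed only if every file passed as an argument names the same cluster.
names_match() {
    first=$(cluster_name_from_cnf "$1")
    for f in "$@"; do
        [ "$(cluster_name_from_cnf "$f")" = "$first" ] || return 1
    done
}

# Hypothetical usage, with one collected config file per node:
#   names_match node1.cnf node2.cnf node3.cnf || echo "cluster name mismatch"
```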
5. gmcast.listen_addr and gmcast.mcast_addr:
These settings relate to how Galera nodes discover each other, especially in multicast or specific network configurations. Incorrect settings here can lead to nodes being unable to find peers.
- Diagnosis: Examine wsrep_provider_options in your my.cnf for gmcast.listen_addr and gmcast.mcast_addr. Both are Galera provider options, so they are set inside the wsrep_provider_options string rather than as standalone my.cnf variables.

      [galera]
      # gmcast.mcast_addr only if using multicast
      wsrep_provider_options = "gmcast.listen_addr=tcp://0.0.0.0:4567; gmcast.mcast_addr=239.255.255.255"

  If you’re not using multicast (which is generally recommended for stability), ensure gmcast.listen_addr is set correctly and that nodes are configured to use gcomm:// with IP addresses.
- Fix: For unicast (recommended), ensure wsrep_cluster_address is correctly configured with IP addresses of peers. Disable multicast (gmcast.mcast_addr) if not explicitly needed and understood, as it can be unreliable on many modern networks. Ensure firewalls allow UDP traffic on the multicast address if it’s used.
- Why it works: These settings dictate how nodes broadcast their presence and listen for other nodes. Correct configuration ensures nodes can discover and communicate with each other.
6. innodb_flush_log_at_trx_commit and Data Integrity:
While not a direct cause of split-brain, an incorrect setting here can exacerbate data loss after a split occurs.
- Diagnosis: Check innodb_flush_log_at_trx_commit. A value of 1 is ACID compliant but can be slower. Values of 0 or 2 are faster but risk data loss on crash.

      [mysqld]
      innodb_flush_log_at_trx_commit = 1

- Fix: For Galera, innodb_flush_log_at_trx_commit = 1 is strongly recommended to ensure data integrity across nodes. If it’s set lower, and a node crashes or becomes isolated, transactions that were acknowledged but not yet flushed to disk could be lost when that node rejoins or is restarted.
- Why it works: Setting this to 1 ensures that each committed transaction’s log entry is flushed to disk synchronously, guaranteeing durability even if the server crashes immediately after acknowledging the commit.
Recovering from Split-Brain
When split-brain happens, you’ll typically see nodes stop accepting writes, and the MySQL error logs will show messages about failing to reach quorum or diverging states.
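Besides the error log, each node reports its own view through the wsrep_cluster_status status variable, which reads Primary on a node that still has quorum. The helper below is a sketch, not an official tool: it assumes tab-separated "name value" rows such as those produced by the mysql client in batch mode, and is_primary is an invented name.

```shell
#!/bin/sh
# Read "name<TAB>value" status rows on stdin and succeed only if
# wsrep_cluster_status is Primary.
is_primary() {
    awk '$1 == "wsrep_cluster_status" { status = $2 }
         END { exit (status == "Primary" ? 0 : 1) }'
}

# Hypothetical usage on a node (credentials omitted, not executed here):
#   mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'" \
#       | is_primary || echo "this node is outside the Primary Component"
```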
The General Strategy:
- Identify the "True" Cluster: Determine which of the split partitions contains the most up-to-date and correct data. This often involves looking at transaction logs, application state, or simply which partition has the majority of nodes.
- Isolate the "Bad" Partition: Stop MySQL (or Galera) on all nodes in the partition you’ve deemed incorrect. This prevents them from accepting new writes or further diverging.
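- Bootstrap the "Good" Partition: If your "true" cluster is only a subset of the original nodes and has lost quorum, it will not accept writes until you promote it to a new Primary Component. Bootstrap from the node in that partition with the most recent state; the other nodes in the partition then rejoin it.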
- Re-integrate and Re-sync: Start the nodes from the "bad" partition one by one, ensuring they perform an SST from a node in the "good" partition.
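One input for the first step is the seqno each stopped node recorded in its grastate.dat. The sketch below reads that value and picks the highest one from a list; the function names are illustrative. Treat the result as a starting point only: a seqno of -1 means the node was running or crashed (in which case mysqld --wsrep-recover can recover the position), and if both partitions accepted writes, their seqnos are no longer directly comparable.

```shell
#!/bin/sh
# Print the seqno recorded in a grastate.dat file ($1).
grastate_seqno() {
    awk '$1 == "seqno:" { print $2 }' "$1"
}

# Given "node seqno" lines on stdin, print the node with the highest seqno.
most_advanced() {
    sort -k2,2n | tail -n 1 | cut -d' ' -f1
}

# Hypothetical usage with values collected by hand from each stopped node:
#   printf 'N1 1502\nN2 1498\n' | most_advanced
```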
Specific Recovery Steps (Common Scenario: Two Partitions, A and B)
Let’s say nodes N1, N2 are in partition A, and N3, N4 are in partition B. You’ve determined A is the correct partition.
- Stop Writes on Partition B:
  - On N3:

        sudo systemctl stop mysql   # Or: sudo service mysql stop

  - On N4:

        sudo systemctl stop mysql   # Or: sudo service mysql stop

  - Why it works: This prevents any further writes from occurring on the partition that is considered "wrong," stopping data divergence.
- Bootstrap Partition A (if necessary): If N1 and N2 are still running, communicating, and in the Primary state, you might not need to do anything here. If they have dropped to non-Primary or were stopped:
  - On N1 (assuming it holds the most recent state):

        # Case 1: N1 is running but non-Primary. Promote its partition in place:
        mysql -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';"

        # Case 2: all of partition A is stopped. Bootstrap from N1's existing data.
        # Do NOT delete grastate.dat or ibdata files on this node: it holds
        # the data you are keeping.
        sudo systemctl stop mysql
        # Mark the node as safe to bootstrap (Galera 3.19 and later).
        sudo sed -i 's/^safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat
        # Start a new Primary Component. This is the MariaDB systemd helper;
        # the bootstrap command differs between distributions.
        sudo galera_new_cluster

  - Why it works: pc.bootstrap=YES tells a running non-Primary partition to declare itself the new Primary Component, and bootstrapping a stopped node starts a new Primary Component from its existing data. A node never bootstraps on its own just because it is alone: started normally, it tries to join the peers listed in wsrep_cluster_address and fails if none of them is Primary.
- Re-integrate Partition B Nodes: Now, bring N3 and N4 back online, but have them join Partition A. This is done by starting them with an SST.
  - On N3 (assuming N1 is now a healthy node in the correct cluster):

        # Ensure N3 is clean. Stop MySQL if running.
        sudo systemctl stop mysql
        # Reset N3's Galera state to ensure it performs an SST.
        # Typically, this means ensuring grastate.dat indicates it needs SST.
        # The safest way is often to remove grastate.dat and let it start fresh.
        sudo rm /var/lib/mysql/grastate.dat
        # Restart MySQL. It will see no grastate.dat and attempt to join the cluster
        # specified in wsrep_cluster_address, performing an SST.
        sudo systemctl start mysql

  - Repeat for N4.
  - Why it works: By removing grastate.dat, you signal to Galera that this node doesn’t have a valid cluster state and needs to perform a full State Snapshot Transfer (SST) from a healthy node in the cluster it’s trying to join. This overwrites its local data with a consistent copy.
- Verify and Monitor: Once all nodes are back up, check SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'; on each node. It should show the correct number of nodes. Monitor for any new errors.
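Cluster size alone does not show whether a node has finished syncing, so it can be useful to check the node's state at the same time. The following sketch assumes tab-separated status rows from the mysql client in batch mode; node_healthy is a hypothetical helper name, and 4 is the expected node count for the N1 to N4 example.

```shell
#!/bin/sh
# Read "name<TAB>value" status rows on stdin. Succeed only if the node sees
# the expected number of members ($1), is in the Primary Component, and
# reports its local state as Synced.
node_healthy() {
    awk -v want="$1" '
        $1 == "wsrep_cluster_size"        { size = $2 }
        $1 == "wsrep_cluster_status"      { status = $2 }
        $1 == "wsrep_local_state_comment" { state = $2 }
        END { exit (size == want && status == "Primary" && state == "Synced" ? 0 : 1) }
    '
}

# Hypothetical usage on each node (credentials omitted, not executed here):
#   mysql -N -B -e "SHOW GLOBAL STATUS WHERE Variable_name IN
#       ('wsrep_cluster_size','wsrep_cluster_status','wsrep_local_state_comment')" \
#       | node_healthy 4 || echo "node not fully rejoined yet"
```

A node that is still receiving its SST reports a state other than Synced, so this check fails until the transfer has completed.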
The next error you’ll hit after fixing split-brain is usually related to application-level data consistency issues if the split-brain wasn’t handled perfectly, or perhaps a resource exhaustion problem if your cluster was already under heavy load.