A distributed system can’t actually know if it’s split into two independent halves, which is why preventing split-brain is more about designing for failure than achieving perfect certainty.
Imagine you have two database servers, A and B, in a cluster. They’re supposed to work together, replicating data. If the network connection between them breaks, each server might think the other one is down and start operating independently. This is "split-brain." If a client then writes data to A, and later writes different data to B (because it thinks they’re separate), you’ve got conflicting data that’s a nightmare to reconcile.
Here’s a typical scenario. We have a primary/replica setup for a service.
{
"service_name": "my_app",
"replicas": [
{"host": "db-a.example.com", "role": "primary"},
{"host": "db-b.example.com", "role": "replica"}
],
"consistency_level": "quorum"
}
If the network between db-a and db-b fails, db-a might continue accepting writes. db-b, unable to reach db-a, might also decide it needs to become a primary (if it has the capability) or simply stop serving reads because it can’t confirm the state of the primary. The key is that there’s no single source of truth anymore.
The core problem is that network partitions are indistinguishable from node failures. If db-a stops responding, is it because db-a crashed, or because the network to db-a is broken? The system can’t tell.
To prevent this, we introduce Quorums. A quorum is a minimum number of nodes that must agree for an operation to be considered successful. If you have 3 nodes (A, B, C) and require a quorum of 2 for writes, a write only succeeds if at least 2 nodes acknowledge it. If the network splits into A on one side and B, C on the other, A cannot achieve a quorum of 2, so it stops accepting writes. B and C can achieve a quorum of 2, so they can continue operating.
Quorum Calculation:
For a cluster of N nodes, a majority quorum is floor(N/2) + 1.
- 3 nodes:
floor(3/2) + 1 = 1 + 1 = 2 - 5 nodes:
floor(5/2) + 1 = 2 + 1 = 3
Implementation Example (Conceptual - based on Raft/Paxos): In a Raft-based system, the leader (primary) must have a majority of nodes acknowledge its leadership before it can commit a log entry (write). If it loses contact with a majority of nodes, it steps down.
# Example log entry commit in a Raft-like system
# Leader (Node 1) sends AppendEntries RPC to followers (Node 2, Node 3)
# Node 1: "Log entry X committed"
# Node 2: "Acknowledged."
# Node 3: "Acknowledged."
# Node 1 now has 2/3 acknowledgments, achieving a majority quorum.
# If network to Node 3 breaks:
# Node 1: "Log entry X committed"
# Node 2: "Acknowledged."
# Node 3: (No response)
# Node 1 cannot achieve a majority quorum and will not commit the entry.
Quorums alone, however, don’t solve the problem if a node thinks it’s the primary and starts acting independently. This is where Fencing comes in. Fencing is a mechanism to ensure that only one node can act as the primary at any given time, even if it loses network connectivity to the rest of the cluster.
The most common fencing mechanism is using a distributed lock manager or a witness/arbitrator node.
1. Distributed Lock Manager (e.g., ZooKeeper, etcd): Before a node can become primary, it must acquire a lock on a specific key in the lock manager. If the network partitions, only one "partition" can successfully acquire the lock.
Diagnosis: Check if the lock in ZooKeeper/etcd is held by the expected primary.
zkCli.sh -server <zk_host>:2181 ls /mylocks/primary
Fix: If a stale lock is held, manually delete it. This is a last resort and indicates a more fundamental issue.
zkCli.sh -server <zk_host>:2181 rmr /mylocks/primary/lock-xyz
Why it works: The lock manager is designed to be highly available and uses its own internal consensus mechanism (like Raft or Zab) to ensure only one client holds a lock at a time.
2. Witness/Arbitrator Node: This is a third, non-voting node that doesn’t hold any data but participates in the quorum decision. It’s usually a lightweight process. If a node wants to become primary, it must communicate with the witness node.
Diagnosis: Check the logs of the witness node for any attempts by nodes to claim leadership.
tail -f /var/log/witness.log
Fix: Ensure the witness node is running and reachable by all potential primaries. No direct "fix" command, it’s about ensuring its availability.
Why it works: The witness acts as a tie-breaker. If the cluster splits into two nodes (A, B) and a witness (W), only the side that can communicate with W can form a quorum. If A can reach W, but B cannot, A can proceed. If neither can reach W, neither can form a quorum.
3. SCSI-3 Persistent Reservations (for shared storage): In older, shared-storage high-availability setups (like active/passive failover clusters), nodes use SCSI reservations. If Node A is primary, it reserves the shared disk. If Node B tries to take over, it can’t access the disk due to the reservation. If Node A crashes, the reservation might be lost (or timed out), allowing Node B to acquire it.
Diagnosis: Use lsscsi and sg_map on Linux to inspect SCSI devices and reservations.
lsscsi
sg_map -a /dev/sgX (where X is the relevant device)
Fix: This is usually managed by the cluster software (e.g., Pacemaker) which handles the reservation release and acquisition. Manual intervention is rare and complex, often involving sg_persist to clear reservations.
Why it works: The SCSI protocol itself enforces exclusive access to the storage device.
4. Fencing Agents/Token-Based Fencing: Many cluster managers (like Pacemaker) use fencing agents. When a node is deemed unhealthy, the cluster manager invokes an agent to "fence" it. This could mean: * Powering off the node via IPMI or a PDU. * Forcing a network reset. * Issuing a command to the storage array to revoke access.
Diagnosis: Check the cluster manager logs for fencing actions.
tail -f /var/log/cluster/corosync.log
tail -f /var/log/pacemaker/pacemaker.log
Fix: Ensure the fencing agent (e.g., fence_ipmilan, fence_vmware_soap) is correctly configured with credentials and network access to the target nodes or their management interfaces.
Why it works: It’s a more aggressive, out-of-band method to forcibly prevent a rogue node from operating.
The Combined Approach: Most robust systems use a combination. A common pattern is:
- Majority Quorum: Ensures that at least half the nodes must agree for operations.
- Fencing: Guarantees that only one node can be active as primary, typically via a lock service or a witness.
If all these mechanisms fail, the next error you’ll likely see is related to data inconsistency or a service outage as the system eventually detects conflicting states or simply halts operations to prevent further corruption.