The most surprising truth about MySQL primary-replica failover is that the "automatic" part is often a carefully orchestrated illusion, relying on external tools to detect failure and manually trigger promotion.

Imagine this setup: a primary MySQL server (let’s call it mysql-primary) handles all writes, and one or more replica servers (mysql-replica-1, mysql-replica-2) mirror its data for read scaling and disaster recovery. When mysql-primary goes down, we need mysql-replica-1 to seamlessly take over as the new primary.

Here’s a simplified view of the components involved.

MySQL Replication Basics:

  • Binary Log (binlog): The primary server records all data-changing events (inserts, updates, deletes) in its binary log.
  • Replication Threads:
    • I/O Thread (on replica): Connects to the primary, reads binlog events, and writes them to a relay log on the replica.
    • SQL Thread (on replica): Reads events from the relay log and applies them to the replica’s data.
  • CHANGE MASTER TO: This command on the replica tells it where to connect to the primary and which binlog file and position to start reading from.
  • SHOW MASTER STATUS: On the primary, this shows the current binlog file and position.
  • SHOW REPLICA STATUS (or SHOW SLAVE STATUS in older versions): On the replica, this shows the I/O and SQL thread states, the last read binlog file/position from the primary, and the executed binlog file/position. Crucially, Seconds_Behind_Master (or Seconds_Behind_Source) tells you how far behind the replica is.

The Failover Illusion:

"Automatic" failover typically involves:

  1. Monitoring: A separate process (like Orchestrator, MHA, or a custom script) constantly checks the health of the primary. This might involve pinging the server, checking if SHOW MASTER STATUS returns valid data, or even running a quick query.
  2. Failure Detection: When the monitor can’t reach the primary or detects an error, it flags a failure.
  3. Replica Identification: The monitor identifies a suitable replica to promote. This usually means the replica that is most caught up (lowest Seconds_Behind_Master) and healthy.
  4. Promotion: The monitor executes commands on the chosen replica to:
    • Stop the I/O and SQL threads (if they are running).
    • Discard any remaining relay log events that might be inconsistent.
    • Tell MySQL to stop replicating from the old primary and start accepting writes as a new primary. This is often done by setting read_only = 0 (if it was set) and potentially reconfiguring server_id.
  5. Reconfiguration: The monitor then updates other replicas to replicate from the new primary, and directs application traffic to it.

Setting Up a Basic Failover Scenario (Conceptual Example):

Let’s say we have two MySQL instances, db-prod-1 (primary) and db-prod-2 (replica).

On db-prod-1 (Primary):

-- Ensure binary logging is enabled
SHOW VARIABLES LIKE 'log_bin';
-- Expected output: log_bin | ON

-- Set a unique server ID (important for replication)
SHOW VARIABLES LIKE 'server_id';
-- Expected output: server_id | 1 (or some unique number)

-- Create a replication user
CREATE USER 'replicator'@'%' IDENTIFIED BY 'your_strong_password';
GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'%';
FLUSH PRIVILEGES;

-- Get the current binlog position
SHOW MASTER STATUS;
-- Example output: File | mysql-bin.000001, Position | 12345

On db-prod-2 (Replica):

-- Set a unique server ID, different from the primary
SHOW VARIABLES LIKE 'server_id';
-- Expected output: server_id | 2 (or some unique number)

-- Configure replication to point to the primary
CHANGE MASTER TO
  MASTER_HOST='db-prod-1',
  MASTER_USER='replicator',
  MASTER_PASSWORD='your_strong_password',
  MASTER_LOG_FILE='mysql-bin.000001', -- From SHOW MASTER STATUS on primary
  MASTER_LOG_POS=12345;             -- From SHOW MASTER STATUS on primary

-- Start replication threads
START SLAVE; -- Or START REPLICA;

-- Check replica status
SHOW SLAVE STATUS\G; -- Or SHOW REPLICA STATUS\G
-- Look for:
-- Slave_IO_Running: Yes
-- Slave_SQL_Running: Yes
-- Seconds_Behind_Master: 0 (or a small, stable number)

The "Automatic" Part - Using Orchestrator:

Orchestrator is a popular tool for this. You’d install it on a separate server. It scans your MySQL topology.

  1. Configuration: You’d tell Orchestrator how to connect to your MySQL instances (credentials, hostnames).
  2. Monitoring: Orchestrator continuously checks the health of db-prod-1. It might query SHOW MASTER STATUS and check Seconds_Behind_Master on replicas.
  3. Failure Detection: If db-prod-1 becomes unreachable, Orchestrator detects it.
  4. Promotion Command (Orchestrator UI/CLI): You would typically go to the Orchestrator web UI and click "Discover," then "Promote" on db-prod-2. Orchestrator then executes the necessary steps:
    • It stops replication on db-prod-2.
    • It makes db-prod-2 writable (SET GLOBAL read_only = 0;).
    • It tells other replicas (if any) to start replicating from db-prod-2 as the new primary.

The Real Mechanics of Promotion:

When Orchestrator (or similar tool) promotes db-prod-2, it’s essentially doing this on db-prod-2:

  1. STOP SLAVE; (or STOP REPLICA;) - Prevents further replication from the old primary.
  2. RESET SLAVE ALL; - Clears replication configuration. This is a strong indicator that this server is no longer a replica.
  3. SET GLOBAL read_only = 0; - Makes the server writable.
  4. SET GLOBAL server_id = <old_primary_server_id>; (Optional, but can simplify things if you want the promoted server to retain the original primary’s ID).
  5. FLUSH TABLES WITH READ LOCK; (briefly, if needed to ensure data consistency before reconfiguring other replicas).

Then, it reconfigures other replicas to point to the new primary (db-prod-2) using CHANGE MASTER TO with db-prod-2’s hostname and its current binlog position.

The key is that MySQL itself doesn’t have a built-in "failover" command. It’s a series of SQL commands executed by an external orchestrator after it determines the primary is dead.

The next hurdle you’ll typically face is managing split-brain scenarios, where network issues might cause both the old primary (if it recovers) and the new primary to accept writes independently, leading to data divergence.

Want structured learning?

Take the full Express course →