MySQL replication lag is a symptom of the replica server failing to apply events from the primary server’s binary log as quickly as the primary server is generating them.
Common Causes and Fixes for MySQL Replication Lag
1. Network Latency/Bandwidth Saturation
- Diagnosis: Use ping and iperf3 between the primary and replica to check for high latency or low throughput. iperf3 needs a server process (iperf3 -s) listening on the replica before the client command can connect.
    ping <replica_ip>
    iperf3 -c <replica_ip>
- Fix: If network issues are identified, work with your network team to increase bandwidth or reduce latency. This might involve upgrading network hardware, optimizing routing, or moving the servers physically closer together.
- Why it works: Faster, more reliable data transfer between primary and replica ensures the replica receives binlog events promptly.
2. Under-provisioned Replica Server Resources (CPU/RAM)
- Diagnosis: Monitor the replica server’s CPU and RAM utilization. High CPU usage (consistently > 80%) or swapping indicates resource starvation.
    top    # or htop
    free -m
- Fix: Upgrade the replica server’s hardware (more CPU cores, more RAM) or optimize its configuration.
    # my.cnf on replica
    innodb_buffer_pool_size = 4G   # adjust to available RAM, typically 70-80% on a dedicated database host
    innodb_log_file_size = 512M    # larger can help with write throughput (MySQL 8.0.30+ uses innodb_redo_log_capacity instead)
    max_connections = 500          # ensure enough for your workload
- Why it works: A more powerful replica server can process I/O and CPU-bound tasks faster, including applying binlog events.
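The 70-80% guideline can be turned into a starting value with a little arithmetic. The following is a minimal sketch, not a tuning tool: the helper name suggest_buffer_pool_size is hypothetical, the rounding step assumes InnoDB's default 128 MiB buffer pool chunk size, and the fraction only makes sense on a host dedicated to MySQL.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def suggest_buffer_pool_size(total_ram_bytes, fraction=0.75, chunk_bytes=128 * MIB):
    """Return a candidate innodb_buffer_pool_size in bytes.

    Hypothetical helper: takes `fraction` of total RAM and rounds down to a
    multiple of `chunk_bytes` (assumed to be InnoDB's default chunk size).
    """
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    target = int(total_ram_bytes * fraction)
    # Round down so the result never exceeds the requested share of RAM.
    rounded = (target // chunk_bytes) * chunk_bytes
    # Never suggest less than a single chunk.
    return max(rounded, chunk_bytes)


if __name__ == "__main__":
    size = suggest_buffer_pool_size(16 * GIB)
    print(f"innodb_buffer_pool_size = {size // MIB}M")
```

Treat the result as a first guess and confirm it against actual memory use (free -m) once the replica is under load.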
3. Inefficient Queries on the Replica
- Diagnosis: SHOW PROCESSLIST on the replica might show the replication SQL (applier) thread stuck in one state for a long time, such as Applying event; the exact state text varies by MySQL version. This often correlates with slow queries running on the replica itself, so also check the replica’s slow query log.
    SHOW PROCESSLIST;
    # In my.cnf on replica:
    slow_query_log = 1
    slow_query_log_file = /var/log/mysql/mysql-slow.log
    long_query_time = 2
- Fix: Identify and optimize slow queries running on the replica. This could involve adding indexes or rewriting queries. Also ensure read_only is set on the replica to prevent accidental writes, which can be slow.
    -- Example: Add an index to speed up a common replica query
    CREATE INDEX idx_user_id ON orders (user_id);
- Why it works: Queries on the replica compete with the applier thread for CPU, I/O, and locks. Making them cheaper leaves more capacity for applying incoming binlog events.
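Once the slow query log is enabled, ranking its entries by execution time shows where to start. The sketch below assumes the standard text format of the MySQL slow log ("# Query_time: ... Rows_examined: ..." header lines); the function name slowest_queries and the sample log are illustrative, not taken from a real server.

```python
import re

# Matches the statistics header that precedes each statement in the slow log.
HEADER = re.compile(
    r"^# Query_time: (?P<qt>[\d.]+)\s+Lock_time: [\d.]+\s+"
    r"Rows_sent: \d+\s+Rows_examined: (?P<rows>\d+)"
)


def slowest_queries(log_text, top=5):
    """Return up to `top` log entries, slowest first (hypothetical helper)."""
    entries = []
    current = None
    for line in log_text.splitlines():
        match = HEADER.match(line)
        if match:
            current = {
                "query_time": float(match["qt"]),
                "rows_examined": int(match["rows"]),
                "sql": "",
            }
            entries.append(current)
            continue
        stripped = line.strip()
        # Skip blank lines and the other comment headers (# Time, # User@Host).
        if current is None or not stripped or stripped.startswith("#"):
            continue
        # Skip the timestamp bookkeeping line written before each statement.
        if stripped.startswith("SET timestamp="):
            continue
        current["sql"] = (current["sql"] + " " + stripped).strip()
    return sorted(entries, key=lambda e: e["query_time"], reverse=True)[:top]


# Made-up sample in the slow log's usual layout.
SAMPLE_LOG = """\
# Time: 2024-01-01T00:00:00.000000Z
# User@Host: app[app] @ localhost []  Id:    12
# Query_time: 2.100000  Lock_time: 0.000100 Rows_sent: 10  Rows_examined: 1200
SET timestamp=1704067200;
SELECT COUNT(*) FROM orders;
# Time: 2024-01-01T00:00:05.000000Z
# User@Host: app[app] @ localhost []  Id:    13
# Query_time: 3.500000  Lock_time: 0.000200 Rows_sent: 1  Rows_examined: 500000
SET timestamp=1704067205;
SELECT * FROM orders WHERE user_id = 42;
"""

if __name__ == "__main__":
    for entry in slowest_queries(SAMPLE_LOG):
        print(entry["query_time"], entry["rows_examined"], entry["sql"])
```

A high Rows_examined relative to the rows returned is a common hint that an index is missing.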
4. Large Transactions on the Primary
- Diagnosis: Large transactions on the primary can cause a significant backlog of events in the binary log. Compare File and Position from SHOW MASTER STATUS on the primary with Relay_Master_Log_File and Exec_Master_Log_Pos on the replica (shown as Relay_Source_Log_File and Exec_Source_Log_Pos by SHOW REPLICA STATUS in MySQL 8.0.22 and later). If the replica is far behind in position within the same log file, large transactions might be the culprit.
    # On Primary
    SHOW MASTER STATUS;
    # On Replica
    SHOW REPLICA STATUS\G
- Fix: Break large transactions into smaller ones. Avoid running INSERT ... SELECT against very large tables as a single statement; copy the rows in batches instead.
    -- Instead of:
    -- INSERT INTO huge_table SELECT * FROM another_huge_table WHERE ...;
    -- Use batching on an indexed key so each batch covers different rows
    -- (id here stands for the table's primary key):
    -- INSERT INTO huge_table SELECT * FROM another_huge_table WHERE ... AND id BETWEEN 1 AND 1000;
    -- Then repeat with the next id range, committing after each batch.
- Why it works: A transaction is written to the binary log only when it commits, and the replica applies it as a unit, so one huge transaction stalls everything queued behind it. Smaller transactions reach the replica sooner and let it keep pace more easily.
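One way to batch a large copy is to split it into primary-key ranges, so that every batch touches a distinct set of rows. The sketch below only generates the statements; it assumes an integer key column named id (an assumption, the source does not name one), reuses the table names from the example, and uses hypothetical helper names.

```python
def id_batches(min_id, max_id, batch_size):
    """Yield inclusive (low, high) id ranges that cover min_id..max_id."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    low = min_id
    while low <= max_id:
        high = min(low + batch_size - 1, max_id)
        yield low, high
        low = high + 1


def batched_insert_statements(min_id, max_id, batch_size=1000):
    """Build one INSERT ... SELECT per id range (illustrative SQL only)."""
    template = (
        "INSERT INTO huge_table SELECT * FROM another_huge_table "
        "WHERE id BETWEEN {low} AND {high};"
    )
    return [
        template.format(low=low, high=high)
        for low, high in id_batches(min_id, max_id, batch_size)
    ]


if __name__ == "__main__":
    for statement in batched_insert_statements(1, 2500):
        print(statement)
```

Each generated statement should run in its own transaction (for example with autocommit enabled), otherwise the batches collapse back into one large transaction in the binary log.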
5. Primary Server I/O Bottleneck
- Diagnosis: If the primary server’s disk subsystem is saturated, it will be slow to write to its binary log. Monitor disk I/O on the primary and look for high %util and await times (newer sysstat versions report r_await and w_await separately).
    iostat -xz 1
- Fix: Optimize queries on the primary, upgrade the primary’s storage (e.g., to faster SSDs), or offload read traffic to replicas.
- Why it works: Faster writes to the primary’s binary log mean events are available for the replica to read sooner.
6. Replica Server Disk I/O Bottleneck
- Diagnosis: The replica server might be struggling to write data to its own disks as it applies binlog events. Monitor disk I/O on the replica. High %util and await on the disks that hold the replica’s data directory indicate a bottleneck.
    iostat -xz 1
- Fix: Optimize queries on the replica, ensure innodb_buffer_pool_size is large enough to cache the working set, or upgrade the replica’s storage.
- Why it works: Faster disk writes on the replica allow it to apply incoming events more quickly.
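Watching iostat by eye works for a quick check; for repeated checks on either server, the %util column can be extracted from captured output. This is a rough sketch: the function name saturated_devices, the 90% threshold, and the trimmed sample output are all assumptions, and real iostat -x output has more columns whose names vary between sysstat versions, which is why the parser looks columns up by header name.

```python
def saturated_devices(iostat_text, util_threshold=90.0):
    """Return (device, util) pairs whose %util meets the threshold.

    Hypothetical helper for text captured from `iostat -x`.
    """
    columns = None
    hits = []
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            columns = None  # a blank line ends the current device table
            continue
        if fields[0].rstrip(":") == "Device":
            columns = fields  # header row: remember the column order
            continue
        if columns is None or "%util" not in columns:
            continue  # CPU summary or other non-device lines
        if len(fields) != len(columns):
            continue  # malformed row, ignore it
        util = float(fields[columns.index("%util")])
        if util >= util_threshold:
            hits.append((fields[0], util))
    return hits


# Trimmed, made-up sample; real output contains many more columns.
SAMPLE_IOSTAT = """\
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.00    0.00    2.00   40.00    0.00   53.00

Device            r/s     w/s   r_await   w_await   %util
sda              1.00  250.00      0.50     45.20   97.60
sdb              0.50    2.00      0.30      0.80    1.20
"""

if __name__ == "__main__":
    for device, util in saturated_devices(SAMPLE_IOSTAT):
        print(f"{device} is {util:.1f}% utilized")
```

On many SSD and NVMe devices %util can read high without the device being saturated, so pair it with the await columns before concluding that storage is the bottleneck.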
7. Replication Thread Contention (Multi-threaded Replication)
- Diagnosis: If using multi-threaded replication (slave_parallel_workers > 0; the variable is named replica_parallel_workers from MySQL 8.0.26), check SHOW PROCESSLIST for multiple applier worker threads that might be stuck, or observe that parallel workers are not effectively reducing lag.
    SHOW PROCESSLIST;
    SHOW REPLICA STATUS\G
    # Look for Replica_IO_Running, Replica_SQL_Running, Seconds_Behind_Source
    # (Slave_IO_Running, Slave_SQL_Running, Seconds_Behind_Master in SHOW SLAVE STATUS)
- Fix: Adjust slave_parallel_workers and slave_parallel_type. If slave_parallel_type=LOGICAL_CLOCK, ensure binlog_transaction_dependency_tracking=WRITESET on the primary so that more transactions are marked as safe to apply in parallel. If slave_parallel_type=DATABASE, only transactions that touch different databases run in parallel, so it helps little when the workload is concentrated in one schema; in either mode, ensure the replica has sufficient resources for the extra workers.
    # my.cnf on replica
    slave_parallel_workers = 8   # start with a value like 4 or 8 and tune
    slave_parallel_type = LOGICAL_CLOCK
- Why it works: Properly configured multi-threaded replication allows the replica to apply events in parallel, significantly speeding up catch-up.
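To judge whether a tuning change helped, the same few status fields need to be read before and after. The sketch below parses captured vertical output from SHOW REPLICA STATUS\G (or SHOW SLAVE STATUS\G) and accepts both the newer and the older field names; the helper names and the sample text are illustrative assumptions.

```python
def parse_status(vertical_output):
    """Parse vertical-format status text into a dict of field -> value."""
    status = {}
    for line in vertical_output.splitlines():
        if ":" not in line:
            continue  # skips the "*** 1. row ***" separator and blank lines
        key, _, value = line.partition(":")
        status[key.strip()] = value.strip()
    return status


def first_present(status, *names):
    """Return the value of the first field name found, else None."""
    for name in names:
        if name in status:
            return status[name]
    return None


def lag_report(status):
    """Summarize thread health and lag (hypothetical helper)."""
    lag = first_present(status, "Seconds_Behind_Source", "Seconds_Behind_Master")
    io = first_present(status, "Replica_IO_Running", "Slave_IO_Running")
    sql = first_present(status, "Replica_SQL_Running", "Slave_SQL_Running")
    return {
        "io_running": io == "Yes",
        "sql_running": sql == "Yes",
        # The server reports NULL when lag cannot be computed.
        "lag_seconds": int(lag) if lag not in (None, "NULL") else None,
    }


# Made-up sample showing only the fields used here.
SAMPLE_STATUS = """\
*************************** 1. row ***************************
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
        Seconds_Behind_Source: 120
"""

if __name__ == "__main__":
    print(lag_report(parse_status(SAMPLE_STATUS)))
```

A lag value of None together with sql_running set to False usually means the applier has stopped, which is a different problem from slow apply and needs the error fields of the status output instead.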
After fixing replication lag, you might encounter a "replica has a different set of tables" error if a DDL statement was executed on the primary after the replica fell behind and before you could fix it, especially if binlog_format is ROW.