MySQL replication lag occurs when the replica applies events from the primary’s binary log more slowly than the primary generates them.

Common Causes and Fixes for MySQL Replication Lag

1. Network Latency/Bandwidth Saturation

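  • Diagnosis: Use ping and iperf3 between the primary and replica to check for high latency or low throughput. iperf3 needs a server process (iperf3 -s) running on the replica before the client test below is started from the primary.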
    ping <replica_ip>
    iperf3 -c <replica_ip>
    
  • Fix: If network issues are identified, work with your network team to increase bandwidth or reduce latency. This might involve upgrading network hardware, optimizing routing, or moving servers closer physically.
  • Why it works: Faster, more reliable data transfer between primary and replica ensures the replica receives binlog events promptly.
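
To interpret the iperf3 figure, it can help to compare the measured throughput with the rate at which the primary produces binlog data. The following is a back-of-the-envelope sketch, not part of any MySQL tooling; the function names, the 0.7 headroom factor, and the sample numbers are assumptions chosen for illustration.

```python
import math


def network_can_keep_up(binlog_mb_per_s, link_mbit_per_s, headroom=0.7):
    """Return True if the link can carry the binlog stream with spare capacity.

    binlog_mb_per_s: binlog write rate on the primary, in megabytes per second.
    link_mbit_per_s: throughput reported by iperf3, in megabits per second.
    headroom: assumed fraction of the link that replication may use.
    """
    required_mbit = binlog_mb_per_s * 8  # megabytes/s -> megabits/s
    return required_mbit <= link_mbit_per_s * headroom


def catch_up_seconds(backlog_mb, binlog_mb_per_s, link_mbit_per_s):
    """Estimate how long the replica needs to download an existing backlog
    while the primary keeps writing at the same rate."""
    spare_mb_per_s = link_mbit_per_s / 8 - binlog_mb_per_s
    if spare_mb_per_s <= 0:
        return math.inf  # the link never catches up at this write rate
    return backlog_mb / spare_mb_per_s
```

For example, with a 100 Mbit/s link and a primary writing 5 MB/s of binlog, a 600 MB backlog would take about 80 seconds to transfer under these assumptions; at 20 MB/s the same link could never catch up.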

2. Under-provisioned Replica Server Resources (CPU/RAM)

  • Diagnosis: Monitor the replica server’s CPU and RAM utilization. Sustained CPU usage above 80% or any swapping indicates resource starvation.
    top # or htop
    free -m
    
  • Fix: Upgrade the replica server’s hardware (more CPU cores, more RAM) or optimize its configuration.
    # my.cnf on replica
    innodb_buffer_pool_size = 4G # adjust to roughly 70-80% of RAM on a dedicated database host
    innodb_log_file_size = 512M # larger redo logs can help write throughput; on MySQL 8.0.30+ use innodb_redo_log_capacity instead
    max_connections = 500 # size this for the replica's read workload
    
  • Why it works: A more powerful replica server can process I/O and CPU-bound tasks faster, including applying binlog events.
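
The 70-80% guideline above can be turned into a concrete value with a small calculation. This is a minimal sketch under stated assumptions: the helper name is hypothetical, the host is assumed to be dedicated to MySQL, and the result is rounded down to a multiple of 128 MiB, which is the default innodb_buffer_pool_chunk_size (the server may adjust the value further if it does not fit chunk size times instance count).

```python
CHUNK_BYTES = 128 * 1024 * 1024  # default innodb_buffer_pool_chunk_size


def suggest_buffer_pool_bytes(total_ram_bytes, fraction=0.75, chunk_bytes=CHUNK_BYTES):
    """Return fraction * RAM, rounded down to a whole number of chunks."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    target = int(total_ram_bytes * fraction)
    return (target // chunk_bytes) * chunk_bytes


def as_my_cnf_line(size_bytes):
    """Format the size as a my.cnf setting expressed in megabytes."""
    return f"innodb_buffer_pool_size = {size_bytes // (1024 * 1024)}M"


if __name__ == "__main__":
    ram = 16 * 1024**3  # example: a replica with 16 GiB of RAM
    print(as_my_cnf_line(suggest_buffer_pool_bytes(ram)))
```

On a host that also runs other services, a lower fraction would be the safer starting point.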

3. Inefficient Queries on the Replica

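  • Diagnosis: SHOW PROCESSLIST on the replica may show the replication applier thread sitting for a long time in a state such as Applying batch of row changes while client queries compete with it for CPU, I/O, or locks. A worker in Waiting for an event from Coordinator is simply idle. Check the replica’s slow query log to find the statements responsible.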
    SHOW PROCESSLIST;
    # In my.cnf on replica:
    # slow_query_log = 1
    # slow_query_log_file = /var/log/mysql/mysql-slow.log
    # long_query_time = 2
    
  • Fix: Identify and optimize slow queries running on the replica. This could involve adding indexes, rewriting queries, or ensuring read_only is set on the replica to prevent accidental writes that could be slow.
    -- Example: Add an index to speed up a common replica query
    CREATE INDEX idx_user_id ON orders (user_id);
    
  • Why it works: By reducing the time spent on queries within the replica, more time is freed up to apply incoming binlog events.
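
When the process list is long, filtering it programmatically makes the competing client queries easier to spot. The sketch below assumes the rows of SHOW PROCESSLIST have already been fetched as dictionaries keyed by column name (as a dictionary-style cursor in a MySQL client library would return them); the function name and the two-second threshold are illustrative assumptions.

```python
def long_running_client_queries(rows, min_seconds=2):
    """Return client queries running for at least min_seconds, slowest first.

    Replication threads are skipped: MySQL reports them under the
    pseudo-user "system user".
    """
    candidates = [
        row for row in rows
        if row.get("User") != "system user"
        and row.get("Command") == "Query"
        and (row.get("Time") or 0) >= min_seconds
    ]
    return sorted(candidates, key=lambda row: row["Time"], reverse=True)


# Hypothetical sample rows for illustration only.
sample_rows = [
    {"Id": 5, "User": "system user", "Command": "Query", "Time": 120, "Info": None},
    {"Id": 9, "User": "report", "Command": "Query", "Time": 45, "Info": "SELECT ..."},
    {"Id": 10, "User": "app", "Command": "Sleep", "Time": 300, "Info": None},
    {"Id": 11, "User": "app", "Command": "Query", "Time": 0, "Info": "SELECT 1"},
]
```

With the sample rows above, only connection 9 is returned: the applier thread, the sleeping connection, and the sub-second query are all filtered out.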

4. Large Transactions on the Primary

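  • Diagnosis: A large transaction is written to the binary log only when it commits, and the replica then applies it as a single unit, so everything queued behind it has to wait. Compare File and Position from SHOW MASTER STATUS on the primary (SHOW BINARY LOG STATUS on MySQL 8.4 and later) with Relay_Source_Log_File and Exec_Source_Log_Pos from SHOW REPLICA STATUS on the replica. In the older SHOW SLAVE STATUS output these fields are named Relay_Master_Log_File and Exec_Master_Log_Pos. If the replica is far behind in position within the same log file, or its executed position does not move for a long time, large transactions might be the culprit.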
    # On Primary
    SHOW MASTER STATUS;
    
    # On Replica
    SHOW REPLICA STATUS\G
    
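  • Fix: Break large transactions into smaller ones. Avoid running INSERT ... SELECT, or bulk UPDATE and DELETE statements, over very large tables in a single statement; process the rows in bounded batches and commit after each one.
    -- Instead of:
    -- INSERT INTO huge_table SELECT * FROM another_huge_table WHERE ...;
    -- Use batching on an indexed key (here assuming an integer primary key named id).
    -- A bare LIMIT without ORDER BY is non-deterministic and would select the same rows again on each run.
    -- INSERT INTO huge_table SELECT * FROM another_huge_table WHERE id BETWEEN 1 AND 1000;
    -- Then repeat with the next key range (1001 to 2000, and so on).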
    
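  • Why it works: A transaction reaches the binary log only at commit and is applied on the replica as one unit. Smaller transactions commit sooner, take less time to apply, and hold locks for shorter periods, allowing the replica to keep pace more easily.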
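
Splitting one large statement into key-range batches can be sketched as follows. This is an illustration under assumptions rather than a drop-in migration tool: it assumes the source table has an integer primary key called id, reuses the placeholder table names from the example above, and only generates the statements; executing them, one transaction per batch, is left to whatever client you use.

```python
def pk_batches(min_id, max_id, batch_size):
    """Yield inclusive (start, end) primary-key ranges covering min_id..max_id."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    start = min_id
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield start, end
        start = end + 1


def batch_statements(min_id, max_id, batch_size):
    """Build one INSERT ... SELECT per key range (table names are placeholders)."""
    return [
        "INSERT INTO huge_table SELECT * FROM another_huge_table "
        f"WHERE id BETWEEN {start} AND {end};"
        for start, end in pk_batches(min_id, max_id, batch_size)
    ]


if __name__ == "__main__":
    for statement in batch_statements(1, 2500, 1000):
        print(statement)
```

Ranges based on the key stay deterministic even when ids have gaps; a batch may then simply contain fewer rows than batch_size.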

5. Primary Server I/O Bottleneck

  • Diagnosis: If the primary server’s disk subsystem is saturated, it will be slow to write to its binary log. Monitor disk I/O on the primary.
    iostat -xz 1
    
    Look for high %util and await times.
  • Fix: Optimize queries on the primary, upgrade the primary’s storage (e.g., to faster SSDs), or offload read traffic to replicas.
  • Why it works: Faster writes to the primary’s binary log mean events are available for the replica to read sooner.

6. Replica Server Disk I/O Bottleneck

  • Diagnosis: The replica server might be struggling to write data to its own disks as it applies binlog events. Monitor disk I/O on the replica.
    iostat -xz 1
    
    High %util and await on the replica’s data directory disks indicate a bottleneck.
  • Fix: Optimize queries on the replica, ensure sufficient innodb_buffer_pool_size to cache data, or upgrade the replica’s storage.
  • Why it works: Faster disk writes on the replica allow it to apply incoming events more quickly.
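
Reading iostat output by eye gets tedious over long captures, so a small parser can flag saturated devices. The sketch below is an assumption-laden helper, not part of sysstat: it locates the %util column by its header name because the column layout differs between sysstat versions, it expects a period as the decimal separator, and the 80% threshold and sample text are illustrative.

```python
def busy_devices(iostat_text, util_threshold=80.0):
    """Return {device: %util} for devices at or above util_threshold."""
    header = None
    busy = {}
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("Device", "Device:"):
            header = fields  # remember the column names of the device table
            continue
        # Skip anything that is not a device row (for example the avg-cpu lines).
        if header is None or "%util" not in header or len(fields) != len(header):
            continue
        try:
            util = float(fields[header.index("%util")])
        except ValueError:
            continue
        if util >= util_threshold:
            busy[fields[0]] = util
    return busy


# Hypothetical capture, shortened to a few columns for illustration.
SAMPLE = """\
Device            r/s     w/s     rkB/s     wkB/s  r_await w_await  %util
sda              1.00  250.00     16.00  40000.00     0.50   35.20  97.30
sdb              0.50    2.00      8.00     32.00     0.40    0.90   1.20
"""
```

Running busy_devices(SAMPLE) on this capture reports only sda, the device whose utilization is near saturation.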

7. Replication Thread Contention (Multi-threaded Replication)

  • Diagnosis: If using multi-threaded replication (slave_parallel_workers > 0, named replica_parallel_workers from MySQL 8.0.26), check SHOW PROCESSLIST for applier worker threads that appear stuck, or observe that parallel workers are not effectively reducing lag.
    SHOW PROCESSLIST;
    # Look for Replica_IO_Running, Replica_SQL_Running, Seconds_Behind_Source
    # (Slave_IO_Running, Slave_SQL_Running, Seconds_Behind_Master in SHOW SLAVE STATUS)
    SHOW REPLICA STATUS\G
    
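  • Fix: Adjust slave_parallel_workers and slave_parallel_type (replica_parallel_workers and replica_parallel_type from MySQL 8.0.26). With slave_parallel_type=LOGICAL_CLOCK, set binlog_transaction_dependency_tracking=WRITESET on the primary, on versions that still provide that variable, so that more transactions are marked as safe to apply in parallel. With slave_parallel_type=DATABASE, only transactions that touch different databases can run in parallel, so it helps little when most writes go to a single schema. In either mode, ensure the replica has sufficient CPU and I/O capacity for the extra workers.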
    # my.cnf on replica
    slave_parallel_workers = 8 # Start with a value like 4 or 8 and tune
    slave_parallel_type = LOGICAL_CLOCK
    
  • Why it works: Properly configured multi-threaded replication allows the replica to apply events in parallel, significantly speeding up catch-up.
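
To judge whether a tuning change helped, the replica status can be checked repeatedly and summarized. The sketch below assumes one row of SHOW REPLICA STATUS (or the older SHOW SLAVE STATUS) has been fetched as a dictionary; it accepts both generations of field names. The function names and the 30-second threshold are assumptions for illustration.

```python
def first_present(status, *names):
    """Return the value of the first key from names found in status, else None."""
    for name in names:
        if name in status:
            return status[name]
    return None


def summarize_replica(status, max_lag_seconds=30):
    """Classify a replica as 'stopped', 'unknown', 'lagging', or 'ok'."""
    io_running = first_present(status, "Replica_IO_Running", "Slave_IO_Running")
    sql_running = first_present(status, "Replica_SQL_Running", "Slave_SQL_Running")
    lag = first_present(status, "Seconds_Behind_Source", "Seconds_Behind_Master")

    if io_running != "Yes" or sql_running != "Yes":
        return "stopped"
    if lag is None:  # the server reports NULL when it cannot compute the lag
        return "unknown"
    return "lagging" if int(lag) > max_lag_seconds else "ok"
```

Calling summarize_replica before and after changing the worker count gives a quick, if coarse, view of whether the lag is shrinking.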

After fixing replication lag, you might encounter a "replica has a different set of tables" error if a DDL statement was executed on the primary after the replica fell behind and before you could fix it, especially if binlog_format is ROW.
