Solving a MySQL replication failure caused by a full disk

Time: 2021-10-20

Case scenario

Today a problem surfaced in production. Because monitoring did not cover this machine, its disk filled up, which broke the online MySQL master-slave replication. The symptoms were as follows:


localhost.(none)>show slave status\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 10.xx.xx.xx
                  Master_User: replica
                  Master_Port: 5511
                Connect_Retry: 60
              Master_Log_File:
          Read_Master_Log_Pos: 4
               Relay_Log_File: relay-bin.001605
                Relay_Log_Pos: 9489761
        Relay_Master_Log_File:
             Slave_IO_Running: No
            Slave_SQL_Running: No
                   Last_Errno: 13121
                   Last_Error: Relay log read failure: Could not parse relay log event entry. 
The possible reasons are: the master's binary log is corrupted (you can check this by running 
'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by 
running 'mysqlbinlog' on the relay log), a network problem, the server was unable to fetch a
 keyring key required to open an encrypted relay log file, or a bug in the master's or 
slave's MySQL code. If you want to check the master's binary log or slave's relay log, 
you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.

Checking the error log showed the following entries:


2021-03-31T11:34:39.367173+08:00 11 [Warning] [MY-010897] [Repl] Storing MySQL user name or 
password information in the master info repository is not secure and is therefore not 
recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; 
see the 'START SLAVE Syntax' in the MySQL Manual for more information.

2021-03-31T11:34:39.368161+08:00 12 [ERROR] [MY-010596] [Repl] Error reading relay log 
event for channel '': binlog truncated in the middle of event; consider out of disk space

2021-03-31T11:34:39.368191+08:00 12 [ERROR] [MY-013121] [Repl] Slave SQL for channel '': Relay 
log read failure: Could not parse relay log event entry. The possible reasons are: the master's 
binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the 
slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log),
 a network problem, the server was unable to fetch a keyring key required to open an encrypted
 relay log file, or a bug in the master's or slave's MySQL code. If you want to check the 
master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW
 SLAVE STATUS' on this slave. Error_code: MY-013121

2021-03-31T11:34:39.368205+08:00 12 [ERROR] [MY-010586] [Repl] Error running query, slave SQL
 thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We 
stopped at log 'mysql-bin.000446' position 9489626

As can be seen from the message, the error log is fairly helpful: it detects the disk problem and prompts us to "consider out of disk space".
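As the message itself also suggests, the relay log can be inspected with mysqlbinlog to confirm the truncation. A minimal sketch, run from the shell (the file name comes from Relay_Log_File above; the directory is an assumption):

# Parse the relay log named in Relay_Log_File; a truncated file makes
# mysqlbinlog stop with an error at the incomplete event (directory assumed).
mysqlbinlog /data/mysql/relay-bin.001605 > /dev/null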

Solving the problem

Logging in to the server, it quickly became clear that disk utilization on the machine hosting MySQL had reached 100%, consistent with what the error log reported.
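For reference, confirming a full disk and finding what is eating the space takes only a couple of standard commands; a minimal sketch (the data directory path is an assumption):

# Check disk usage per filesystem
df -h
# Find the largest items under the MySQL data directory (path assumed)
du -sh /data/mysql/* | sort -rh | head -20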

Now to fix it. The basic idea is to free up disk space and then rebuild the replication relationship. This sounds simple enough, but in practice the following errors appeared while rebuilding the replication relationship:

### For GTID-based replication, we want to rebuild the replication relationship
localhost.(none)>reset slave;
ERROR 1371 (HY000): Failed purging old relay logs: Failed during log reset

localhost.(none)>reset slave all;
ERROR 1371 (HY000): Failed purging old relay logs: Failed during log reset

Step 1: since replication is GTID-based, the plan was to record the output of show slave status, then run reset slave and use a change master statement to rebuild the replication relationship.

However, the error above appeared. Judging from the message, MySQL could not purge the relay logs on its own, which seemed odd. Fine: if MySQL cannot purge the relay logs itself, we can do it by hand.
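Before deleting anything manually, it is worth confirming where the relay log files actually live; a small sketch (the directory shown is an assumption):

localhost.(none)>show variables like 'relay_log%';
### relay_log_basename and relay_log_index show where the relay log files and
### the relay-bin.index file sit on disk.

# From the shell, list the relay log files before removing them (path assumed)
ls -lh /data/mysql/relay-bin.*

Note that relay-bin.index lists these files; if it is removed together with the logs, mysqld can no longer read it, which matches the "I/O error reading log index file" seen in the next step.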

Step 2: manually delete all the relay logs with rm -f, after which the error message becomes:


localhost.(none)>reset slave all;
ERROR 1374 (HY000): I/O error reading log index file

Well, well, the problem hasn’t been solved.

After some thought: since reset slave cannot clean up the relay logs, what about simply running stop slave and then change master?

Step 3: run stop slave and then change master directly, without executing the reset slave all statement. The result is as follows:


localhost.(none)>change master to master_host='10.13.224.31',
    -> master_user='replica',
    -> master_password='eHnNCaQE3ND',
    -> master_port=5510,
    -> master_auto_position=1;
ERROR 1371 (HY000): Failed purging old relay logs: Failed during log reset

Well, the problem remains.

Step 4: replication is already broken and disconnected anyway, so run start slave and see what happens. A dramatic scene unfolds:


localhost.(none)>start slave;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    262
Current database: *** NONE ***


Query OK, 0 rows affected (0.01 sec)


localhost.(none)>
[root@xxxx ~]#

After start slave is executed, the instance crashes immediately.

At this point, replication is completely broken and the slave instance itself is down.
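The instance therefore has to be brought back up before anything else. How it is restarted depends on how mysqld is managed on that host; a minimal sketch, assuming a systemd unit named mysqld and an assumed error-log path:

# Restart the crashed MySQL instance (service name is an assumption)
systemctl restart mysqld
# Confirm the server answers again
mysqladmin -uroot -p ping
# Check the error log for the crash-recovery messages (path assumed)
tail -n 50 /data/mysql/error.log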

Step 5: check whether the instance can be restarted. Trying a restart (along the lines of the sketch above) shows that it comes back up fine. After the restart, check the replication relationship; the results are as follows:


localhost.(none)>show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Queueing master event to the relay log
                  Master_Host: 10.xx.xx.xx
                  Master_User: replica
                  Master_Port: 5511
                Connect_Retry: 60
              Master_Log_File: 
          Read_Master_Log_Pos: 4
               Relay_Log_File: relay-bin.001605
                Relay_Log_Pos: 9489761
        Relay_Master_Log_File:
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
                   Last_Errno: 13121
                   Last_Error: Relay log read failure: Could not parse relay log event entry.
 The possible reasons are: the master's binary log is corrupted (you can check this by running 
'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by 
running 'mysqlbinlog' on the relay log), a network problem, the server was unable to fetch a 
keyring key required to open an encrypted relay log file, or a bug in the master's or slave's 
MySQL code. If you want to check the master's binary log or slave's relay log, you will be able 
to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
                 Skip_Counter: 0

The replication relationship still reports an error.

Step 6: run reset slave all again. This time it succeeds:


localhost.(none)>stop slave;
Query OK, 0 rows affected (0.00 sec)


localhost.(none)>reset slave all;
Query OK, 0 rows affected (0.03 sec)

Step 7: rebuild the replication relationship and start replication:


localhost.(none)>change master to master_host='10.xx.xx.xx',
    -> master_user='replica',
    -> master_password='xxxxx',
    -> master_port=5511,
    -> master_auto_position=1;
Query OK, 0 rows affected, 2 warnings (0.01 sec)


localhost.(none)>start slave;
Query OK, 0 rows affected (0.00 sec)


localhost.(none)>show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.xx.xx.xx
                  Master_User: replica
                  Master_Port: 5511
                Connect_Retry: 60
                          ...
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes

This time the replication relationship is re-established successfully.

A brief summary

When the disk is full, the MySQL server can no longer write to its replication metadata tables, and the relay log may be left incomplete. If you simply free up disk space on the server and then run change master to rebuild the master-slave replication relationship, the errors shown above may appear and cannot be repaired directly, because this is not a normal scenario for tearing down a replication relationship.
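The "replication metadata" here lives, with the default settings of recent MySQL versions, in tables in the mysql schema; they can be inspected as below (this check is a sketch, not part of the original troubleshooting):

localhost.(none)>show variables like '%info_repository';
### With the default value TABLE, the slave connection and position metadata
### are stored in mysql.slave_master_info and mysql.slave_relay_log_info.
localhost.(none)>select * from mysql.slave_relay_log_info\G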

Therefore, the correct approach should be the following (consolidated as a console sketch after the list):

1. Free up disk space on the server

2. Restart the slave instance whose replication is broken

3. Run reset slave all and then change master to re-establish the master-slave replication relationship
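Put together as console commands, the recovery sequence looks roughly like this (host, user, and password are placeholders, and the service name is an assumption):

# 1. After freeing up disk space, restart the slave instance
systemctl restart mysqld

### 2. Reset the old replication state and rebuild it
localhost.(none)>stop slave;
localhost.(none)>reset slave all;
localhost.(none)>change master to master_host='10.xx.xx.xx',
    -> master_user='replica',
    -> master_password='xxxxx',
    -> master_port=5511,
    -> master_auto_position=1;
localhost.(none)>start slave;
localhost.(none)>show slave status\G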

If there is a better way, please don’t hesitate to give advice.

That covers the solution to a MySQL replication failure caused by a full disk. For more on troubleshooting MySQL replication failures, see the other related articles on developeppaer.