On my first day at the new job, I ran into a monster of a MySQL bug! I almost walked out carrying the blame

Time: 2020-11-26


This year I didn't dare jump ship without a solid place to land. A few days ago I finally moved to a new employer, and on my very first day something blew up that scared me into a cold sweat (the boss hovered over me the whole time, pushing to get it fixed fast). I nearly got fired on day one.

So what happened? Let me tell the story from the beginning. I hope it can serve as a useful reference for you.

Problem description

After an important customer's MySQL instance was upgraded from 5.6 to 5.7, inserts on the master started failing with "duplicate key" errors, and the same error appeared on both the primary/standby instances and the read-only (RO) instances.

Take one of the tables as an example. Before the migration, `SHOW CREATE TABLE` reported AUTO_INCREMENT=1758609; after the migration it became 1758598, while the maximum value of the auto-increment column in the migrated table, per `MAX()`, is 1758609.

The customer uses the InnoDB engine. According to the operations team, they had run into similar problems before, and things returned to normal after a restart.

Kernel troubleshooting

Since the user reported that access was normal on 5.6 and the error only appeared after switching to 5.7, the first suspicion is a problem in the 5.7 kernel, so the first reaction is to search the official bug list for similar issues, to avoid reinventing the wheel. The search did turn up a similar upstream bug; here is a brief introduction to it.

Background knowledge 1

Auto-increment parameters and data structures of the InnoDB engine.

The main parameters are: `innodb_autoinc_lock_mode`, which controls the lock mode used when allocating auto-increment values; and `auto_increment_increment` / `auto_increment_offset`, which control the step and the starting offset of the auto-increment column.

The main structures involved are: the data dictionary structure, which holds the table's current auto-increment value and the lock protecting it; the transaction structure, which tracks how many rows the transaction has processed; and the handler structure, which holds iteration state for multi-row operations within a transaction.
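To illustrate how `auto_increment_increment` and `auto_increment_offset` shape the generated values, here is a minimal Python sketch of the series arithmetic. It is a simplification of InnoDB's `innobase_next_autoinc`, not the real implementation (which also handles overflow and multi-row reservations); the function name is mine.

```python
def next_in_series(current, increment, offset):
    """Smallest value >= current of the form offset + k * increment (k >= 0).

    Simplified model of how MySQL rounds a candidate auto-increment
    value up onto the configured series.
    """
    if current <= offset:
        return offset
    k = -(-(current - offset) // increment)  # ceiling division
    return offset + k * increment

# With auto_increment_increment=2, auto_increment_offset=1,
# the generated values are 1, 3, 5, ...
print(next_in_series(1, 2, 1))  # 1
print(next_in_series(2, 2, 1))  # 3
print(next_in_series(6, 2, 1))  # 7
```

With the defaults (`increment=1`, `offset=1`) every candidate value is already on the series, which is why the bugs below only bite when these parameters are changed.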

Background knowledge 2

How MySQL and the InnoDB engine read and modify the auto-increment value.

  • (1) The data dictionary structure (`dict_table_t`) saves and restores the auto-increment value when it is swapped out of and back into the cache. On swap-out, the value is saved into a global mapping table and the in-memory `dict_table_t` is evicted; on load, the value is restored from the global mapping table into the `dict_table_t` structure. The related functions are `dict_table_add_to_cache` and `dict_table_remove_from_cache_low`.
  • (2) The `row_import` and table-truncate paths also update the auto-increment value.
  • (3) When a handler is opened for the first time, it queries the maximum value of the table's auto-increment column and initializes the `autoinc` field of the table's `dict_table_t` structure with that maximum plus 1.
  • (4) The insert path. The call stack that modifies autoinc is as follows:
`ha_innobase::write_row`: step 3 of write_row calls `handler::update_auto_increment` to update the auto-increment value.
`handler::update_auto_increment`: calls into InnoDB to obtain an auto-increment value and, based on the current auto-increment parameters, sets the value of the next auto-increment column the current handler will process.
`ha_innobase::get_auto_increment`: reads the current auto-increment value from `dict_table_t` and advances the next auto-increment value in the data dictionary according to the global parameters.
`ha_innobase::dict_table_autoinc_initialize`: updates the auto-increment value, but only if the specified value is larger than the current one.
`handler::set_next_insert_id`: sets the auto-increment value for the next row to be processed in the current transaction.
  • (5) `update_row`. For an "INSERT INTO t (c1, c2) VALUES (x, y) ON DUPLICATE KEY UPDATE" statement, the auto-increment value must be pushed forward whether or not the row referenced by the unique index already exists. The relevant code is as follows:
if (error == DB_SUCCESS
    && table->next_number_field
    && new_row == table->record[0]
    && thd_sql_command(m_user_thd) == SQLCOM_INSERT
    && trx->duplicates) {
    ulonglong auto_inc;
    /* ... */
    auto_inc = table->next_number_field->val_int();
    auto_inc = innobase_next_autoinc(auto_inc, 1, increment, offset, col_max_value);
    error = innobase_set_max_autoinc(auto_inc);
    /* ... */
}
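The effect of this branch can be sketched in Python (a toy model with illustrative names, not MySQL code): after an INSERT ... ON DUPLICATE KEY UPDATE touches an existing row, the table's counter is pushed past that row's auto-increment value, and the counter only ever moves forward.

```python
def set_max_autoinc(table, value):
    """Mirror of innobase_set_max_autoinc: only ever raises the counter."""
    table["autoinc"] = max(table["autoinc"], value)

def on_duplicate_key_update(table, row_autoinc_value, increment=1):
    """After an upsert hits an existing row, push the table counter
    past that row's auto-increment column value (simplified: assumes
    values are already on the configured series)."""
    nxt = row_autoinc_value + increment
    set_max_autoinc(table, nxt)

t = {"autoinc": 10}
on_duplicate_key_update(t, 20)   # a row with id=20 already existed
print(t["autoinc"])              # 21: counter pushed forward
on_duplicate_key_update(t, 5)    # a smaller id never lowers the counter
print(t["autoinc"])              # 21
```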

Judging from the actual business workload, only the insert and update paths can be involved in our errors.

BUG 76872 / 88321: "InnoDB AUTO_INCREMENT produces same value twice"
  • (1) Bug overview: when `innodb_autoinc_lock_mode` is greater than 0 and `auto_increment_increment` is greater than 1, a "duplicate key" error occurs after the system restarts.
  • (2) Cause analysis: after a restart, InnoDB sets the auto-increment value to max(id) + 1.

At this point, on the first insert, the write_row path calls `handler::update_auto_increment` to set up the autoinc-related state. It first obtains the current auto-increment value (i.e. max(id) + 1) through `ha_innobase::get_auto_increment`, then advances the next auto-increment value to next_id according to the auto-increment parameters.

When `auto_increment_increment` is greater than 1, max(id) + 1 is generally smaller than next_id. After `handler::update_auto_increment` receives the value returned by the engine layer, it recalculates the row's auto-increment value according to the parameters, to cover engines that do not take the current auto-increment parameters into account when generating the value. The value the handler layer computes from the ID InnoDB returned is therefore next_id, and a row with auto-increment ID next_id is inserted.

At the end of write_row, the handler layer sets the table's next auto-increment value based on the current row's next_id. But if another thread starts an insert before write_row has set that next value, the value it obtains will also be next_id. This is what produces the duplicate.

  • (3) Fix: take the global auto-increment parameters into account when generating the value inside the engine. That way, the value the first insert thread obtains after a restart is not max(id) + 1 but next_id itself, and the engine then sets the next auto-increment value based on next_id. Since this all happens under the autoinc lock, other threads cannot obtain a duplicate value.
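To make the race concrete, here is a toy Python model of the pre-fix behavior (the names and structure are mine, not MySQL's): the engine counter restarts at max(id) + 1 and hands out raw values, while the handler layer rounds each value onto the series, so two threads can land on the same next_id.

```python
def round_to_series(value, increment, offset):
    """Handler-layer rounding: smallest series value >= value."""
    if value <= offset:
        return offset
    k = -(-(value - offset) // increment)  # ceiling division
    return offset + k * increment

class Engine:
    """Pre-fix InnoDB model: reserves raw values, ignoring the series."""
    def __init__(self, max_id):
        self.counter = max_id + 1  # counter value right after restart

    def get_auto_increment(self):
        v = self.counter
        self.counter += 1
        return v

# max(id) = 5, auto_increment_increment = 2, auto_increment_offset = 1
engine = Engine(max_id=5)
t1 = round_to_series(engine.get_auto_increment(), 2, 1)  # engine gives 6 -> 7
t2 = round_to_series(engine.get_auto_increment(), 2, 1)  # engine gives 7 -> 7
print(t1, t2)  # 7 7 -> duplicate key
```

With the fix, the engine itself would hand out 7 and advance its counter to 9, so the second thread gets 9 and no collision occurs.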

The analysis above shows this bug only occurs when `innodb_autoinc_lock_mode` > 0 and `auto_increment_increment` > 1. In the actual online environment both parameters are set to 1, so this bug can be ruled out as the cause of the online problem.

Field analysis and recurrence verification

Since the official bug does not explain our problem, we have to roll up our sleeves and analyze the symptoms ourselves.

(1) Looking for a pattern between max(id) and the auto-increment value. Because the user's tables have a column defined with ON UPDATE CURRENT_TIMESTAMP, we grabbed, for every failing table, its max(id), its auto-increment value, and the most recently updated rows, to see whether there was a pattern. The captured information is as follows:

[Figure: captured max(id), auto-increment values and recently updated rows]

At first glance there is a clear pattern. The update-time column records the last insert or update. Combined with the auto-increment and max(id) values, it looks very much as if the last batch of transactions only wrote the auto-increment IDs of the rows without advancing the table's auto-increment counter.

The official documentation on auto-increment usage notes that an UPDATE can change an auto-increment column without advancing the auto-increment counter. Following this idea, I tried to reproduce the user's scenario. The reproduction method is as follows:

[Figure: reproduction steps]

Meanwhile, in the binlog we can also see the operation that updates the auto-increment column. As shown in the figure:

[Figure: binlog event updating the auto-increment column]

However, since the binlog is in row format, we cannot tell whether the change to the auto-increment column was caused by a kernel problem or by the user's own UPDATE. So we contacted the customer to confirm. The user was quite sure that no update of the auto-increment column had ever been performed.

So where did these updates of the auto-increment column come from?

(2) Analyzing the user's tables and SQL statements, we found there are three kinds of tables in total:

hz_notice_stat_sharding
hz_notice_group_stat_sharding
hz_freeze_balance_sharding

All three tables have auto increment primary keys.

However, only the first two show autoinc errors; the `hz_freeze_balance_sharding` table never fails. Do users access the two kinds of tables differently? We captured the user's SQL statements, and sure enough: the first two tables are written with REPLACE INTO, while the last one is written with UPDATE. Could the problem be caused by the REPLACE INTO statement? Searching the official bug list again turned up another suspect.

bug #87861: “Replace into causes master/slave have different auto_increment offset values”

Cause:

  • (1) MySQL actually implements REPLACE INTO as delete + insert, but under the row binlog format an update-type event is written to the binlog when the conflicting row exists. An insert advances the auto-increment counter synchronously; an update does not.
  • (2) On the master, REPLACE INTO executes as delete + insert, so the auto-increment counter stays correct. Once replicated to the slave in row format, the slave replays it as an update, which only changes the value of the auto-increment column in the row and does not advance the counter.

As a result, max(id) becomes greater than the auto-increment counter on the slave. This causes no error while it remains a slave: in row mode, the binlog records the values of all columns for inserts, so replay does not re-allocate auto-increment IDs. But once the slave is promoted to master, the first INSERT it handles hits a "duplicate key" error.

  • (3) Since the user migrated from 5.6 to 5.7 and then inserted directly on 5.7, it is equivalent to a slave being promoted to master, hence the error.
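The divergence described in (1)–(3) can be modeled in a few lines of Python (a toy model with names of my own choosing, not MySQL code):

```python
class Instance:
    def __init__(self):
        self.rows = {}      # unique_key -> auto-increment id
        self.autoinc = 1    # next id this instance would hand out

    def replace_into(self, key):
        """Master-side REPLACE INTO: delete + insert, counter advances.
        Returns the row-format binlog event that gets logged."""
        new_id = self.autoinc
        self.autoinc += 1
        existed = key in self.rows
        self.rows[key] = new_id
        # Bug #87861: when the row already existed, an UPDATE event is logged.
        return ("update", key, new_id) if existed else ("insert", key, new_id)

    def apply(self, event):
        """Slave-side replay of a row-format event."""
        kind, key, new_id = event
        self.rows[key] = new_id
        if kind == "insert":
            self.autoinc = max(self.autoinc, new_id + 1)
        # "update" events do NOT advance the counter.

master, slave = Instance(), Instance()
slave.apply(master.replace_into("a"))  # insert event: both counters advance
slave.apply(master.replace_into("a"))  # conflict -> logged as an update event
print(master.autoinc, slave.autoinc)   # 3 2: counters have diverged
print(max(slave.rows.values()))        # 2: max(id) >= autoinc on the slave
# After a failover, the promoted slave would hand out id=2 again: duplicate key.
```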

Solution

Possible solutions on the business side:

  • (1) Change the binlog format to MIXED or STATEMENT.
  • (2) Replace REPLACE INTO with INSERT ... ON DUPLICATE KEY UPDATE.

Possible solutions on the kernel side:

  • (1) When a REPLACE INTO statement is executed under row format, record a statement-format log event carrying the original statement in the binlog.
  • (2) Under row format, log the REPLACE INTO as a delete event plus an insert event instead of a single update event.

Experience

  • (1) Changing the `innodb_autoinc_lock_mode` and `auto_increment_increment` parameters can easily cause duplicate keys; avoid modifying them dynamically in production.
  • (2) When an online problem occurs, first do a thorough job of preserving the scene: clarify the failure scenario, the user's SQL statements, and the scope of the failure, and at the same time back up the instance's configuration, binlogs, and even the instance data, to prevent evidence from being lost as logs expire.

Only then can we accurately match the scene against official bugs, and, if no official bug matches, analyze the problem independently from the clues at hand.

Author: Tencent database technology
Source: http://r6e.cn/df8b
