The home page of an English learning app displays random words: each user level has its own word list, and every time a user visits the home page, three words from that list are shown at random. The team found that as the word list grows, the word-picking logic gets slower and slower, to the point of slowing down the opening of the home page itself.
To simplify, drop the per-level word lists and just pick three words at random from a single table. The table definition and the commands that load the initial data are as follows:
mysql> CREATE TABLE `words` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `word` varchar(64) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

delimiter ;;
create procedure idata()
begin
  declare i int;
  set i=0;
  while i<10000 do
    insert into words(word) values(concat(char(97+(i div 1000)), char(97+(i % 1000 div 100)), char(97+(i % 100 div 10)), char(97+(i % 10))));
    set i=i+1;
  end while;
end;;
delimiter ;
call idata();
This inserts 10000 rows into the table; the task is to select three of those words at random.
Using order by rand() to pick the first three rows at random:
mysql> select word from words order by rand() limit 3;
Use the explain command to view the execution plan of this statement.
The Extra field shows Using temporary, meaning a temporary table is needed, and Using filesort, meaning a sort has to be performed. In short: a temporary table is needed, and a sort has to be done on it.
For an InnoDB table, a full-field sort reduces disk access and is therefore preferred.
For a memory table, though, going back to the table just means reading the row directly from memory by its position; it never causes extra disk access. Without that concern, the optimizer prefers whatever makes the rows used for sorting smaller. Therefore MySQL chooses rowid sort here.
The execution process of this statement is as follows:
- Create a temporary table. It uses the memory engine and has two fields: the first is a double, which for convenience we will call field R, and the second is varchar(64), which we will call field W. The table has no index.
- Scan the words table in primary-key order and take out every word value. For each one, call rand() to generate a random decimal greater than 0 and less than 1, and store the decimal and the word in fields R and W of the temporary table. At this point the scanned-row count is 10000.
- The temporary table now holds 10000 rows, and they must be sorted by field R on a memory temporary table that has no index.
- Initialize sort_buffer. It has two fields: one double and one integer.
- Read the R value and the position information from the memory temporary table row by row (why it is "position information" is explained later) and store them in the two fields of sort_buffer. This is a full scan of the memory temporary table, so the scanned-row count grows by 10000, to 20000.
- Sort sort_buffer by the value of R. This step involves no table access, so the scanned-row count does not increase.
- After the sort, take the position information of the first three results, fetch the corresponding word values from the memory temporary table, and return them to the client. This accesses three rows of the table, bringing the total scanned-row count to 20003.
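The scanned-row accounting in the steps above can be sketched as a small simulation (a toy model in Python, not MySQL internals; the `scanned` counter and function name are illustrative):

```python
import random

def order_by_rand_limit(words, k=3):
    """Simulate the rowid-sort flow of `order by rand() limit k`."""
    scanned = 0

    # Steps 1-2: full scan of `words`, building the memory temp table (R, W).
    temp = []
    for w in words:
        temp.append((random.random(), w))  # R = rand(), W = word
        scanned += 1

    # Steps 3-5: full scan of the temp table into sort_buffer as (R, pos).
    sort_buffer = []
    for pos, (r, _w) in enumerate(temp):
        sort_buffer.append((r, pos))
        scanned += 1

    # Step 6: sort by R; no table access, so `scanned` is unchanged.
    sort_buffer.sort()

    # Step 7: fetch the first k rows back from the temp table by position.
    result = []
    for _r, pos in sort_buffer[:k]:
        result.append(temp[pos][1])
        scanned += 1

    return result, scanned

words = [f"w{i}" for i in range(10000)]
picked, scanned = order_by_rand_limit(words)
print(scanned)  # 20003
```

Running it reproduces the 10000 + 10000 + 3 = 20003 count that the slow log reports.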
The slow query log can verify whether the scanned-row count we derived is correct:
# Query_time: 0.900376 Lock_time: 0.000347 Rows_sent: 3 Rows_examined: 20003 SET timestamp=1541402277; select word from words order by rand() limit 3;
Rows_examined: 20003 shows that 20003 rows were scanned while executing this statement, confirming the conclusion of our analysis.
Complete sorting execution flow chart:
The POS in the figure is the location information.
How does MySQL locate "one row of data"?
If you drop the primary key of an InnoDB table, is there then no primary key and no way to go back to the table?
- If you create a table without a primary key, or drop a table's primary key, InnoDB generates a 6-byte rowid to serve as the primary key.
This is the origin of the name of the rowid sort mode: rowid stands for whatever information each engine uses to uniquely identify a data row.
- For an InnoDB table with a primary key, the rowid is the primary key id;
- For an InnoDB table without a primary key, the rowid is generated by the system;
- The memory engine is not an index-organized table; in this case you can think of it as an array, so the rowid is simply the array subscript.
In summary: order by rand() uses a memory temporary table, and a memory temporary table is sorted by rowid.
Not every temporary table is a memory table. The parameter tmp_table_size limits the size of memory temporary tables; the default is 16MB. If a temporary table exceeds tmp_table_size, it is converted into a disk temporary table.
The engine used for disk temporary tables is InnoDB by default, controlled by the parameter internal_tmp_disk_storage_engine.
When a disk temporary table is used, the sort becomes a sort on an InnoDB table with no explicit index.
To reproduce this, set tmp_table_size to 1024, sort_buffer_size to 32768, and max_length_for_sort_data to 16:
set tmp_table_size=1024;
set sort_buffer_size=32768;
set max_length_for_sort_data=16;
/* enable optimizer_trace, effective for this thread only */
SET optimizer_trace='enabled=on';
/* run the statement */
select word from words order by rand() limit 3;
/* view the optimizer_trace output */
SELECT * FROM `information_schema`.`OPTIMIZER_TRACE`\G
The OPTIMIZER_TRACE results are as follows.
Because max_length_for_sort_data is set to 16, smaller than the defined length of the word field, sort_mode shows a rowid sort: rows consisting of the random-value field R and the rowid field take part in the sort.
The random value in field R takes 8 bytes, the rowid 6 bytes, and there are 10000 rows in total, giving 140000 bytes, which exceeds the 32768 bytes defined by sort_buffer_size. Yet the value of number_of_tmp_files is 0; are temporary files not needed after all?
This sort did not use temporary files because it adopts the priority queue sort algorithm introduced in MySQL 5.6. Next, let us see why the priority queue algorithm is used instead of the temporary-file algorithm, i.e. merge sort.
This SQL statement only needs the three rowids with the smallest R values. Merge sort would produce those first three values too, but by the time the algorithm finishes, all 10000 rows have been sorted.
In other words, the remaining 9997 rows also end up ordered, even though the query does not need them to be. Thinking about it this way, a lot of computation is wasted.
The priority queue algorithm obtains exactly the three smallest values:
- Of the 10000 (R, rowid) pairs to be sorted, take the first three rows and build a heap;
- (If your memory of data structures is vague, picture it for now as an array of three elements.)
- Take the next row (R', rowid') and compare it with the largest R currently in the heap. If R' is smaller than R, remove that (R, rowid) from the heap and replace it with (R', rowid');
- Repeat step 2 until the 10000th (R', rowid') has been compared.
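The steps above can be sketched with Python's `heapq` module (a min-heap, so R is stored negated to keep the current maximum on top; this is a sketch of the idea, not MySQL's implementation):

```python
import heapq
import random

def smallest_k(rows, k=3):
    """rows: iterable of (R, rowid). Keep only the k rows with the smallest R."""
    heap = []  # max-heap of size k, stored as (-R, rowid)
    for r, rowid in rows:
        if len(heap) < k:
            heapq.heappush(heap, (-r, rowid))
        elif r < -heap[0][0]:            # R' smaller than the current heap max
            heapq.heapreplace(heap, (-r, rowid))
    # Un-negate and sort the k survivors for presentation.
    return sorted((-nr, rowid) for nr, rowid in heap)

rows = [(random.random(), i) for i in range(10000)]
top3 = smallest_k(rows)
assert top3 == sorted(rows)[:3]  # same answer as sorting everything
```

Only a three-element heap is ever maintained, which is why no temporary files are needed.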
Schematic diagram of priority queue sorting process:
Figure 6 simulates six (R, rowid) rows and uses priority queue sort to find the three rows with the smallest R values. Throughout the sort, to obtain the current heap's maximum as quickly as possible, the maximum is always kept at the top of the heap, so this is a max-heap.
In the optimizer_trace result of Figure 5, the filesort_priority_queue_optimization section shows chosen=true, indicating that the priority queue sort algorithm was used. The process needs no temporary files, which is why the corresponding number_of_tmp_files is 0.
When the process finishes, the heap holds the three rows with the smallest R values among the 10000. Their rowids are then taken out in turn to fetch the word field from the temporary table.
select city, name, age from t where city='hangzhou' order by name limit 1000;
This statement is limit 1000. With the priority queue algorithm, the heap to maintain would hold 1000 (name, rowid) rows, exceeding the sort_buffer_size I set, so only the merge sort algorithm can be used.
Whatever kind of temporary table is used, order by rand() makes the computation very involved and requires scanning a great many rows, so the sort consumes a lot of resources.
If only one word needs to be chosen at random, what else can be done? One line of thinking:
- Get the maximum M and the minimum N of the primary key id in this table;
- Use a random function to generate a number between them: X = (M - N) * rand() + N;
- Take the first row whose id is not smaller than X.
Call this random algorithm 1 for now. Its execution statements:
mysql> select max(id),min(id) into @M,@N from t;
set @X= floor((@M-@N+1)*rand() + @N);
select * from t where id >= @X limit 1;
This method is very efficient: max(id) and min(id) need no index scan, and the select in step three can use the index to locate the row quickly, so you can consider it to scan only three rows. But the algorithm does not strictly satisfy the randomness requirement of the problem: there may be holes in the ids, so different rows are not picked with equal probability. It is not truly random.
For example, with four ids 1, 2, 4 and 5, the method above picks the row with id=4 with twice the probability of any other row.
What if the four ids are 1, 2, 40000 and 40001? Then this algorithm is essentially a bug.
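The skew is easy to verify by enumeration. The sketch below (hypothetical helper names) plays random algorithm 1 on the ids 1, 2, 4, 5 for every possible value of X:

```python
from bisect import bisect_left

def pick_for_x(ids, x):
    """Return the first id that is >= x (what `where id >= @X limit 1` does)."""
    return ids[bisect_left(ids, x)]

ids = [1, 2, 4, 5]
m, n = max(ids), min(ids)

# X = floor((M-N+1)*rand() + N) is uniform over {N, ..., M}; enumerate it.
hits = {}
for x in range(n, m + 1):
    picked = pick_for_x(ids, x)
    hits[picked] = hits.get(picked, 0) + 1

print(hits)  # {1: 1, 2: 1, 4: 2, 5: 1} -> id=4 is twice as likely
```

Both X=3 and X=4 land on the row with id=4, which is exactly the hole problem described above.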
To get strictly random results, you can use the following process:
- Get the number of rows in the whole table and call it C;
- Compute Y = floor(C * rand()); floor() here takes the integer part;
- Use limit Y,1 to fetch one row.
Call this random algorithm 2. The following code is the sequence of statements for the process above.
mysql> select count(*) into @C from t;
set @Y = floor(@C * rand());
set @sql = concat("select * from t limit ", @Y, ",1");
prepare stmt from @sql;
execute stmt;
DEALLOCATE prepare stmt;
Because the parameters of limit cannot be variables, the code above uses the prepare + execute method. You could also splice the SQL statement together in the application, which is simpler.
Random algorithm 2 fixes the obvious uneven-probability problem of algorithm 1.
MySQL processes limit Y,1 by reading rows one by one in order, discarding the first Y, and returning the next row as the result, so this step scans Y+1 rows. Adding the C rows scanned in the first step gives C+Y+1 rows in total: a higher execution cost than random algorithm 1.
Still, the execution cost of random algorithm 2 is far lower than that of a direct order by rand().
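As a sketch, random algorithm 2 can be tried end to end against SQLite from Python (SQLite is used only because it ships with Python; unlike MySQL, SQLite happens to accept a bound parameter for the offset, so no prepare/concat dance is needed):

```python
import math
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer primary key, word text)")
conn.executemany("insert into t values (?, ?)",
                 [(i, f"w{i}") for i in range(1, 10001)])

# Random algorithm 2: count the rows, then fetch row number Y (0-based).
c = conn.execute("select count(*) from t").fetchone()[0]
y = math.floor(c * random.random())
row = conn.execute("select * from t limit 1 offset ?", (y,)).fetchone()
print(row)
```

Every row is picked with probability exactly 1/C, regardless of holes in the id sequence.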
If we follow the idea of random algorithm 2 to pick three word values at random:
- Get the number of rows in the whole table and call it C;
- Compute Y1, Y2 and Y3 with the same random method;
- Execute three limit Y,1 statements to fetch three rows of data.
Call this random algorithm 3. The following code is the sequence of statements for the process above.
mysql> select count(*) into @C from t;
set @Y1 = floor(@C * rand());
set @Y2 = floor(@C * rand());
set @Y3 = floor(@C * rand());
select * from t limit @Y1,1; // take the Y1, Y2, Y3 values in application code, splice the SQL, then execute
select * from t limit @Y2,1;
select * from t limit @Y3,1;
Fetching the required data through three limit Y,1 statements works; can it be optimized further?
Take the largest of Y1, Y2 and Y3 and call it M, the smallest and call it N, then execute the following SQL statement:
mysql> select * from t limit N, M-N+1;
With this scheme, the total number of scanned rows is only C + M + 1.
Alternatively, retrieve only the id values first, determine the three ids in the application, and then execute where id=X three times.
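A sketch of this one-range variant in plain Python (the table is modeled as a list; function and variable names are illustrative). Note one caveat: if two of the Y values collide, the same row is returned twice:

```python
import math
import random

def pick3_one_range(rows):
    """rows: the table as an ordered list. One count(*) plus one
    `limit N, M-N+1` range scan instead of three separate limit queries."""
    c = len(rows)
    ys = sorted(math.floor(c * random.random()) for _ in range(3))
    n, m = ys[0], ys[2]
    window = rows[n:m + 1]            # select * from t limit N, M-N+1
    scanned = c + m + 1               # C rows for count(*), M+1 for the range
    picked = [window[y - n] for y in ys]
    return picked, scanned

table = [(i, f"w{i}") for i in range(10000)]
picked, scanned = pick3_one_range(table)
print(len(picked), scanned)
```

Compared with three separate limit Y,1 scans, the range fetch pays for the largest offset only once.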
Now suppose we maintain a trading system whose transaction log table tradelog contains the transaction id (tradeid), the operator id (operator), the modification time (t_modified), and other fields that we will ignore. The table definition is as follows:
mysql> CREATE TABLE `tradelog` (
  `id` int(11) NOT NULL,
  `tradeid` varchar(32) DEFAULT NULL,
  `operator` int(11) DEFAULT NULL,
  `t_modified` datetime DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `tradeid` (`tradeid`),
  KEY `t_modified` (`t_modified`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Assuming all data from the beginning of 2016 to the end of 2018 has been recorded, the operations department wants to count the total number of transactions made in July of any year:
mysql> select count(*) from tradelog where month(t_modified)=7;
Because the t_modified field has an index, this statement was run on the production library, but it took a long time to return the result.
Here is a diagram of the t_modified index. The numbers above the boxes are the values of the month() function.
If your SQL condition were where t_modified='2018-7-1', the engine would follow the green arrow above to quickly locate the t_modified='2018-7-1' results.
In fact, the fast-location ability that the B+ tree provides comes from the ordering of sibling nodes on the same level.
But once you apply the month() function, when 7 is passed in, the engine no longer knows which way to go at the first level of the tree.
Note that the optimizer does not abandon this index entirely.
What it abandons in this example is the tree-search capability; it can still choose between traversing the primary key index and traversing the index t_modified. Comparing their sizes, the optimizer finds t_modified smaller, so traversing it is faster than traversing the primary key index, and index t_modified is chosen in the end.
Use the explain command to view the execution plan of this SQL statement.
key="t_modified" shows that the t_modified index is used; I inserted 100000 rows of test data, and rows=100335 shows that this statement scans the entire index; Using index in the Extra field means a covering index is used.
Because the month() function is applied to the t_modified field, the statement degrades to a full index scan. To use the index's fast-location ability, the SQL must be rewritten as a range query on the bare field. Written as below, the optimizer can use the fast location of the t_modified index as we expect:
mysql> select count(*) from tradelog where
    (t_modified >= '2016-7-1' and t_modified < '2016-8-1') or
    (t_modified >= '2017-7-1' and t_modified < '2017-8-1') or
    (t_modified >= '2018-7-1' and t_modified < '2018-8-1');
Of course, if the system went live earlier, or data for later years was inserted afterwards, the other years need to be added to the condition as well.
Because of the month() function, MySQL can no longer use the index's fast location and can only fall back to a full index scan.
The optimizer really is "lazy" on this point: even for operations that do not change the ordering, it will not consider using the index. For example, in select * from tradelog where id + 1 = 10000, adding 1 does not change the ordering, yet the MySQL optimizer still cannot use the id index to quickly locate the row 9999. You have to rewrite the SQL by hand as where id = 10000 - 1.
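The same behavior is easy to observe in SQLite (used here only because it ships with Python; it is the principle, not MySQL's optimizer, that carries over): an expression on the indexed column forces a scan, while the rewritten form allows an index search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tradelog (id integer primary key, operator int)")
conn.executemany("insert into tradelog values (?, 1000)",
                 [(i,) for i in range(1, 101)])

def plan(sql):
    """Concatenate the detail column of EXPLAIN QUERY PLAN output."""
    return " ".join(r[3] for r in conn.execute("explain query plan " + sql))

# Arithmetic on the indexed column: the optimizer falls back to a scan.
p_scan = plan("select * from tradelog where id + 1 = 10000")
# Bare column compared with a constant: an index search is possible.
p_search = plan("select * from tradelog where id = 10000 - 1")
print(p_scan)
print(p_search)
```

The first plan reports a SCAN of the table, the second a SEARCH using the primary key.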
mysql> select * from tradelog where tradeid=110717;
The tradeid field has an index, yet the explain result shows this statement needs a full table scan. You may also notice that the type of tradeid is varchar(32) while the input parameter is an integer, so a type conversion is needed.
Here are two questions
- What are the rules of data type conversion?
- Why do you need full index scanning for data type conversion?
There are many data types in a database and even more conversion rules between them; how can we keep track?
A simple method: look at the result of select "10" > 9.
- If the rule is "convert the string to a number", the comparison is numeric, and the result should be 1;
- If the rule is "convert the number to a string", the comparison is a string comparison, and the result should be 0.
select "10" > 9 returns 1, which confirms MySQL's conversion rule: when a string is compared with a number, the string is converted to a number.
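The two candidate rules can be mimicked in Python to see why the result distinguishes them (a plain illustration, not MySQL code):

```python
# Rule A, "convert the string to a number": numeric comparison, true (1).
string_to_number = int("10") > 9
# Rule B, "convert the number to a string": lexicographic comparison,
# and "10" < "9" because '1' sorts before '9', so the result is false (0).
number_to_string = "10" > str(9)
print(string_to_number, number_to_string)  # True False
```

Since MySQL returns 1, it must be following rule A.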
Full table scan statement:
mysql> select * from tradelog where tradeid=110717;
For the optimizer, this statement is equivalent to:
mysql> select * from tradelog where CAST(tradeid AS signed int) = 110717;
In other words, this statement applies a function operation to the indexed field, triggering the optimizer to abandon the tree-search capability.
Suppose the system has another table, trade_detail, recording the operation details of each transaction. For quantitative analysis and reproduction, insert some data into both the transaction log table tradelog and the detail table trade_detail:
mysql> CREATE TABLE `trade_detail` (
  `id` int(11) NOT NULL,
  `tradeid` varchar(32) DEFAULT NULL,
  `trade_step` int(11) DEFAULT NULL, /* operation step */
  `step_info` varchar(32) DEFAULT NULL, /* step information */
  PRIMARY KEY (`id`),
  KEY `tradeid` (`tradeid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

insert into tradelog values(1, 'aaaaaaaa', 1000, now());
insert into tradelog values(2, 'aaaaaaab', 1000, now());
insert into tradelog values(3, 'aaaaaaac', 1000, now());
insert into trade_detail values(1, 'aaaaaaaa', 1, 'add');
insert into trade_detail values(2, 'aaaaaaaa', 2, 'update');
insert into trade_detail values(3, 'aaaaaaaa', 3, 'commit');
insert into trade_detail values(4, 'aaaaaaab', 1, 'add');
insert into trade_detail values(5, 'aaaaaaab', 2, 'update');
insert into trade_detail values(6, 'aaaaaaab', 3, 'update again');
insert into trade_detail values(7, 'aaaaaaab', 4, 'commit');
insert into trade_detail values(8, 'aaaaaaac', 1, 'add');
insert into trade_detail values(9, 'aaaaaaac', 2, 'update');
insert into trade_detail values(10, 'aaaaaaac', 3, 'update again');
insert into trade_detail values(11, 'aaaaaaac', 4, 'commit');
To query all operation steps of the transaction with id=2:
mysql> select d.* from tradelog l, trade_detail d where d.tradeid=l.tradeid and l.id=2; /* statement Q1 */
- The first row shows that the optimizer first finds the row with id=2 in the tradelog table, using the primary key index; rows=1 means only one row is scanned;
- In the second row, key=NULL shows that the tradeid index on the detail table trade_detail is not used; a full table scan is performed.
In this execution plan, the tradeid field is taken from the tradelog table first and then used to look up matching rows in the trade_detail table. Accordingly, tradelog is called the driving table, trade_detail the driven table, and tradeid the join field.
The execution process represented by the explain result is as follows:
In the picture:
- Step 1: find the row L2 in the tradelog table by id;
- Step 2: take the value of the tradeid field out of L2;
- Step 3: use that tradeid value to look up matching rows in the trade_detail table. key=NULL in the second line of the explain result shows that this step is done by traversing the primary key index and checking one by one whether the tradeid values match.
Here step 3 does not meet expectations. The tradeid field in trade_detail has an index, and we hoped the tradeid index would quickly locate the matching rows, but it is not used. The reason is that the two tables use different character sets, one utf8 and the other utf8mb4, so the index on the join field cannot be used in the join query.
Why can’t indexes be used when character sets are different?
The problem lies in step 3 of the execution. Expressed as a standalone SQL statement, that step is:
mysql> select * from trade_detail where tradeid=$L2.tradeid.value;
where the character set of $L2.tradeid.value is utf8mb4.
The character set utf8mb4 is a superset of utf8, so when comparing strings of these two types, MySQL's internal operation is to convert the utf8 string to utf8mb4 first and then compare.
The same logic appears in programming languages: automatic type conversion goes "in the direction of increasing data length", to avoid data errors caused by truncation during the conversion.
Therefore, when executing the statement above, each tradeid field in the driven table has to be converted to utf8mb4 one by one before being compared with L2.
In fact, this statement is equivalent to the following writing:
select * from trade_detail where CONVERT(tradeid USING utf8mb4)=$L2.tradeid.value;
The CONVERT() function here converts the input string into the utf8mb4 character set.
This triggers the principle again: applying a function operation to the indexed field makes the optimizer abandon the tree-search capability.
The character-set difference is only one of the triggering conditions. The direct cause of the full table scan on the driven table is that the join process applies a function operation to the driven table's indexed field.
For comparison and verification, consider: "for the operation with id=4 in the trade_detail table, who is the corresponding operator?" Look at this statement and its execution plan:
mysql>select l.operator from tradelog l , trade_detail d where d.tradeid=l.tradeid and d.id=4;
In this statement, the trade_detail table becomes the driving table, and the second row of the explain result shows that this query uses the index on the driven table tradelog, scanning only one row.
This is also a join on the two tradeid fields; why can the driven table's tradeid index be used this time?
Suppose the row with id=4 in the driving table trade_detail is denoted R4. During the join (step 3 in Figure 5), the statement executed against the driven table tradelog is:
select operator from tradelog where tradeid=$R4.tradeid.value;
This time the character set of $R4.tradeid.value is utf8. Following the character-set conversion rules, it is converted to utf8mb4, so the process is rewritten as:
select operator from tradelog where tradeid=CONVERT($R4.tradeid.value USING utf8mb4);
Here CONVERT() is applied to the input parameter, so the tradeid index on the driven table can be used.
select d.* from tradelog l, trade_detail d where d.tradeid=l.tradeid and l.id=2;
There are two ways to optimize this statement:
- The common optimization is to change the character set of the tradeid field in the trade_detail table to utf8mb4 as well, so the character-set conversion problem disappears:
alter table trade_detail modify tradeid varchar(32) CHARACTER SET utf8mb4 default null;
- Modifying the field's character set is best when possible. But if the data volume is large, or the business cannot run the DDL for the time being, only the SQL statement can be modified:
mysql> select d.* from tradelog l, trade_detail d where d.tradeid=CONVERT(l.tradeid USING utf8) and l.id=2;
By actively converting l.tradeid to utf8, this avoids the character-encoding conversion on the driven table's field.
The table structure is as follows:
mysql> CREATE TABLE `table_a` (
  `id` int(11) NOT NULL,
  `b` varchar(10) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `b` (`b`)
) ENGINE=InnoDB;
Suppose there are 1 million rows in the table, and in 100000 of them the value of b is '1234567890'. Now suppose the statement is written like this:
mysql> select * from table_a where b='1234567890abcd';
How will MySQL execute it?
Ideally, MySQL would see that field b is defined as varchar(10) and return an empty result immediately. But MySQL does not do that.
This SQL statement executes very slowly. The flow is as follows:
- When the parameter is passed to the engine, it is truncated: the column is defined with length 10, so only the first 10 bytes, '1234567890', are used for matching;
- That leaves 100000 rows of data satisfying the condition;
- Because the statement is select *, 100000 table lookups have to be done;
- But after each lookup, the whole row is fetched and the server layer finds that the value of b is not '1234567890abcd';
- The returned result is empty.
So even though the execution path goes through this implicit function operation, once the rows come back the server layer still has to do a final round of checking.
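The five steps can be condensed into a toy simulation (illustrative names; the point is that the engine matches on the truncated value while the server layer re-checks the full one):

```python
def query_with_truncated_param(table_b_values, param, col_len=10):
    """Simulate: engine matches the truncated value, server re-checks the full one."""
    engine_key = param[:col_len]        # engine sees only the first 10 chars
    returned_to_server = 0
    result = []
    for b in table_b_values:
        if b == engine_key:             # index match inside the engine
            returned_to_server += 1     # one table lookup per match
            if b == param:              # server-layer re-check on the full value
                result.append(b)
    return result, returned_to_server

rows = ["1234567890"] * 100000
result, lookups = query_with_truncated_param(rows, "1234567890abcd")
print(result, lookups)  # [] 100000
```

100000 lookups are paid for a query whose final result is empty.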
Construct a table with two fields, id and c, and insert 100000 rows of records:
mysql> CREATE TABLE `t` (
  `id` int(11) NOT NULL,
  `c` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

delimiter ;;
create procedure idata()
begin
  declare i int;
  set i=1;
  while(i<=100000)do
    insert into t values(i,i);
    set i=i+1;
  end while;
end;;
delimiter ;
call idata();
mysql> select * from t where id=1;
The query does not return for a long time.
In this situation, the most likely cause is that table t is locked.
Use the show processlist command and see the Waiting for table metadata lock state.
This state means that some thread is requesting or holding the MDL write lock on table t, blocking the select statement.
MySQL 5.7 modified the MDL locking policy, so the steps of this scenario are reproduced on MySQL 5.7.
Session A holds the MDL write lock on table t via the lock table command, while session B's query needs the MDL read lock, so session B enters the waiting state.
The way to handle this kind of problem is to find out who holds the MDL write lock and kill it.
In the show processlist output, session A's Command column shows "Sleep", which makes it very inconvenient to find. But with the performance_schema and sys system libraries it is much easier. (performance_schema=ON must be set when MySQL starts; compared with setting it to OFF, this costs about 10% of performance.)
By querying the sys.schema_table_lock_waits table, we can directly find the process id causing the block, then break its connection with the kill command.
On table t, execute the following SQL statement:
mysql> select * from information_schema.processlist where id=1;
It turns out that the state of this thread is Waiting for table flush.
This state means a thread is about to flush table t. In MySQL, there are two usages for flushing tables:
flush tables t with read lock;
flush tables with read lock;
Of these two flush statements, the one specifying table t closes only table t; the one without a table name closes every open table in MySQL.
Normally both statements execute very quickly, unless they are themselves blocked by another thread.
So the likely cause of the Waiting for table flush state is: a flush tables command was blocked by another statement, and it in turn blocked our select statement.
Steps to reproduce:
In session A, I deliberately call sleep(1) once per row, so the statement runs for 100000 seconds by default. During that time, table t is kept "open" by session A. Session B's flush tables t command then has to wait for session A's query to end before it can close and reopen table t. And if session C issues another query now, it will be blocked by the flush command.
Figure 7 shows the show processlist result of this replay step.
Having worked through table-level locks, our select statement finally reaches the engine.
mysql> select * from t where id=1 lock in share mode;
Accessing the record with id=1 requires a read lock, so if a transaction already holds a write lock on that record, our select statement will be blocked.
Reproduction steps and scene:
Session A starts a transaction, takes the write lock, and does not commit; that is why session B is blocked.
How do we find out who holds the lock? On MySQL 5.7, it can be found in the sys.innodb_lock_waits table:
mysql> select * from sys.innodb_lock_waits where locked_table='`test`.`t`'\G
As you can see, the information is very complete: thread 4 is the cause of the blocking. The ways to remove the culprit are kill query 4 or kill 4.
However, kill query 4 should not actually be shown here. That command stops the statement thread 4 is currently executing, which is useless in this case: the row lock is held by an update statement that finished executing long ago. Issuing kill query now cannot make the transaction release the row lock on id=1.
Only kill 4 is effective, that is, disconnecting the connection directly. The implicit logic is that when a connection is broken, the transaction executing in it is automatically rolled back, releasing the row lock on id=1.
Examples of slow queries:
mysql> select * from t where c=50000 limit 1;
Since there is no index on field c, this statement can only scan in primary-key id order and must scan 50000 rows.
As confirmation, look at the slow query log. Note that to record all statements in the slow log, I executed set long_query_time=0 after connecting, setting the slow-log time threshold to 0.
Rows_examined shows that 50000 rows were scanned, taking 11.5 ms to return. In production, slow-query alerting is usually configured at more than 1 second. But note: a bad query is not necessarily a slow query. Our example has only 100000 rows of records; with a larger data volume, the execution time would grow linearly. Many scanned rows mean slow execution.
Next, a statement that scans only one row yet executes slowly.
Figure 12 shows the slow log of this example. As you can see, the executed statement is
mysql> select * from t where id=1;
Although it scans only one row, the execution time is as long as 800 ms.
Scrolling the slow-log screenshot down a little reveals the next statement, select * from t where id=1 lock in share mode. It also scans one row during execution, but takes only 0.2 ms.
The first statement's query result is c=1, while the statement with lock in share mode returns c=1000001.
Session A starts a transaction with the start transaction with consistent snapshot command, and then session B starts executing update statements.
What is the state of the row id=1 after session B executes the update statement one million times?
Session B updates the row 1 million times and generates 1 million undo logs.
The SQL statement with lock in share mode is a current read, so it directly reads the latest committed value 1000001, which is why it is very fast; whereas select * from t where id=1 is a consistent read, so it has to start from 1000001, apply the undo logs one after another, and only after executing 1 million of them does it return the result 1.
Note that the undo log actually records operation logic such as "change 2 back to 1" and "change 3 back to 2"; the figure draws it as "minus 1" only for readability.
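The scenario itself can be sketched as the following two-session sequence (the million updates would in practice be driven by a loop or stored procedure):

/* session A */ start transaction with consistent snapshot;
/* session B */ update t set c=c+1 where id=1;  -- executed 1,000,000 times, each autocommitted
/* session A */ select * from t where id=1;                     -- consistent read: replays 1,000,000 undo logs, returns c=1
/* session A */ select * from t where id=1 lock in share mode;  -- current read: returns c=1000001 immediately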
Create table and initialize statement:
CREATE TABLE `t` (
  `id` int(11) NOT NULL,
  `c` int(11) DEFAULT NULL,
  `d` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `c` (`c`)
) ENGINE=InnoDB;

insert into t values(0,0,0),(5,5,5),(10,10,10),(15,15,15),(20,20,20),(25,25,25);
In addition to the primary key id, this table also has an index on c. The initialization statement inserts six rows of data into the table.
Question: how is the following statement sequence locked and when is the lock released?
begin; select * from t where d=5 for update; commit;
This statement hits the row with d=5, whose primary key is id=5. So after the select statement executes, a write lock is added to the row id=5. Moreover, due to the two-phase locking protocol, this write lock is released only when the commit statement executes.
Since there is no index on field d, this query does a full table scan. So, will the other 5 scanned rows that do not satisfy the condition be locked as well?
The default transaction isolation level of InnoDB is repeatable read, so unless otherwise stated, the rest of this article assumes the repeatable read isolation level.
First, consider what happens if only the row id=5 is locked and the other rows are not.
Suppose a scenario:
Session A performs three queries, Q1, Q2 and Q3, whose SQL statements are identical: select * from t where d=5 for update. The meaning of this statement should be very clear to you: scan all rows with d=5 using a current read, and add write locks to them.
What results will these three SQL statements return?
- Q1 returns only the row id=5;
- At T2, session B changes the d value of the row id=0 to 5, so at T3, Q2 finds the two rows id=0 and id=5;
- At T4, session C inserts another row (1,1,5), so at T5, Q3 finds the three rows id=0, id=1 and id=5.
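Laid out as a timeline, the scenario above reads as follows (T1–T5 mark the execution order):

/* T1, session A */ begin; select * from t where d=5 for update;   -- Q1
/* T2, session B */ update t set d=5 where id=0;
/* T3, session A */ select * from t where d=5 for update;          -- Q2
/* T4, session C */ insert into t values(1,1,5);
/* T5, session A */ select * from t where d=5 for update;          -- Q3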
The phenomenon that Q3 reads the row id=1 is called a "phantom read". That is, a phantom read happens when a transaction queries the same range twice and the later query sees rows that the earlier query did not see.
Two clarifications about "phantom read":
- Under the repeatable read isolation level, ordinary queries are snapshot reads and do not see data inserted by other transactions. Therefore, phantom reads appear only under "current reads".
- The modification made by session B above is seen by session A's later "current read" select statements, but that cannot be called a phantom read. Phantom read refers only to "newly inserted rows".
If you analyze them with the transaction visibility rules, the results returned by these three SQL statements are all correct.
Because for update is added to all three queries, they are all current reads. The rule of the current read is that it can read the latest values of all committed records. Moreover, the statements of session B and session C are committed right after execution, so Q2 and Q3 should indeed see the effects of those two transactions; this does not contradict the visibility rules of transactions. But there is still a problem here.
At T1, session A declared: "I want to lock all rows with d=5, and other transactions may neither read nor write them." In fact, this semantics has been broken.
Now add one more SQL statement to each of session B and session C, and see what happens.
The second statement of session B, update t set c=5 where id=0, has the semantics "I changed the c value of the row with id=0 and d=5 to 5".
At T1, session A only locked the row id=5, not the row id=0. Therefore, session B can execute these two update statements at T2. This breaks the lock declaration of session A's Q1, which claimed to lock all rows with d=5.
The same goes for session C: its change to the row id=1 also breaks Q1's lock declaration.
The problem of data consistency
Locks are designed to guarantee data consistency. And this consistency means not only the consistency of the database's internal data state at any moment, but also the logical consistency between the data and the logs.
To illustrate this problem, let's add an update statement to session A at T1, namely:
update t set d=100 where d=5;
The locking semantics of update is consistent with that of select ... for update, so adding this update statement at this point is perfectly reasonable. Session A's statement says "lock the rows with d=5" precisely because it wants to update the data. Judging from the newly added update, the data itself has no problem. But let's take a look at the content of the binlog at this moment.
This update statement changes the d value of the rows it believes it has locked to 100.
Figure 3 shows the results in the database after execution.
- After T1, the row id=5 becomes (5,5,100); of course, this result is finally committed at T6;
- After T2, the row id=0 becomes (0,5,5);
- After T4, the table has one more row (1,5,5);
- The other rows are unaffected by this execution sequence and remain unchanged.
Looked at this way, the data is fine. But now let's look at the content of the binlog.
- At T2, Session B commits the transaction and writes two statements;
- At T4, session C commits the transaction and writes two statements;
- At T6, session A commits the transaction and writes the statement update t set d=100 where d=5.
If put together, it is:
update t set d=5 where id=0; /*(0,0,5)*/
update t set c=5 where id=0; /*(0,5,5)*/
insert into t values(1,1,5); /*(1,1,5)*/
update t set c=5 where id=1; /*(1,5,5)*/
update t set d=100 where d=5; /* change d to 100 for all rows with d=5 */
Whether this statement sequence is replayed on the standby database or later used with the binlog to clone a new database, the results for these three rows all become (0,5,100), (1,5,100) and (5,5,100).
That is to say, the rows id=0 and id=1 end up inconsistent. This is a serious problem.
How is this data inconsistency introduced?
Suppose that select * from t where d=5 for update only locks the row with d=5, that is, the row id=5.
Clearly, this setting is unreasonable and needs to be changed.
So let's add write locks to all the rows encountered during the scan, and look at the execution effect again.
Since session A locks all rows, session B is blocked when executing its first update statement, and has to wait until session A commits at T6 before it can continue.
So for the row id=0, the final result in the database is (0,5,5). And in the binlog, the execution sequence is as follows:
insert into t values(1,1,5); /*(1,1,5)*/
update t set c=5 where id=1; /*(1,5,5)*/
update t set d=100 where d=5; /* change d to 100 for all rows with d=5 */
update t set d=5 where id=0; /*(0,0,5)*/
update t set c=5 where id=0; /*(0,5,5)*/
Executed in log order, the final result for the row id=0 is (0,5,5). So the problem with the row id=0 is solved.
But at the same time, you can see that the result for id=1 in the database is (1,5,5), while the result of replaying the binlog is (1,5,100): the phantom read problem is still not solved. Why, with all the records locked, can the row id=1 still be inserted and then updated?
The reason is simple. At T3, when we lock all the rows, the row id=1 does not exist yet, so it cannot be locked.
In other words, even locking all the existing records cannot stop newly inserted records.
The root cause of phantom reads is that row locks can only lock rows, while what inserting a new record occupies is the "gap" between existing records. Therefore, to solve the phantom read problem, InnoDB had to introduce a new kind of lock, the gap lock.
A gap lock locks the gap between two values. For example, in table t at the beginning of the article, the initialization inserts 6 records, which produces 7 gaps.
When you execute select * from t where d=5 for update, you not only add row locks to the 6 existing records in the database, but also add 7 gap locks at the same time. This ensures that no new record can be inserted.
In other words, while scanning each row, not only is the row itself locked, but the gaps on both sides of it are locked as well.
Data rows are entities that can be locked, and the gaps between data rows are entities that can be locked too. But gap locks do not behave like the locks we have met before.
For example, row locks are divided into read locks and write locks. The following figure shows the conflict relationship between these two kinds of row locks.
In other words, what conflicts with a row lock is "another row lock".
But gap locks are different. What conflicts with a gap lock is the operation of "inserting a record into the gap". There is no conflict between gap locks themselves.
Session B is not blocked here. Because table t has no record with c=7, session A adds the gap lock (5,10). Session B adds a gap lock on this same gap. The two share a common goal, namely protecting this gap so that no value can be inserted into it, but they do not conflict with each other.
Gap locks and row locks are collectively called next-key locks, and each next-key lock is a left-open, right-closed interval. That is to say, after our table t is initialized, if we execute select * from t for update to lock all the records of the whole table, we form 7 next-key locks: (-∞,0], (0,5], (5,10], (10,15], (15,20], (20,25] and (25,+supremum].
Note: unless stated otherwise, we write gap locks as open intervals and next-key locks as left-open, right-closed intervals.
Where does this +supremum come from?
- Since +∞ is an open boundary, InnoDB in its implementation gives each index a nonexistent maximum value called supremum, which keeps everything consistent with what we said earlier: "they are all left-open, right-closed intervals".
The introduction of gap locks and next-key locks helps us solve the phantom read problem, but at the same time it also brings some "trouble".
Example: lock an arbitrary row; if it does not exist, insert it, and if it exists, update its data. The code is as follows:
begin;
select * from t where id=N for update;
/* if the row does not exist */
insert into t values(N,N,N);
/* if the row exists */
update t set d=N where id=N;
commit;
Some would use insert ... on duplicate key update instead, but when the table has multiple unique keys, that approach cannot meet the requirement.
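For completeness, the single-statement form for a table whose only unique key is the primary key would look like this (a sketch, with N standing for the concrete value):

mysql> insert into t values(N,N,N) on duplicate key update d=N;

With a second unique key, an insert that conflicts on multiple indexes updates only one row, and not necessarily the intended one, which is why this statement cannot replace the lock-check-insert logic above.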
Once this logic runs concurrently, it can hit a deadlock. You might wonder: locking with for update before every operation is already the strictest mode, so how can there still be a deadlock?
Two sessions are used to simulate concurrency, assuming N=9.
In fact, the deadlock forms even before the subsequent update statement is reached. Let's analyze it in statement execution order:
- When session A executes the select ... for update statement, since the row id=9 does not exist, the gap lock (5,10) is added;
- When session B executes the select ... for update statement, the gap lock (5,10) is also added; gap locks do not conflict with each other, so this statement succeeds;
- Session B tries to insert the row (9,9,9), but is blocked by session A's gap lock, so it has to wait;
- Session A tries to insert the row (9,9,9), and is blocked by session B's gap lock.
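Laid out as two sessions:

/* session A */ begin; select * from t where id=9 for update;   -- adds gap lock (5,10)
/* session B */ begin; select * from t where id=9 for update;   -- also adds gap lock (5,10); no conflict
/* session B */ insert into t values(9,9,9);                    -- blocked by session A's gap lock
/* session A */ insert into t values(9,9,9);                    -- blocked by session B's gap lock: deadlock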
The two sessions wait for each other, forming a deadlock. Of course, InnoDB's deadlock detection immediately discovers this deadlock relationship and makes session A's insert statement report an error and return.
The introduction of gap locks may cause the same statement to lock a larger range, which actually hurts concurrency.
If the isolation level is set to read committed, there are no gap locks. But at the same time, you need to set the binlog format to row to solve the possible inconsistency between data and logs. This is also the configuration combination used by many companies.
As for the combination of the read committed isolation level plus binlog_format=row: if read committed is sufficient, that is, the business does not need the guarantees of repeatable read, then considering that the locking range of data operations under read committed is smaller (no gap locks), this choice is reasonable.
- Later versions of MySQL may change the locking policy, so this rule only applies to the latest versions as of now, that is, the 5.x series up to 5.7.24 and the 8.0 series up to 8.0.13.
Because gap locks are only effective under the repeatable read isolation level, the rest of this article, unless otherwise stated, assumes the repeatable read isolation level by default.
- Principle 1: the basic unit of locking is the next-key lock. Remember that a next-key lock is a left-open, right-closed interval.
- Principle 2: only the objects accessed during the search process are locked.
- Optimization 1: for an equality query on an index, when locking a unique index, the next-key lock degenerates into a row lock.
- Optimization 2: for an equality query on an index, when the rightward traversal reaches the last value that does not satisfy the equality condition, the next-key lock degenerates into a gap lock.
- One bug: a range query on a unique index accesses up to the first value that does not satisfy the condition.
The table creation statement and initialization statement of table t are as follows:
CREATE TABLE `t` (
  `id` int(11) NOT NULL,
  `c` int(11) DEFAULT NULL,
  `d` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `c` (`c`)
) ENGINE=InnoDB;

insert into t values(0,0,0),(5,5,5),(10,10,10),(15,15,15),(20,20,20),(25,25,25);
Case: equality query and the gap lock.
Since table t has no record with id=7, use the locking rules above to judge:
- According to principle 1, the unit of locking is the next-key lock, so session A's locking range is (5,10];
- Meanwhile, according to optimization 2, this is an equality query (id=7), and id=10 does not satisfy the query condition, so the next-key lock degenerates into a gap lock, and the final locking range is (5,10).
Therefore, if session B wants to insert a record with id=8 into this gap, it will be blocked, while session C can still modify the row id=10.
Case: equality lock on a non-unique index (covering index).
Here session A adds a read lock to the row c=5 on index c.
- According to principle 1, the unit of locking is the next-key lock, so (0,5] gets a next-key lock.
- Note that c is an ordinary (non-unique) index, so merely accessing the record c=5 is not enough to stop; the scan must continue to the right, and only gives up when it finds c=10. According to principle 2, everything accessed must be locked, so (5,10] also gets a next-key lock.
- But at the same time, this matches optimization 2: it is an equality query, and the rightward traversal ends at a value that does not satisfy c=5, so the lock degenerates into the gap lock (5,10).
- According to principle 2, only the accessed objects are locked. This query uses the covering index and does not need to access the primary key index, so no lock is added on the primary key index at all. This is why session B's update statement can be executed.
However, if session C wants to insert the record (7,7,7), it is blocked by session A's gap lock (5,10).
Note that in this example, lock in share mode only locks the covering index, but for update is different: when you execute for update, the system assumes you are going to update the data next, so it also adds write locks to the satisfying rows on the primary key index.
This example shows that locks are added on indexes. At the same time, the guidance it gives us is: if you want to use lock in share mode to add a read lock to a row to prevent the data from being updated, you must bypass the covering index optimization by adding to the select list a field that is not present in the index. For example, change session A's query to select d from t where c=5 lock in share mode.
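Concretely, the rewritten statement forces a lookup back to the primary key index, so the row there is locked as well:

mysql> select d from t where c=5 lock in share mode;   -- d is not in index c, so the covering-index optimization no longer applies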
For table t, do the following two query statements have the same locking range?
mysql> select * from t where id=10 for update;
mysql> select * from t where id>=10 and id<11 for update;
Logically, these two queries are certainly equivalent, but their locking rules are different. Now let session A execute the second query and look at the locking effect.
Use the locking rules above to analyze what locks session A adds:
- At the start of execution, the first row with id=10 must be found, so there should be the next-key lock (5,10]. According to optimization 1, the equality condition on the primary key id degenerates it to a row lock, so only the row lock on id=10 is added.
- Since this is a range search, the scan continues until the row id=15 is found, and then stops. Therefore, the next-key lock (10,15] must also be added.
So, session A's locks are all on the primary key index: the row lock on id=10 and the next-key lock (10,15].
Note that the first time session A locates the row id=10, it is judged as an equality query, while the rightward scan to id=15 is judged as a range query.
Compare case 3 with the following example of range query locking.
Note that unlike case 3, case 4 uses field c in the where clause of the query statement.
This time session A's condition is on field c. The only difference in locking from case 3 is this: when c=10 is used to locate the record for the first time, the next-key lock (5,10] is added on index c, because index c is not a unique index and there is no degeneration rule, that is, it does not degenerate into a row lock. Therefore, the locks session A finally adds are the two next-key locks (5,10] and (10,15] on index c.
So, as a result, session B is blocked when it tries to execute the insert statement for (8,8,8).
It is reasonable for the scan to stop only at c=15, because InnoDB must scan to c=15 before it knows that it does not need to continue.
The previous four cases used the two principles and two optimizations of the locking rules. Next, let's look at a case concerning the bug in the locking rules.
Session A is a range query. According to principle 1, only the next-key lock (10,15] should be added on the index id; since id is a unique key, the loop should stop once it judges the row id=15.
But in the implementation, InnoDB scans forward to the first row that does not satisfy the condition, that is, id=20. Moreover, since this is a range scan, the next-key lock (15,20] on the index id is locked as well.
Therefore, if session B wants to update the row id=20, it is blocked. Similarly, if session C wants to insert a row with id=16, it is blocked too.
In principle, locking the row id=20 here is unnecessary: once id=15 has been scanned, it is certain that there is no need to look further. But the implementation does it anyway, so it is a bug.
It has also been raised in the official bug system, but has not been verified.
Insert a new record into table t.
mysql> insert into t values(30,10,30);
The newly inserted row has c=10, which means the table now has two rows with c=10. So, what is the state of the gaps on index c? Remember that since a non-unique index also contains the primary key value, it is impossible to have two fully "identical" rows.
As you can see, although there are two records with c=10, their primary key id values are different (10 and 30 respectively), so there is a gap between these two c=10 records.
The figure shows the primary key id carried on index c. To distinguish from the open-interval notation of gap locks, I use the form (c=10, id=30) to represent one row on the index.
- This time, a delete statement is used for verification. Note that the locking logic of a delete statement is actually similar to that of select ... for update, that is, the two "principles", two "optimizations" and one "bug" summarized at the beginning of this article.
At this point, while traversing, session A first accesses the first record with c=10. Again, according to principle 1, a next-key lock from (c=5, id=5) to (c=10, id=10) is added here.
Then session A looks to the right until it encounters the row (c=15, id=15), and the loop ends. According to optimization 2, this is an equality query whose rightward scan found a row that does not satisfy the condition, so the lock from (c=10, id=10) to (c=15, id=15) degenerates into a gap lock.
In other words, the locking range of this delete statement on index c is the part covered by the blue area in the figure below.
The blue area is dashed at its left and right edges, which indicates that the two rows (c=5, id=5) and (c=15, id=15) themselves are not locked.
A contrast to the case 6 scenario:
In this example, limit 2 is added to session A's delete statement. As you know, table t has only two records with c=10, so the deletion effect is the same with or without limit 2, but the locking effect is different. As you can see, session B's insert statement goes through this time, which differs from the result of case 6.
The reason: the delete statement in case 7 has an explicit limit 2, so after traversing to the row (c=10, id=30), two satisfying rows have already been found and the loop ends.
Therefore, the locking range on index c becomes the interval from (c=5, id=5) to (c=10, id=30), as shown in the following figure.
As you can see, the gap after (c=10, id=30) is not within the locking range, so the insert statement inserting c=12 can execute successfully.
When deleting data, try to add limit. This not only controls the number of deleted rows, making the operation safer, but also reduces the locking range.
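For example, a large cleanup can be run as repeated small batches (the condition and batch size here are illustrative):

mysql> delete from t where d=5 limit 500;   -- repeat until it reports 0 rows affected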
In the previous examples, the analysis was based on next-key lock logic, because that is more convenient. Finally, let's look at a case showing that a next-key lock is in fact the combined result of a gap lock and a row lock.
Now, let’s analyze why this is the case in chronological order.
- After session A starts the transaction and executes the query with lock in share mode, the next-key lock (5,10] and the gap lock (10,15) are added on index c;
- Session B's update statement also needs to add the next-key lock (5,10] on index c, and it enters lock waiting;
- Then session A tries to insert the row (8,8,8), and is blocked by session B's gap lock. Because a deadlock has formed, InnoDB rolls back session B.
But hasn't session B's next-key lock not even been granted yet? How can it block session A?
- Session B's operation of "adding the next-key lock (5,10]" is actually performed in two steps: first add the gap lock (5,10), which succeeds; then add the row lock on c=10, and this is where it blocks.
- In other words, we can use next-key locks when analyzing locking rules, but you should know that the concrete implementation is split into the gap lock and the row lock, acquired separately.
In the execution sequence of the following figure, why is session B's insert statement blocked?
Use the locking rules to analyze which locks session A's select statement adds:
- Since it is order by c desc, the first row to locate is the "rightmost" row with c=20 on index c, so the gap lock (20,25) and the next-key lock (15,20] are added.
- The scan then traverses index c leftward and stops only at c=10, so the next-key lock (5,10] is also added. This is what blocks session B's insert statement.
- During the scan, the three rows c=20, c=15 and c=10 all exist, and since the query is select *, row locks are also added on the primary key id.
Therefore, the locking range of session A's select statement is:
- (5,25) on index c;
- The two row locks on id=15 and id=20 on the primary key index.
Is the lock added at the boundary a gap lock or a row lock?
Analyze it with the "execution process": when InnoDB wants to find the "first value", it locates it using the equality rule; after the first value is found, looking for the "next value" in the index corresponds to the range search in our rules.
The normal short-connection pattern is to connect to the database, execute very few SQL statements, then disconnect, and reconnect the next time it is needed. If you use short connections, the number of connections may suddenly soar during business peaks.
The cost of a MySQL connection is very high. Besides the normal three-way network handshake, the server also needs to do login permission checks and obtain this connection's data read/write permissions.
When the database pressure is relatively small, these extra costs are not obvious.
However, the short-connection model carries a risk: once the database starts processing more slowly, the number of connections soars. The max_connections parameter controls the upper limit of simultaneous connections to a MySQL instance. Once this value is exceeded, the system rejects subsequent connection requests and reports the error "Too many connections". For the requests whose connections are rejected, from the business point of view, the database is unavailable.
When the machine load is high, processing existing requests takes longer, and each connection stays open longer. At this point, new connections may exceed the max_connections limit.
In this situation, a natural idea is to increase the value of max_connections. But there is a risk: the purpose of the max_connections parameter is to protect MySQL. If we raise it so much that more connections can come in, the system load may increase further, with large amounts of resources spent on logic such as permission verification, and the result may be counterproductive: the threads that are already connected cannot get CPU resources to execute their business SQL requests.
The counting of max_connections does not look at who is actively running: a connection occupies one counting slot as long as it stays connected. For connections that do not need to be kept, we can actively kick them out with kill connection. This behaves the same as setting wait_timeout in advance: the wait_timeout parameter means that once a thread has been idle for wait_timeout seconds, MySQL disconnects it directly.
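For reference, wait_timeout is a standard system variable (the value here is illustrative; a global change only applies to connections made afterwards):

mysql> set global wait_timeout=600;   -- disconnect a connection after 10 minutes of idleness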
However, note that in the output of show processlist, kicking out a thread shown as sleep may be harmful.
In the example above, if session A is disconnected, then since session A has not committed yet, MySQL can only roll back its transaction; disconnecting session B, however, has no great impact. So, by priority, you should first disconnect connections like session B.
But how do you judge which connections are idle outside a transaction? Session C executes show processlist 30 seconds after time T, with the following result.
In the figure, the two sessions with id=4 and id=5 are both in the sleep state. If you want to see the specific transaction state, you can query the information_schema.innodb_trx table.
In this result, trx_mysql_thread_id=4 indicates that the thread with id=4 is still inside a transaction.
Therefore, if there are too many connections, you can first disconnect the connections that have been idle too long outside transactions; if that is still not enough, then consider disconnecting the connections that have been idle too long inside transactions.
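A sketch of the check-and-kill sequence, using the information_schema.innodb_trx table mentioned above (thread id 5 is just the id from this example):

mysql> select trx_mysql_thread_id, trx_started from information_schema.innodb_trx;
mysql> kill connection 5;   -- id=5 is idle outside a transaction in this example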
The command to actively disconnect a connection on the server side is kill connection + id. When a client in the sleep state has its connection actively closed by the server, the client does not know immediately. Only when it initiates its next request does it receive the error "ERROR 2013 (HY000): Lost connection to MySQL server during query".
Actively disconnecting from the database side may be harmful, especially when some applications, upon receiving this error, do not reconnect but directly retry the query with a handle that is no longer usable. From the application side, this looks like "MySQL has never recovered".
If you are a DBA supporting the business, do not assume that all application code will handle this correctly. Even if it is just killing connections, make sure to notify the business development team.
Some business code applies for a large number of database connections in a short time to hold in reserve. If it is confirmed that the database is being overwhelmed by connection behavior, then one possible approach is to let the database skip the permission verification phase.
The way to skip permission verification is to restart the database with the --skip-grant-tables parameter. The whole of MySQL will then skip all permission verification phases, including the connection process and the statement execution process.
However, this method fits our title of "drinking poison to quench thirst" particularly well. It is extremely risky and not recommended, especially if the database can be accessed from outside the internal network: in that case you must not do it.
In MySQL 8.0, if you enable the --skip-grant-tables parameter, MySQL turns on the --skip-networking parameter by default, which means the database can then only be connected by local clients. You can see that MySQL officials also attach great importance to the security of skip-grant-tables.
Besides the performance problems that a surge of short connections may cause, more performance problems actually come from query or update statements. Among query problems there are two typical classes: one caused by newly appearing slow queries, the other caused by a sudden rise in QPS (queries per second).
In MySQL, slow queries that may cause performance problems have three possible causes:
- The index is not well designed;
- The SQL statement is not well written;
- MySQL chose the wrong index.
This kind of scenario is usually solved by urgently creating an index. Since MySQL version 5.6, index creation supports online DDL. For the situation where the database has already been overwhelmed by such statements during peak hours, the most efficient approach is to execute the alter table statement directly.
Ideally, it can be executed on the standby database first. Suppose your current service is one primary and one standby, with primary database A and standby database B. The rough process of this scheme is as follows:
- On standby database B, execute set sql_log_bin=off, that is, do not write binlog, and then execute the alter table statement to add the index;
- Perform a primary-standby switchover;
- Now the primary is B and the standby is A. On A, execute set sql_log_bin=off, and then execute the alter table statement to add the index.
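Sketched as statements (the index name idx_c is illustrative; algorithm=inplace, lock=none makes MySQL fail fast if the DDL cannot run online):

mysql> set sql_log_bin=off;
mysql> alter table t add index idx_c(c), algorithm=inplace, lock=none;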
This is an "old-school" DDL scheme. Normally, when making such a change, you should consider a safer scheme like gh-ost. But when emergency handling is required, the scheme above is the most efficient.
We can do this by rewriting the SQL statement. MySQL 5.7 provides query_ Rewrite function, can input a statement into another mode.
For example, if the statement is wrongly written as select * from t where id + 1 = 10000, you can add a statement rewriting rule in the following way.
mysql> insert into query_rewrite.rewrite_rules(pattern, replacement, pattern_database) values ("select * from t where id + 1 = ?", "select * from t where id = ? - 1", "db1"); call query_rewrite.flush_rewrite_rules();
Here, calling the stored procedure query_rewrite.flush_rewrite_rules() makes the newly inserted rule take effect; this is what "query rewrite" refers to. You can use the method in Figure 4 to confirm whether the rewrite rule is in effect.
In this case, the emergency plan is to add a force index hint to the statement.
Similarly, using the query rewriting function and adding force index to the original statement can also solve this problem.
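As a sketch, assuming a hypothetical table t with a secondary index a that the optimizer mistakenly avoids, a force index hint looks like this:

```sql
-- Hypothetical example: pin the statement to index a.
select * from t force index(a)
where a between 10000 and 20000;
```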
Among the three causes of slow-query performance problems discussed above, the most common are the first two: poorly designed indexes and poorly written statements. Both can be avoided entirely. For example, the following process helps you find problems in advance.
- Before going live, in the test environment, enable the slow query log and set long_query_time to 0, so that every statement is recorded in the slow query log;
- Insert data simulating the production data into the test tables and run a regression test;
- Examine each type of statement in the slow query log output, paying particular attention to whether the rows_examined field matches expectations. (We have used rows_examined many times in previous articles; I believe you have tried it. If anything is still unclear, leave me a message and we can discuss it.)
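A sketch of the test-environment setup described in the first step:

```sql
-- Record every statement in the slow query log (test environment only;
-- long_query_time = 0 would flood the log in production).
set global slow_query_log = on;
set global long_query_time = 0;
```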
If there are not many new SQL statements, just run them manually. If it is a new project, or the table structure of an existing project has been modified, a full regression test is necessary. In that case, you need a tool to help you check the results of all SQL statements; for example, you can use the open-source tool pt-query-digest (www.percona.com/doc/percona-toolki…).
Sometimes, the QPS of a certain statement suddenly soars due to a sudden business peak or application bug, which may also lead to excessive MySQL pressure and affect the service.
If it is caused by a bug in a new feature, the most ideal situation of course is for the business side to take the feature offline, after which the service recovers naturally.
If the problem has to be handled from the database side, different methods apply to different situations.
- One case is a bug in a newly launched feature. Suppose your DB operations are fairly standardized, that is, whitelist entries are added one by one. In this case, if you are sure the business side will take the feature offline, just not quickly enough, you can simply remove its entry from the database-side whitelist.
- If the new feature uses a separate database user, you can delete that user with an administrator account and then kill its existing connections. New connections from the feature will then fail, and the QPS it generates drops to 0.
- If the new feature is deployed together with the main application, the only option is to restrict it at the statement level. In this case, you can use the query rewrite feature mentioned above to rewrite the most heavily loaded SQL statement directly into "select 1".
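A sketch of scheme 3, reusing the query rewrite feature shown earlier; the pattern, table and database name are hypothetical:

```sql
-- Emergency stopgap: answer the overloaded statement with "select 1".
insert into query_rewrite.rewrite_rules(pattern, replacement, pattern_database)
values ("select * from t where id = ?", "select 1", "db1");
call query_rewrite.flush_rewrite_rules();
```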
Of course, this operation is very risky, and you need to be especially careful. It can have two side effects:
- if the same SQL template is also used by other features, they will be hit by accident;
- many business flows do not complete their logic with this single statement, so returning "select 1" for this statement alone may cause the subsequent business logic to fail as well.
Therefore, scheme 3 is only for stopping the bleeding. Like the scheme above of skipping privilege verification, it should have the lowest priority among all your options.
You will also notice that schemes 1 and 2 both depend on a standardized operations system: virtualization, a whitelist mechanism, and separate accounts per business. From this you can see that more preparation often means a more stable system.
As long as redo log and binlog are persistent to disk, the data can be recovered after MySQL is restarted abnormally.
In fact, the writing logic of binlog is relatively simple: during the execution of a transaction, the log is first written to the binlog cache, and then the binlog cache is written to the binlog file when the transaction is submitted.
The binlog of a transaction cannot be split up; no matter how large the transaction is, its binlog must be written in one piece. This brings us to how the binlog cache is stored.
The system allocates a piece of memory for the binlog cache, one per thread, and the parameter binlog_cache_size controls how much memory a single thread's binlog cache may occupy. If a transaction exceeds this size, the excess has to be temporarily saved to disk.
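A sketch of checking whether transactions are overflowing the binlog cache, using the standard status counters:

```sql
show variables like 'binlog_cache_size';
-- Binlog_cache_use: transactions that used the binlog cache;
-- Binlog_cache_disk_use: transactions that exceeded it and spilled to a temporary file.
show global status like 'Binlog_cache%';
```

If Binlog_cache_disk_use grows steadily, large transactions are spilling and binlog_cache_size may be worth raising.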
When a transaction is committed, the executor writes the complete transaction in the binlog cache to the binlog and clears the binlog cache. The status is shown in Figure 1.
As you can see, each thread has its own binlog cache, but shares the same binlog file.
- Write in the figure refers to writing the log to the page cache of the file system. It does not persist the data to disk, so it is faster.
- fsync in the figure is the operation that actually persists data to disk. It is generally fsync that consumes the disk's IOPS.
The timing of write and fsync is controlled by the parameter sync_binlog:
- sync_binlog=0 means that on every transaction commit there is only a write, no fsync;
- sync_binlog=1 means that fsync is executed on every transaction commit;
- sync_binlog=N (N>1) means that every transaction commit does a write, but fsync happens only after N transactions have accumulated.
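These write strategies can be inspected and adjusted at runtime, since sync_binlog is a dynamic global variable; a sketch:

```sql
show variables like 'sync_binlog';
-- Accumulate roughly 100 commits per fsync; the trade-off is losing up to
-- 100 transactions' binlog if the host restarts abnormally.
set global sync_binlog = 100;
```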
Therefore, when disk I/O is the bottleneck, setting sync_binlog to a larger value can improve performance. In real business scenarios, considering the controllability of lost logs, setting this parameter to 0 is generally not recommended; a value between 100 and 1000 is more common.
The corresponding risk of setting sync_binlog to N is that if the host restarts abnormally, the binlog of the most recent N transactions is lost.
During the execution of a transaction, the generated redo log is written to the redo log buffer first.
The content in the redo log buffer does not need to be persisted to disk after each generation.
If MySQL is restarted abnormally during transaction execution, this part of the log will be lost. Since the transaction is not committed, there is no loss when the log is lost.
Before a transaction is committed, part of the logs in the redo log buffer may also already be persisted to disk.
Let's start from the three possible states of a piece of redo log. These three states correspond to the three colored blocks in Figure 2.
The three states are as follows:
- It exists in the redo log buffer. Physically, it is in the MySQL process memory, which is the red part of the graph;
- Write to disk, but no fsync. Physically, it is in the page cache of the file system, which is the yellow part in the figure;
- Persistent to disk, corresponding to the hard disk, which is the green part in the figure.
Writing logs to the redo log buffer is fast, and the write into the page cache is about as fast, but persisting logs to disk is much slower.
To control the write strategy of the redo log, InnoDB provides the innodb_flush_log_at_trx_commit parameter, which has three possible values:
- When it is set to 0, it means that the redo log is only left in the redo log buffer every time a transaction is submitted;
- When it is set to 1, it means that the redo log is directly persisted to disk every time a transaction is submitted;
- When set to 2, the redo log is only written to page cache every time a transaction is submitted.
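Similarly, innodb_flush_log_at_trx_commit is a dynamic global variable and can be checked and changed online; a sketch:

```sql
show variables like 'innodb_flush_log_at_trx_commit';
-- 1 = fsync the redo log on every commit (the safe "double 1" choice).
set global innodb_flush_log_at_trx_commit = 1;
```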
InnoDB has a background thread. Every second, it writes the log from redo log buffer to the page cache of the file system, and then calls fsync to persist to disk.
Note that the redo logs of the intermediate process of transaction execution are also written directly in the redo log buffer, and these redo logs will also be persisted to disk by the background thread. In other words, the redo log of a transaction that has not been committed may have been persisted to disk.
In fact, in addition to the polling operation of the background thread once per second, there are two scenarios in which the redo log of an uncommitted transaction is written to disk.
- One is when the space occupied by the redo log buffer is about to reach half of innodb_log_buffer_size; then the background thread actively writes it to disk. Note that since the transaction has not been committed, this write action is only a write, without calling fsync, so the log stays in the file system's page cache.
- The other is when a parallel transaction commits and persists this transaction's redo log buffer to disk along the way. Suppose transaction A is halfway through execution and has written some redo log into the buffer. At this moment, another thread's transaction B commits. If innodb_flush_log_at_trx_commit is set to 1, transaction B, following that parameter's logic, persists all the logs in the redo log buffer to disk, taking transaction A's redo log in the buffer with it.
In terms of timing, redo log is prepared first, then binlog is written, and finally redo log is committed.
If innodb_flush_log_at_trx_commit is set to 1, the redo log is persisted once in the prepare phase, because the crash recovery logic depends on the prepared redo log plus the binlog to recover.
With one background polling per second and the logic of crash recovery, InnoDB thinks that the redo log does not need fsync when it is committed, and only writes to the page cache of the file system.
Generally speaking, MySQL's "double 1" configuration means setting both sync_binlog and innodb_flush_log_at_trx_commit to 1. In other words, before a transaction is fully committed, it has to wait for two flushes to disk: one for the redo log (prepare phase) and one for the binlog.
This raises a question: it implies that if the TPS I observe on MySQL is 20,000 per second, MySQL writes the disk 40,000 times per second. But my tests show the disk can only handle about 20,000 operations per second. How can MySQL still reach 20,000 TPS?
At this time, we need to use the group commit mechanism.
First we need to introduce the concept of the log sequence number (LSN). The LSN increases monotonically and corresponds to write positions in the redo log. Each time a piece of redo log of length `length` is written, the LSN increases by `length`.
The LSN is also written into InnoDB data pages, to ensure that a data page does not apply the same redo log twice.
As shown in Figure 3, three concurrent transactions (trx1, trx2, trx3) in the prepare phase have all written the redo log buffer and are about to persist it to disk; their corresponding LSNs are 50, 120 and 160 respectively.
As you can see from the figure,
- trx1 arrives first and is chosen as the leader of this group;
- by the time trx1 starts writing the disk, the group already contains three transactions, and the LSN has become 160;
- trx1 writes to disk carrying LSN=160, so when trx1 returns, all redo logs with LSN less than or equal to 160 have been persisted to disk;
- at this point trx2 and trx3 can return directly.
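On a running instance you can observe the LSN positions discussed above; a sketch (the field names come from the InnoDB status report and may vary slightly across versions):

```sql
-- The LOG section shows "Log sequence number" (current LSN)
-- and "Log flushed up to" (the LSN already persisted to disk).
show engine innodb status\G
```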
Therefore, in one group commit, the more group members there are, the more disk IOPS are saved. In a single-threaded stress test, however, each transaction can only honestly correspond to one persist operation.
In a concurrent update scenario, the later the fsync is called after the first transaction has written the redo log buffer, the more group members it may carry, and the better the IOPS saving.
To let one fsync carry more group members, MySQL has an interesting optimization: procrastination.
In the figure, "write binlog" is shown as a single action, but writing the binlog is actually split into two steps:
- First, write binlog from binlog cache to binlog file on disk;
- Call fsync persistence.
To make group commit work better, MySQL delays the fsync of the redo log until after step 1. That is, the figure above becomes like this:
This way, the binlog can also be group-committed. When executing step 4 in Figure 5, fsyncing the binlog to disk, if several transactions' binlogs have already been written, they are persisted together, which likewise reduces IOPS consumption.
However, step 3 usually executes very quickly, so the interval between the binlog's write and fsync is short, and few binlogs can be accumulated for one persist. Therefore, the group commit effect for the binlog is usually not as good as for the redo log.
If you want to improve the binlog group commit effect, you can set the binlog_group_commit_sync_delay and binlog_group_commit_sync_no_delay_count parameters.
- binlog_group_commit_sync_delay indicates how many microseconds to wait before calling fsync;
- binlog_group_commit_sync_no_delay_count indicates how many transactions to accumulate before calling fsync.
These two conditions are in an OR relationship: fsync is called as soon as either one is satisfied.
So when binlog_group_commit_sync_delay is set to 0, binlog_group_commit_sync_no_delay_count has no effect.
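A sketch of turning on the delayed group commit described above (the values are illustrative, not recommendations):

```sql
-- Wait up to 100 microseconds before fsync ...
set global binlog_group_commit_sync_delay = 100;
-- ... or until 10 transactions have queued up, whichever comes first.
set global binlog_group_commit_sync_no_delay_count = 10;
```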
The WAL mechanism is meant to reduce disk writes, yet every transaction commit writes both the redo log and the binlog; counting writes, the number of disk operations does not seem reduced. So how does WAL actually save disk I/O?
The WAL mechanism mainly benefits from two aspects:
- both redo log and binlog are written sequentially, and sequential disk writes are faster than random writes;
- the group commit mechanism greatly reduces disk IOPS consumption.
So if your MySQL instance has a performance bottleneck, and that bottleneck is I/O, what can you do to improve performance?
To solve this problem, we can consider the following three methods:
- Set the binlog_group_commit_sync_delay and binlog_group_commit_sync_no_delay_count parameters to reduce the number of binlog disk writes. This method is based on "deliberate extra waiting", so it may increase statement response time, but it carries no risk of losing data.
- Set sync_binlog to a value greater than 1 (100 to 1000 is common). The risk is losing binlog when the host loses power.
- Set innodb_flush_log_at_trx_commit to 2. The risk is losing data when the host loses power.
I do not recommend setting innodb_flush_log_at_trx_commit to 0. With this value the redo log is kept only in memory, so an abnormal MySQL restart also loses data, which is too risky. Since writing the redo log to the file system's page cache is very fast, setting the parameter to 2 performs about the same as 0, but with 2, an abnormal MySQL restart (with the host staying up) does not lose data, so the risk is smaller.
Question 1: After executing an update statement, I use the hexdump command to inspect the contents of the .ibd file directly. Why don't I see any change in the data?
Answer: this is because of the WAL mechanism. After an update statement executes, InnoDB only guarantees that the redo log and memory have been written; it may not yet have written the data to disk.
Question 2:Why is the binlog cache maintained by each thread and the redo log buffer shared globally?
Answer: the main reason MySQL is designed like this is that binlog cannot be “interrupted”. The binlog of a transaction must be written continuously. Therefore, after the completion of the whole transaction, it should be written to the file together.
The redo log does not have this requirement. The generated logs can be written to the redo log buffer. The contents in the redo log buffer can also be “free riding”, and other transactions can be written to disk together when they are submitted.
Question 3:During the execution of a transaction, it is not yet in the commit phase. If a crash occurs, the redo log will be lost. Will this lead to inconsistency between the active and standby?
Answer: No. Because the binlog is still in the binlog cache at this time, it is not sent to the standby database. After crash, there is no redo log and binlog. From a business point of view, the transaction is not committed, so the data is consistent.
Question 4: If a crash happens after the binlog has been written to disk but before a reply is returned to the client, and MySQL restarts, the client on reconnecting finds the transaction committed successfully. Is this a bug?
Imagine a more extreme case: the whole transaction commits successfully, the redo log commit completes, and the standby receives and executes the binlog, but then the network between the primary and the client drops, so the packet reporting success never arrives. The client also receives a "network disconnected" exception in that case. This too can only be counted as a successful transaction, not a bug.
In fact, the crash safe guarantee of database is as follows:
- If the client receives the message of successful transaction, the transaction must be persistent;
- If the client receives the message of transaction failure (such as primary key conflict, rollback, etc.), the transaction must fail;
- If the client receives the message of “execution exception”, the application needs to query the current state to continue the subsequent logic after reconnection. At this time, the database only needs to ensure internal consistency (between data and log, between primary database and standby database).
Question 5:When will the online production library be set to “non double 1”?
- During business peaks. Generally, for a predictable peak period, the DBA will plan to set the primary database to "non double 1".
- To let the standby catch up with the primary as quickly as possible. (Readers @Eternal memory and @Second sight mentioned this scenario.)
- Restoring an instance from a backup and then applying binlog to catch up is a similar scenario.
- When importing data in batches.
Generally, switching a production database to "non double 1" means setting innodb_flush_log_at_trx_commit=2 and sync_binlog=1000.
- One question was whether setting binlog_group_commit_sync_delay and binlog_group_commit_sync_no_delay_count on the standby causes it to lag persistently. We set these two parameters on the primary to reduce binlog disk-write pressure; set on the standby, they do add delay, especially when it is "about to catch up". In general, "non double 1" is what you use to let the standby catch up.
- Another question was what happens if binlog_group_commit_sync_delay and binlog_group_commit_sync_no_delay_count are set while sync_binlog=0; asking it shows a good awareness of boundary conditions and the habit of verifying by hand. It works like this: the delay/no_delay_count logic runs first, so the waiting still happens. Once either condition is satisfied, execution enters the sync_binlog phase; there, if sync_binlog=0, the fsync is simply skipped.
- When sync_binlog=0 is set, you can still see the binlog file being modified immediately. That is expected: what we call "write" goes into the file system's page cache, and what the ls command shows is what the file system reports.
Main / standby switching process:
In state 1, client reads and writes go directly to node A, while node B is A's standby: it only synchronizes A's updates and executes them locally. This keeps the data of nodes A and B identical.
When a switch is needed, we move to state 2. Now client reads and writes go to node B, and node A is B's standby.
In state 1, although node B is not accessed directly, I still recommend setting node B (the standby) to readonly mode, for the following reasons:
- Sometimes, some query statements of operation class will be put to the standby database for query, which can be set as read-only to prevent misoperation;
- To prevent the switch logic from having bugs, such as double write in the process of switch, resulting in the inconsistency between master and standby;
- The readonly state can be used to determine the role of the node.
If the standby is set to readonly, how can it keep up with the primary?
- The readonly setting has no effect on users with the super privilege, and the thread used for synchronizing updates has the super privilege.
Let's see the internal flow of the line from node A to node B. Figure 2 shows the complete flow of an update statement executed at node A and then synchronized to node B.
As you can see in Figure 2, after receiving an update request from a client, the primary executes the transaction's update logic internally and writes the binlog.
Standby B maintains a long connection with primary A. Inside primary A there is a thread dedicated to serving standby B's long connection. A complete transaction-log synchronization process looks like this:
- On standby B, use the change master command to set primary A's IP, port, user name and password, and the position from which to request the binlog. This position contains a file name and a log offset.
- Execute the start slave command on standby B. The standby then starts two threads, shown in the figure as io_thread and sql_thread. Among them, io_thread is responsible for establishing the connection with the primary.
- After verifying the user name and password, primary A starts reading the binlog locally from the position sent by standby B and sends it to B.
- After receiving the binlog, standby B writes it to a local file called the relay log.
- sql_thread reads the relay log, parses the commands in it, and executes them.
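The steps above can be sketched as follows; the host, account, file name and position are hypothetical placeholders:

```sql
-- On standby B: point at primary A and the binlog position to start from.
change master to
  master_host='10.0.0.1', master_port=3306,
  master_user='repl', master_password='repl_pwd',
  master_log_file='master.000001', master_log_pos=154;
-- Start io_thread and sql_thread.
start slave;
-- Check the Slave_IO_Running / Slave_SQL_Running fields.
show slave status\G
```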
It should be noted that, with the introduction of multi-threaded replication schemes, sql_thread has since evolved into multiple threads.
The binlog has three formats: one is statement, another is row, and the third, called mixed, is a mixture of the first two.
Create a table and initialize several rows of data:
mysql> CREATE TABLE `t` ( `id` int(11) NOT NULL, `a` int(11) DEFAULT NULL, `t_modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (`id`), KEY `a` (`a`), KEY `t_modified`(`t_modified`) ) ENGINE=InnoDB; insert into t values(1,1,'2018-11-13'); insert into t values(2,2,'2018-11-12'); insert into t values(3,3,'2018-11-11'); insert into t values(4,4,'2018-11-10'); insert into t values(5,5,'2018-11-09');
If you want to delete a row of data in the table, let’s see how the binlog of the delete statement is recorded.
Note that the following statement contains a comment. If you use the MySQL client for this experiment, remember to add the -c parameter; otherwise the client automatically strips the comment.
mysql> delete from t /*comment*/ where a>=4 and t_modified<='2018-11-10' limit 1;
When binlog_format=statement, the binlog records the original SQL statement. You can use the command
mysql> show binlog events in 'master.000001';
to view the contents of the binlog.
Take a look at the output in Figure 3.
- The first line SET @@SESSION.GTID_NEXT='ANONYMOUS' can be ignored for now;
- The second line is a BEGIN, which pairs with the COMMIT on the fourth line, indicating that a transaction lies in between;
- The third line is the statement actually executed. You can see that before the real delete command there is a "use `test`" command. We did not issue this command ourselves; MySQL adds it based on the database of the table being operated on. It ensures that when the log is sent to the standby for execution, table t of the test database is updated correctly no matter which database the worker thread currently happens to be in.
The delete statement following the use `test` command is the original SQL we entered. As you can see, the binlog "faithfully" records the SQL command, even the comment.
- The last line is a COMMIT. You can see it contains XID=61.
To illustrate the difference between statement and row format, let’s take a look at the execution effect of this delete command
As you can see, running this delete command produces a warning, because the current binlog is set to statement format and the statement contains a limit, so the command may be unsafe.
Why? Because a delete with limit can cause the primary and standby data to diverge. Taking the example above:
- if the delete statement uses index a, the first row satisfying the condition found via index a is deleted, that is, the row a=4;
- but if index t_modified is used, the first row found is t_modified='2018-11-09', that is, the row a=5 is deleted.
In statement format, the binlog records the original statement, so it can happen that the primary executes the statement using index a while the standby executes it using index t_modified. That is why MySQL considers writing it this way risky.
Then, if the binlog format is changed to binlog_format='row', does the problem go away? Let's look at the binlog content again.
As you can see, compared with statement-format binlog, the BEGIN and COMMIT before and after are the same. However, row-format binlog contains no original SQL statement; instead it is replaced by two events: Table_map and Delete_rows.
- The Table_map event indicates that the table to be operated on next is table t of the test database;
- The Delete_rows event defines the delete behavior.
In fact, you cannot see the details from Figure 5. You need the mysqlbinlog tool to parse and view the binlog contents, using the command below. Since Figure 5 shows that this transaction's binlog starts at position 8900, you can use the --start-position parameter to start parsing from that point.
mysqlbinlog -vv data/master.000001 --start-position=8900;
From this figure, we can see the following information:
- server id 1 indicates that the transaction was executed on the instance with server_id=1.
- Every event carries a CRC32 value, because I set the parameter binlog_checksum to CRC32.
- The Table_map event is the same as the one in Figure 5: it indicates the table to open next, mapped to number 226. This SQL statement operates on only one table; if multiple tables were involved, each table would have its own Table_map event, each mapped to a distinct number, to distinguish operations on different tables.
- In the mysqlbinlog command, we use the -vv parameter to parse out the full content, so the results show the value of each field (for example, @1=4, @2=4).
- binlog_row_image defaults to full, so the Delete_rows event contains the values of all fields of the deleted row. If binlog_row_image is set to minimal, only the necessary information is recorded; for this example, only id=4.
- The final XID event indicates that the transaction was committed correctly.
When binlog_format is row, the binlog records the primary key id of the actually deleted row. When this binlog is sent to the standby, the standby is certain to delete the row with id=4; there is no risk of primary and standby deleting different rows.
Why does the mixed binlog format exist? The reasoning goes like this:
- Because some binlogs in the statement format may cause inconsistency between the active and the standby, the row format should be used.
- But the disadvantage of row format is that it takes up a lot of space. For example, you use a delete statement to delete 100000 rows of data. If you use statement, an SQL statement is recorded in the binlog, which takes up dozens of bytes of space. But if we use binlog in row format, we need to write all the 100000 records to binlog. This will not only take up more space, but also consume IO resources to write binlog, which will affect the execution speed.
- Therefore, MySQL takes a compromise, that is, binlog in mixed format. The mixed format means that MySQL will determine whether this SQL statement may cause inconsistency between the active and the standby. If possible, use row format, otherwise use statement format.
In other words, the mixed format retains the space advantage of the statement format while avoiding the risk of data inconsistency.
Therefore, if the binlog format of online MySQL is statement, it can be regarded as an unreasonable setting. You should at least set the binlog format to mixed.
For example, in this example, if it is set to mixed, it will be recorded in row format; if the limit 1 of the executed statement is removed, it will be recorded in statement format.
Nowadays, more and more scenarios require MySQL's binlog format to be set to row. There are many reasons, but one advantage is immediately visible: recovering data.
Next, we will look at the problem of data recovery from the perspective of delete, insert and update SQL statements.
As Figure 6 shows, even for a delete statement, row-format binlog saves the entire row image of the deleted row. So if you discover you deleted the wrong data after executing a delete statement, you can convert the delete recorded in the binlog into an insert and put the deleted row back to recover.
What about a mistaken insert? That is even more direct. In row format, the binlog of an insert statement records every field, which precisely locates the row just inserted. You can simply turn the insert into a delete and remove the mistakenly inserted row.
For an update statement, the binlog records the entire row both before and after the modification. So if an update was executed by mistake, you only need to swap the before and after row images in the event and replay it against the database to undo the update.
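A manual sketch of what a flashback tool automates, using the example row deleted earlier (id=4, whose full row image is saved in the Delete_rows event):

```sql
-- The mistaken delete, recorded in row format with the full row image:
delete from t where id = 4;
-- Recovery: convert the Delete_rows event back into an insert of that image.
insert into t values (4, 4, '2018-11-10');
```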
In fact, the need to restore data to the state before a mistaken delete, insert or update happens from time to time. MariaDB's Flashback tool rolls back data based on exactly the principle described above.
Although mixed-format binlog is no longer widely used, I will still use mixed format to illustrate one issue. Consider this SQL statement:
mysql> insert into t values(10,10, now());
If binlog format is set to mixed, will MySQL record it as row format or statement format?
Statement execution effect:
As you can see, MySQL chose the statement format. Then if this binlog takes one minute to reach the standby, won't the data of the primary and standby become inconsistent?
Next, use the mysqlbinlog tool to see:
As the results in the figure show, the binlog actually recorded an extra command before the event: SET TIMESTAMP=1546103491. This SET TIMESTAMP command fixes the value that the next call to now() will return.
Therefore, whether this binlog is applied by the standby one minute later or used three days later to restore a backup of the database, the row inserted by the insert statement has the same fixed value. In other words, through the SET TIMESTAMP command, MySQL guarantees consistency between primary and standby.
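A toy model of why the recorded SET TIMESTAMP makes statement replay deterministic; the replay function and event strings below are invented for illustration and are not the real binlog format.

```python
# Sketch: the replayer pins now() to the recorded SET TIMESTAMP value
# instead of the wall clock, so replay is deterministic.

def replay(events):
    ctx = {"timestamp": None}
    rows = []
    for ev in events:
        if ev.startswith("SET TIMESTAMP="):
            ctx["timestamp"] = int(ev.split("=")[1])
        elif ev == "insert into t values(10,10, now())":
            # now() resolves to the pinned timestamp, whenever we replay.
            rows.append((10, 10, ctx["timestamp"]))
    return rows

binlog = ["SET TIMESTAMP=1546103491",
          "insert into t values(10,10, now())"]
print(replay(binlog))   # same row regardless of when this runs
```

Whether this "binlog" is replayed now or in three days, the inserted row is identical, which is the point of the context command.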
Example: when replaying binlog data, someone parses the log with mysqlbinlog, then copies the statements out and executes them directly.
This approach is risky, because the result of some statements depends on contextual commands; executing them in isolation is likely to produce wrong results.
Therefore, the standard way to recover data with binlog is to parse it with mysqlbinlog tool, and then send the whole parsing result to MySQL for execution. Similar to the following command:
mysqlbinlog master.000001 --start-position=2738 --stop-position=2973 | mysql -h127.0.0.1 -P13000 -u$user -p$pwd;
The meaning of this command is to parse the contents from the 2738th byte to the 2973th byte in the master.000001 file and put them into MySQL for execution.
The feature of binlog ensures that the same binlog can be executed in the standby database to get the same status as the primary database.
Therefore, under normal circumstances the data of the primary and standby can be considered consistent; that is, the contents of nodes A and B in Figure 1 are the same. Figure 1 shows an M-S structure, but in production the dual-M structure, shown as the switching process in Figure 9, is more common.
Comparing Figure 9 with Figure 1, the only difference between the dual-M and M-S structures is one extra line: nodes A and B are always each other's primary and standby. This way, the roles need not be reconfigured during a switch.
However, there is still a problem to be solved.
Business logic executes an update on node A, and the generated binlog is sent to node B; node B, after executing the update, also generates binlog. (I suggest setting the parameter log_slave_updates to on, so that the standby generates binlog when it executes the relay log.)
Then, since node A is also node B's standby, it will take node B's newly generated binlog and execute it once more. The update statement would then be executed over and over between A and B; this is circular replication. How do we solve it?
As you can see from Figure 6 above, MySQL records the server ID of the instance where the command was first executed in binlog. Therefore, we can use the following logic to solve the problem of circular replication between two nodes:
- It is stipulated that the server IDs of the two instances must be different; if they are the same, a primary-standby relationship cannot be set up between them;
- A standby, while replaying a received binlog, generates new binlog that carries the same server ID as the original binlog (not its own);
- After receiving a log from its primary, each instance first checks the server ID; if it equals its own, the log was generated by itself, and the log is discarded.
According to this logic, if the dual-m structure is set, the log execution flow will be as follows:
- For a transaction updated on node A, the binlog records node A's server ID;
- After it is sent to node B and executed there, the binlog that node B generates still carries node A's server ID;
- When that log is sent back to node A, A sees that the server ID equals its own and does not process it. The loop is broken here.
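The three steps above can be simulated in a few lines. The Node class and event tuples are illustrative stand-ins, not MySQL's replication protocol.

```python
# Sketch: how the server_id carried in each binlog event breaks the
# replication loop in a dual-M setup.

class Node:
    def __init__(self, server_id):
        self.server_id = server_id
        self.applied = []     # statements this node actually executed
        self.binlog = []      # (origin_server_id, stmt) events to send out

    def execute_local(self, stmt):
        self.applied.append(stmt)
        # A locally executed statement is logged with this node's server_id.
        self.binlog.append((self.server_id, stmt))

    def apply_from_peer(self, event):
        origin, stmt = event
        if origin == self.server_id:
            return            # our own event came back: discard, loop broken
        self.applied.append(stmt)
        # Replaying keeps the ORIGINAL server_id, not this node's.
        self.binlog.append((origin, stmt))

a, b = Node(1), Node(2)
a.execute_local("update t set c=c+1")
b.apply_from_peer(a.binlog[0])   # B executes and re-logs with origin=1
a.apply_from_peer(b.binlog[0])   # A sees origin==1 and discards it
print(a.applied, b.applied)      # each node executed the update exactly once
```

The statement runs exactly once on each node, even though its binlog makes a full round trip.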
Under what circumstances will double-m structure appear circular replication?
One scenario: after a primary executed a transaction, someone ran the command set global server_id = x and changed its server_id. When the log comes back, the instance finds the log's server_id different from its own (new) server_id and has no choice but to execute it.
Another scenario involves three nodes, as in Figure 7: trx1 is executed on node B, so the server_id on its binlog is B's; the binlog is passed to node A, and then A and A' form a dual-M structure, and circular replication appears.
This kind of three node replication scenario will appear when doing database migration.
If circular replication occurs, you can execute the following command on a or a ‘:
stop slave; CHANGE MASTER TO IGNORE_SERVER_IDS=(server_id_of_B); start slave;
With this, the node will discard the logs it receives. After a while, execute the following command to restore the setting:
stop slave; CHANGE MASTER TO IGNORE_SERVER_IDS=(); start slave;
In an active standby relationship, each standby database receives the binlog of the primary database and executes it.
Under normal circumstances, as long as all binlogs generated by the update of the main database can be transferred to the standby database and executed correctly, the standby database can reach the state of consistency with the main database, which is the final consistency.
However, only the final consistency is not enough for MySQL to provide high availability.
Main / standby switching flow chart of dual-m structure:
A primary-standby switch may be a planned operations action, such as a software upgrade or taking the primary's host offline, or a passive one, such as a power failure on the primary's host.
Let’s take a look at the scene of active switching.
Before going into the detailed steps of the active switch, let's first explain a concept: synchronization delay. Three time points matter for data synchronization:
- Primary A finishes executing a transaction and writes it to binlog; call this time T1;
- The binlog is then sent to standby B; call the time standby B receives it T2;
- Standby B executes the transaction; call this time T3.
The so-called primary-standby delay is, for the same transaction, the difference between its execution time on the standby and on the primary, i.e. T3-T1.
You can run the command show slave status on the standby; its output includes seconds_behind_master, which indicates how many seconds the standby is currently delayed.
seconds_behind_master is calculated as follows:
- The binlog of each transaction contains a time field recording when it was written on the primary;
- The standby takes the time field of the transaction it is currently executing, computes the difference from the current system time, and that difference is seconds_behind_master.
As you can see, seconds_behind_master in fact computes T3-T1, so we can use it as the value of the primary-standby delay. Its time precision is one second.
If the system time settings of the active and standby database machines are inconsistent, will the value of the active and standby delay be inaccurate?
- Not really. When the standby connects to the primary, it executes the SELECT UNIX_TIMESTAMP() function to obtain the primary's current system time. If it finds that the primary's system time differs from its own, the standby automatically deducts the difference when computing seconds_behind_master.
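The calculation can be sketched as below, assuming the clock skew is measured once at connect time via SELECT UNIX_TIMESTAMP(); the function names and the toy timestamps are invented for illustration.

```python
# Sketch of how seconds_behind_master is derived: event's commit
# timestamp from the primary, the standby's clock, minus the clock skew
# measured once when the standby connects.

def clock_skew(primary_now, standby_now):
    # Measured at connect time via SELECT UNIX_TIMESTAMP() on the primary.
    return standby_now - primary_now

def seconds_behind_master(event_ts, standby_now, skew):
    # Standby's current time minus the event's primary-side timestamp,
    # with the clock skew deducted.
    return (standby_now - event_ts) - skew

# Primary's clock runs 100 s ahead of the standby's.
skew = clock_skew(primary_now=1000 + 100, standby_now=1000)
# Event committed at primary time 1090; standby applies it at its time 1020.
print(seconds_behind_master(event_ts=1090, standby_now=1020, skew=skew))
```

Without the skew deduction, the raw difference would be a nonsensical -70 seconds; with it, the delay comes out as 30 seconds.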
It should be noted that when the network is normal, the time required for logs to be transferred from the primary database to the standby database is very short, that is, the value of t2-t1 is very small. In other words, under normal network conditions, the main source of the delay between the active and standby databases is the time difference between receiving the binlog and executing the transaction.
Therefore, the most direct performance of the primary and standby latency is that the standby database consumes the relay log more slowly than the primary database produces the binlog.
First of all, under some deployment conditions, the performance of the machine where the standby library is located is worse than that of the machine where the primary library is located.
In general, the reasoning behind such deployments is that the standby serves no requests anyway, so a somewhat weaker machine will do. Or, people put 20 primaries on four machines and concentrate all the standbys on a single machine.
In fact, the pressure that update requests put on IOPS is no different between primary and standby. So with this kind of deployment, the standby is usually also set to "non-double-1" mode.
But updates trigger a large number of read operations as well. So when the several standby instances on the standby host compete for resources, primary-standby delay may result.
Of course, this kind of deployment is less now. Because the active and standby databases may be switched, and the standby database may become the primary database at any time, it is common for the primary and standby databases to select machines of the same specifications and make symmetric deployment.
Question 1:However, after symmetric deployment, there may be delays. Why?
- This is the second common possibility: the standby is under heavy pressure. The usual reasoning goes: since the primary provides write capacity, the standby might as well provide some read capacity; or, the analytic queries the back office needs must not affect normal business, so they can only run on the standby.
Scenario: the main database has a direct impact on the business. People will be more restrained when using it, but ignore the pressure control of the standby database. The result is that the query on the standby database consumes a lot of CPU resources, affects the synchronization speed, and causes the delay between the active and standby databases.
We can generally deal with this situation as follows:
- One master and many followers. In addition to the standby library, you can connect several slave libraries to share the reading pressure.
- Through binlog output to external systems, such as Hadoop system, let the external system provide the ability of statistical query.
Among them, the way of one master and many followers is mostly adopted. Because as a database system, we must also ensure the ability of regular full backup. The slave library is very suitable for backup.
Note: slave library and standby library are similar concepts. In this column, for convenience of description, I call the instance that will be selected as the new primary during HA the standby, and the others slaves.
Question 2:One master and multiple slaves are used to ensure that the pressure of the standby database will not exceed that of the primary database. Is there any other situation that may cause the delay of the primary and standby databases?
That’s the third possibility, big business.
Big transactions are easy to understand: the primary must wait for a transaction to finish before writing it to binlog and passing it to the standby. So if a statement on the primary runs for 10 minutes, that transaction will likely cause the standby to lag by 10 minutes.
A typical big-transaction scenario: don't delete too much data with a single delete statement.
For example, with archival data nobody usually bothers to purge history, and when space is nearly full, business developers delete a huge amount of historical data in one go. To avoid impacting the business at peak time (at least having this awareness is good), the mass delete is run at night.
As a result, the on-call DBA gets a delay alarm in the middle of the night. Afterwards, the DBA team asks that future deletes limit the amount of data removed per transaction and split the work into several batches.
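That advice can be sketched as a purge loop that deletes in bounded batches rather than one huge transaction. purge_in_batches, the history table, the date filter and the 5000-row batch size are all hypothetical; execute_sql stands in for a real database call.

```python
# Sketch: delete historical rows in small transactions instead of one
# big one, stopping when a batch comes back short.

def purge_in_batches(execute_sql, batch_size=5000):
    batches = 0
    while True:
        affected = execute_sql(
            f"DELETE FROM history WHERE created < '2018-01-01' "
            f"LIMIT {batch_size}")
        batches += 1
        if affected < batch_size:   # last, partial batch: we are done
            return batches

# Fake executor: pretend the table holds 12_000 matching rows.
remaining = [12_000]
def fake_execute(sql):
    n = min(5000, remaining[0])
    remaining[0] -= n
    return n

print(purge_in_batches(fake_execute))   # 3 batches: 5000 + 5000 + 2000
```

Each batch is its own transaction, so no single transaction holds the standby back for long.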
Another typical large transaction scenario is large table DDL.
Question 3:If you don’t do big things on the primary database, is there any reason for the delay?
There is one more general cause of primary-standby delay: the standby's parallel replication capability.
Due to the existence of active standby delay, there are different strategies when switching between active and standby.
In the double M structure of Figure 1, the detailed process of switching from state 1 to state 2 is as follows:
- Check standby B's current seconds_behind_master; if it is below a threshold (say, 5 seconds), continue to the next step, otherwise keep retrying this step;
- Change primary A to read-only, that is, set readonly to true;
- Keep checking standby B's seconds_behind_master until it becomes 0;
- Change standby B to read-write, that is, set readonly to false;
- Switch business traffic to standby B.
This switching procedure is usually driven by a dedicated HA system; let's call it the reliability-first procedure for now.
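The five-step procedure can be sketched as a small state machine. The hooks get_sbm, set_readonly and move_traffic are hypothetical stand-ins for what a real HA system would provide; the 5-second threshold follows the example in step 1.

```python
# Sketch of the reliability-first switch procedure.

import time

def reliability_first_switch(get_sbm, set_readonly, move_traffic,
                             threshold=5, poll=0.0):
    # Step 1: wait until seconds_behind_master is small enough.
    while get_sbm() >= threshold:
        time.sleep(poll)
    set_readonly("A", True)          # Step 2: primary A becomes read-only
    while get_sbm() != 0:            # Step 3: wait for B to fully catch up
        time.sleep(poll)
    set_readonly("B", False)         # Step 4: standby B becomes writable
    move_traffic("B")                # Step 5: point the business at B

# Drive it with a fake, shrinking delay.
sbm = [12, 7, 3, 1, 0]
log = []
reliability_first_switch(lambda: sbm.pop(0),
                         lambda n, ro: log.append((n, ro)),
                         lambda n: log.append(("traffic", n)))
print(log)
```

The unavailable window discussed next is exactly the span between the two set_readonly calls in this sketch.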
Note: SBM in the figure is short for the seconds_behind_master parameter.
As you can see, there is an unavailable window in this procedure: after step 2, both primary A and standby B are in the readonly state, so the system cannot be written to, and writability is not restored until step 5 completes.
Within this unavailable window, the most time-consuming step is usually step 3, which may take several seconds. That is why step 1 checks first that seconds_behind_master is already small enough.
Imagine that if the active standby delay is as long as 30 minutes at the beginning, and the system is not available for as long as 30 minutes without judgment. This situation is generally unacceptable.
Of course, the system’s unavailable time is determined by the data reliability priority strategy. You can also choose the policy of availability first to reduce the unavailable time to almost zero.
If you forcibly move steps 4 and 5 to the beginning, that is, switch connections over to standby B and make B writable without waiting for the primary-standby data to synchronize, the system has almost no unavailable time.
This switching process is temporarily called availability first process. The cost of this switching process is that the data may be inconsistent.
A case of data inconsistency caused by availability first process
mysql> CREATE TABLE `t` ( `id` int(11) unsigned NOT NULL AUTO_INCREMENT, `c` int(11) unsigned DEFAULT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB; insert into t(c) values(1),(2),(3);
This table defines an auto-increment primary key id; after the data is initialized, there are three rows on both primary and standby. Next, the business continues to execute two insert statements on table t:
insert into t(c) values(4); insert into t(c) values(5);
Now suppose heavy updates on other tables in the primary have driven the delay up to 5 seconds, and right after the insert c=4 statement, a primary-standby switch is initiated.
Figure 3 shows the switching process and the resulting data under the availability-first policy with binlog_format=mixed.
Now, let’s analyze the switching process:
- In step 2, primary A finishes the insert statement, inserting the row (4,4), and the switch begins.
- In step 3, because of the 5-second delay, standby B receives the client's insert c=5 command before it has had a chance to apply the relayed insert c=4.
- In step 4, standby B inserts the row (4,5) and sends this binlog to primary A.
- In step 5, standby B applies the relayed insert c=4, inserting the row (5,4). Meanwhile, the insert c=5 executed directly on standby B is passed to primary A, which inserts a new row (5,5).
The final result is that primary A and standby B each contain two rows of inconsistent data. As you can see, this inconsistency was caused by the availability-first procedure.
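Under stated assumptions (statement-format replay, per-node auto-increment counters), the inconsistency above can be reproduced in a few lines. The Table class is a toy model, not MySQL.

```python
# Sketch reproducing the (4,4)/(4,5) mix-up under availability-first
# switching with statement-format binlog: each node re-runs "insert(c)"
# with its OWN auto-increment counter.

class Table:
    def __init__(self):
        self.rows = {}
        self.next_id = 1
    def insert(self, c):
        self.rows[self.next_id] = c
        self.next_id += 1

A, B = Table(), Table()
for c in (1, 2, 3):
    A.insert(c); B.insert(c)     # both start with rows (1,1)(2,2)(3,3)

A.insert(4)                      # A executes insert c=4 -> row (4,4)
# Switch happens before B applies A's relay log; clients now write to B.
B.insert(5)                      # B: next_id is still 4 -> row (4,5)
B.insert(4)                      # B applies the relayed insert c=4 -> (5,4)
A.insert(5)                      # A applies B's binlog insert c=5 -> (5,5)

print(A.rows[4], B.rows[4])      # id=4 differs between the two nodes
print(A.rows[5], B.rows[5])      # and so does id=5
```

Both nodes end up with the same set of c values but attached to different ids, exactly the two inconsistent rows described in the walkthrough.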
Now, what happens if we still use the availability-first policy but set binlog_format=row?
Because row format records all field values of the newly inserted row in the binlog, only one row ends up inconsistent, and the replication threads on both sides report a duplicate-key error and stop. That is, the row (5,4) on standby B and the row (5,5) on primary A are each not applied by the other side.
From the above analysis, we can see some conclusions
- With row-format binlog, data inconsistencies are easier to discover. With mixed- or statement-format binlog, data can quietly drift apart; if the inconsistency is only discovered much later, the original divergence may no longer be traceable, or it may have triggered further logical inconsistencies in the data.
- The availability priority policy of active standby handoff will lead to data inconsistency. Therefore, in most cases, I recommend that you use the reliability first strategy. After all, for data services, data reliability is generally better than availability.
Is there any case where the availability of data has a higher priority?
- There is a library for logging operations. At this time, if the data is inconsistent, it can be repaired by binlog, and the transient inconsistency will not cause business problems.
- At the same time, the business system depends on the log writing logic. If the library is not writable, the online business operations will not be executed.
At this time, you may need to choose the strategy of forcibly switching first and then replenishing the data afterwards.
Of course, an improvement the post-mortem would suggest is to make the business logic not depend directly on writing this log. The log-writing module should support degradation, for example writing to a local file or to another, temporary database.
In this way, this scenario can use the reliability first strategy again.
Now let's look at this: following the reliability-first approach, what happens when the switch is triggered by a failure?
Suppose the delay between primary A and standby B is 30 minutes at the moment primary A loses power, and the HA system must promote B to primary. In an active switch you could wait until the delay drops below 5 seconds before switching, but now there is no choice.
With the reliability-first strategy, you would still have to wait until standby B's seconds_behind_master reaches 0 before switching. But the situation is now worse than before: it is not that the system is read-only, it is completely unavailable, because after primary A's power failure, our connections have not yet been switched to standby B.
Can I switch to standby B directly, but keep b read-only?
- It doesn’t work.
Because during this period the relay log has not been fully applied. If the switch is initiated immediately, client queries cannot see transactions that had already completed on the primary, which looks like lost data.
Although these rows reappear as the relay log continues to be applied, for some businesses even a state of temporarily lost data is unacceptable.
On the premise of data reliability, the availability of MySQL high availability system depends on the delay of active and standby. The smaller the delay time, the shorter the service recovery time and the higher the availability when the main database fails.
Question 1:Several cases that cause the delay of master and slave
- High DML concurrency on the primary, and high query QPS on the slave
- The slave's host is under-provisioned, or several slave instances share one host (fierce resource contention, especially IO)
- Primary and slave have different parameter configurations
- Big transactions (DDL included; I consider DDL a big transaction too)
- A backup is running on the slave
- Tables without a primary key (the primary modifies rows via an index, but the standby can only replay with full table scans; in this case the slave_rows_search_algorithms parameter can be tuned)
- A deliberately delayed standby is configured
- The standby is running low on disk space
Question 2: T3 is defined as the time the standby finishes executing the transaction, and seconds_behind_master = T3-T1. Suppose a transaction takes the primary 30 minutes and the standby has not yet executed it. Guess 1: is seconds_behind_master 30 minutes at that moment? Guess 2: or does the standby first need 30 minutes to finish the transaction, after which seconds_behind_master shows 30 minutes?
- The former is 0; the latter is 30 minutes.
Question 3:Under what circumstances will the active standby delay of the standby library be represented as a 45 degree line segment?
There are two typical scenarios for this phenomenon
- One is large transaction (including large table DDL and many rows of a transaction operation);
- Another hidden situation is that the standby database has a long transaction, such as
begin; select * from t limit 1;
Now the primary runs a DDL adding a column to table t. Even though the table is very small, the DDL is blocked when the standby applies it, behind that uncommitted transaction, and it produces exactly this phenomenon.
Main and standby flow chart:
For the parallel replication capability of primary and standby, pay attention to the two black arrows in the figure: one represents clients writing to the primary, the other represents the sql_thread on the standby executing the relay log. If an arrow's thickness represented its degree of parallelism, the real situation is as in Figure 1: the first arrow is clearly thicker than the second.
On the primary, what limits concurrency is locking of various kinds. Since InnoDB supports row locks, except for the extreme scenario in which all concurrent transactions update the same row (a hot row), it is friendly to concurrent business. That is why, in performance tests, the overall throughput with 32 concurrent threads is higher than with a single thread.
Executing the log on the standby is the job of the sql_thread in the figure. If it runs single-threaded, the standby cannot apply the log fast enough, and primary-standby delay results.
Before the official version 5.6, MySQL only supported single thread replication, which resulted in serious active standby latency problems when the concurrency of the main database and TPS were high.
From single threaded replication to the latest version of multi-threaded replication, the evolution has experienced several versions.
In the final analysis, every multi-threaded replication mechanism splits the single sql_thread of Figure 1 into multiple threads, following this model:
In Figure 2, the coordinator is the former sql_thread, except that it no longer updates data directly: it only reads the relay log and distributes transactions. The actual updating is done by worker threads, whose number is controlled by the parameter slave_parallel_workers. A value between 8 and 16 works well (on a 32-core machine); after all, the standby may also have to serve read queries, so it should not consume all the CPUs.
Now consider a first question: can transactions simply be dispatched to workers round-robin, i.e. the first transaction to worker_1, the second to worker_2, and so on?
- In fact, it can't. After transactions are dispatched, workers execute independently, and because of CPU scheduling the second transaction may well finish before the first. If the two transactions update the same row, that means they execute in opposite orders on the primary and on the standby, causing primary-standby inconsistency.
Next, imagine another problem: can multiple update statements of the same transaction be executed by different workers?
- The answer is no. Suppose a transaction updates a row in table t1 and a row in table t2. If the two update statements go to different workers, the final result is still consistent with the primary; but if a query hits the standby at the moment when t1's update is done and t2's is not, it sees the transaction half applied, which breaks the isolation of the transaction logic.
Therefore, the coordinator needs to meet the following two basic requirements when distributing:
- It must not cause updates to be misordered. This requires that two transactions updating the same row be dispatched to the same worker.
- The same transaction cannot be disassembled and must be put into the same worker.
All versions of multithreaded replication follow these two basic principles.
The official MySQL version 5.5 does not support parallel replication.
To help understand how the official MySQL parallel replication strategy iterated, here are two versions of a parallel strategy: distribution by table and distribution by row.
The basic idea of distributing transactions by table is that if two transactions update different tables, they can be in parallel. Because the data is stored in the table, the table distribution can ensure that two workers will not update the same row.
Of course, if there are cross table transactions, the two tables should be considered together. As shown in Figure 3, this is the rule of table distribution.
As you can see, each worker thread corresponds to a hash table that records the tables involved in the transactions currently in this worker's execution queue. The key of the hash table is "database name.table name"; the value is a number indicating how many transactions in the queue modify that table.
When a transaction is assigned to a worker, the tables involved in the transaction will be added to the corresponding hash table. After the worker is executed, the table will be removed from the hash table.
In Figure 3, hash_table_1 shows that worker_1's pending-transaction queue currently contains four transactions involving table db1.t1 and one involving db2.t2; hash_table_2 shows that worker_2 has one queued transaction that will update data in table db1.t3.
Suppose, in the situation of the figure, the coordinator reads a new transaction T from the relay log, and the rows it modifies involve tables t1 and t3.
Now let’s take a look at the allocation rules using the allocation process of transaction t.
- Transaction T modifies table t1, and worker_1's queue contains transactions that modify t1; T and a queued transaction would modify data in the same table, so we say T conflicts with worker_1.
- Following this logic, the coordinator checks T's conflict relationship with each worker queue in turn and finds that T also conflicts with worker_2.
- Since T conflicts with more than one worker, the coordinator thread enters the waiting state.
- Each worker keeps executing and keeps modifying its hash_table. Suppose the transaction in hash_table_2 that modifies t3 finishes first; the entry db1.t3 is then removed from hash_table_2.
- The coordinator now finds that the only worker conflicting with T is worker_1, and assigns T to worker_1.
- The coordinator reads the next relay log entry and continues dispatching transactions.
In other words, the conflict relationship between each transaction and all workers during distribution includes the following three situations:
- If it conflicts with no worker, the coordinator thread assigns the transaction to the most idle worker;
- If there is a conflict with more than one worker, the coordinator thread will enter the waiting state until there is only one worker who conflicts with the transaction;
- If it conflicts with only one worker, the coordinator thread will assign the transaction to the worker with the conflict relationship.
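As an illustration of these three cases, here is a toy coordinator in Python mirroring the Figure 3 example. Worker, dispatch and the "db.table" keys are simplifications of the real implementation, not MySQL's code.

```python
# Sketch of the by-table distribution rule: each worker keeps a hash
# table of "db.table" -> count for its pending queue; the coordinator
# assigns a transaction according to how many workers it conflicts with.

from collections import Counter

class Worker:
    def __init__(self):
        self.pending = Counter()   # "db.table" -> #transactions touching it

def dispatch(workers, trx_tables):
    conflicts = [w for w in workers
                 if any(t in w.pending for t in trx_tables)]
    if len(conflicts) > 1:
        return None                # coordinator must wait
    # Single conflict: must go to that worker; none: pick the idlest one.
    target = conflicts[0] if conflicts else min(
        workers, key=lambda w: sum(w.pending.values()))
    for t in trx_tables:
        target.pending[t] += 1
    return target

w1, w2 = Worker(), Worker()
w1.pending.update({"db1.t1": 4, "db2.t2": 1})
w2.pending.update({"db1.t3": 1})

assert dispatch([w1, w2], ["db1.t1", "db1.t3"]) is None  # conflicts with both
del w2.pending["db1.t3"]          # worker_2 finishes its t3 transaction
assert dispatch([w1, w2], ["db1.t1", "db1.t3"]) is w1    # only w1 conflicts
```

The two assertions retrace the walkthrough: transaction T first waits, then lands on worker_1 once worker_2's t3 transaction drains.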
The scheme of distribution by table has a good effect in the scene of multiple tables with uniform load. However, if you encounter a hot table, for example, when all update transactions involve a table, all transactions will be assigned to the same worker, which will become single thread replication.
To solve the problem of parallel replication of hotspot table, we need a parallel replication scheme by row. The core idea of copy by row is that if two transactions do not update the same row, they can be executed in parallel on the standby database. Obviously, this pattern requires that the binlog format must be row.
At this time, we judge whether a transaction t conflicts with a worker. The rule we use is not “modify the same table”, but “modify the same row”.
The data structure for by-row distribution is similar to that for by-table: each worker is assigned a hash table. To dispatch by row, however, the key must be "database name + table name + unique key value".
However, it is not enough to have only the primary key ID for this “unique key”. We also need to consider the following scenario: in addition to the primary key, there is a unique index a in table t1:
CREATE TABLE `t1` ( `id` int(11) NOT NULL, `a` int(11) DEFAULT NULL, `b` int(11) DEFAULT NULL, PRIMARY KEY (`id`), UNIQUE KEY `a` (`a`) ) ENGINE=InnoDB; insert into t1 values(1,1,1),(2,2,2),(3,3,3),(4,4,4),(5,5,5);
You can see that the primary key values of the rows updated by the two transactions are different. Yet if they are dispatched to different workers and session B's statement happens to execute first, then at that moment the row with id=1 still has a=1, and a unique-key conflict will be reported.
Therefore, the by-row strategy must also take unique keys into account in the transaction hash table; that is, the key should be "database name + table name + index name + value".
For example, suppose the statement update t1 set a=1 where id=2 is executed on table t1; the binlog records all field values of the whole affected row, both before and after the modification.
Therefore, when the coordinator parses the binlog of this statement, the hash table of this transaction has three entries:
- key=hash_func(db1+t1+"PRIMARY"+2), value=2; value is 2 because the id of the row is the same before and after the modification, so this key appears twice.
- key=hash_func(db1+t1+"a"+2), value=1, indicating that the row with a=2 in this table will be affected.
- key=hash_func(db1+t1+"a"+1), value=1, indicating that the row with a=1 in this table will be affected.
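As a concrete illustration, here is a minimal Python sketch (not MySQL's actual implementation; `hash_func`, `row_entries`, and the data shapes are assumptions) of how a coordinator could build these hash-table entries from the before/after row images recorded in the binlog:

```python
def hash_func(key: str) -> str:
    # Stand-in for a real hash function; returning the key itself keeps
    # the demo readable and the entries inspectable.
    return key

def row_entries(db, table, before, after, indexes):
    """Build {hashed key: count} entries for one updated row.
    `indexes` maps index name -> column name; `before`/`after` are the
    row images as recorded in a ROW-format binlog event."""
    entries = {}
    for idx_name, col in indexes.items():
        # Both the before and the after image contribute an entry,
        # because both versions of the row are "touched".
        for image in (before, after):
            key = hash_func(f"{db}+{table}+{idx_name}+{image[col]}")
            entries[key] = entries.get(key, 0) + 1
    return entries

# update t1 set a=1 where id=2: before image (id=2, a=2), after image (id=2, a=1)
entries = row_entries(
    "db1", "t1",
    before={"id": 2, "a": 2},
    after={"id": 2, "a": 1},
    indexes={"PRIMARY": "id", "a": "a"},
)
```

Running this yields exactly the three entries above: the PRIMARY key entry with value 2 (the id is unchanged, so it appears twice), and one entry each for a=2 and a=1.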
Compared with the table-based parallel strategy, the row-based strategy consumes more computing resources when deciding how to distribute transactions. In fact, both schemes share some constraints:
- The table name, primary key value and unique index values must be parseable from the binlog. In other words, the binlog format of the primary database must be ROW;
- A table must have a primary key;
- Tables must not have foreign keys. If a table has foreign keys, the rows changed by cascading updates are not recorded in the binlog, so conflict detection would be inaccurate.
Fortunately, these three constraints were already part of the online usage specifications that DBAs require business developers to follow, so neither parallel replication strategy causes trouble in practice.
Compared with the table-based strategy, the row-based strategy has a higher degree of parallelism. However, for a large transaction that touches many rows, the row-based strategy has two problems:
- It consumes memory. For example, if one statement deletes 1 million rows, the hash table must record 1 million entries.
- It consumes CPU. Parsing the binlog and computing hash values is still expensive for large transactions.
Therefore, implementations of this strategy set a threshold: if a single transaction exceeds it (for example, updates more than 100,000 rows), the scheme temporarily degenerates into single-thread mode. The degradation logic is as follows:
- The coordinator holds this transaction for the time being;
- Wait for all workers to finish executing and their queues to become empty;
- The coordinator executes the transaction directly;
- Restore parallel mode.
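The degradation rule above can be modeled with a toy scheduler. The threshold value, the round-robin dispatch for small transactions, and all names here are illustrative assumptions, not MySQL internals:

```python
ROW_THRESHOLD = 100_000  # e.g. fall back to single-thread above this many rows

def plan_execution(transactions, n_workers):
    """transactions: list of (trx_id, n_rows) in binlog order.
    Returns a list of (executor, trx_id) scheduling decisions."""
    plan, next_worker = [], 0
    for trx_id, n_rows in transactions:
        if n_rows > ROW_THRESHOLD:
            # Hold the big transaction, drain all workers, then the
            # coordinator executes it itself (single-thread mode).
            plan.append(("wait_for_all_workers", trx_id))
            plan.append(("coordinator", trx_id))
            # Parallel mode resumes automatically on the next dispatch.
        else:
            plan.append((f"worker-{next_worker}", trx_id))
            next_worker = (next_worker + 1) % n_workers
    return plan

plan = plan_execution([("t1", 10), ("t2", 1_000_000), ("t3", 10)], n_workers=2)
```

Here t2 exceeds the threshold, so the plan drains the workers, runs t2 on the coordinator alone, and then resumes distributing t3 to a worker.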
The official MySQL 5.6 version supports parallel replication, but only at per-database granularity: in the hash table used for the distribution policy, the key is the database name.
The parallel effect of this strategy depends on the pressure model. If the primary has multiple DBs and the pressure on each DB is balanced, this strategy works very well.
Compared with table-based and row-based distribution, this strategy has two advantages:
- Constructing the hash key is very fast: only the database name is needed. Moreover, the number of DBs on an instance is not large, so there is no need to build millions of entries.
- The binlog format is not restricted, because even statement-format binlog easily yields the database name.
However, if all the tables on your primary are in the same DB, this strategy has no effect; likewise, if different DBs have very different hotness, for example one business-logic library and one system-configuration library, real parallelism is not achieved either.
In theory, you could create different DBs and spread the equally hot tables evenly among them to force this strategy to work. However, because data has to be specifically moved around for this, the strategy is rarely used.
Features of MariaDB’s parallel replication strategy:
- Transactions that can commit in the same group will definitely not modify the same row;
- Transactions that can be executed in parallel on the main database must also be executed in parallel on the standby database.
In terms of implementation, MariaDB does this:
- Transactions committed together in one group have the same commit_id; the next group's is commit_id+1;
- The commit_id is written directly into the binlog;
- When the log reaches the standby for application, transactions with the same commit_id are distributed to multiple workers for execution;
- After this whole group finishes, the coordinator takes the next batch.
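The steps above can be sketched as a batching function, assuming the binlog is modeled as an ordered list of (commit_id, transaction) pairs (an illustrative data shape, not MariaDB's actual format):

```python
from itertools import groupby

def schedule_by_commit_id(binlog):
    """binlog: list of (commit_id, trx_name) in log order.
    Returns batches; transactions in one batch share a commit_id and can
    be handed to multiple workers concurrently, but the next batch only
    starts after the previous one has fully finished."""
    return [
        [trx for _, trx in group]
        for _, group in groupby(binlog, key=lambda entry: entry[0])
    ]

batches = schedule_by_commit_id(
    [(1, "trx1"), (1, "trx2"), (1, "trx3"),
     (2, "trx4"), (2, "trx5"), (2, "trx6")]
)
```

This also makes the weakness discussed next visible: trx4 cannot start until all of trx1..trx3 are done, even if it conflicts with none of them.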
When this strategy came out, it was quite striking, because until then the industry's thinking had been "parse the binlog and split it among workers", whereas the goal of MariaDB's strategy is to "simulate the primary's parallelism".
However, this strategy has a problem: it does not truly simulate the primary's concurrency. On the primary, while one group of transactions is committing, the next group is already in the "executing" state.
As shown in Figure 5, assume three groups of transactions execute on the primary; you can see that trx4, trx5 and trx6 are executing while trx1, trx2 and trx3 commit. Thus when the first group commits, the next group soon enters the commit state.
According to MariaDB’s parallel replication strategy, the execution effect on the standby database is shown in Figure 6.
It can be seen that on the standby, the second group of transactions can only start after the first group has completely finished, so the system's throughput falls short.
In addition, this scheme is easily stalled by large transactions. Suppose trx2 is a very large transaction: when the standby applies this group, trx1 and trx3 finish, but the next group can only start after trx2 has fully executed. During that time only one worker thread is busy, wasting resources.
Even so, this strategy is still a beautiful piece of innovation, because it modifies the original system very little and its implementation is elegant.
After MariaDB implemented parallel replication, official MySQL 5.7 also provided a similar feature, with the parallel replication strategy controlled by the parameter slave-parallel-type:
- Set to DATABASE, meaning MySQL 5.6's per-database parallel strategy;
- Set to LOGICAL_CLOCK, a strategy similar to MariaDB's, though MySQL 5.7 further optimizes the parallelism.
Can all transactions that are simultaneously in the "executing" state be parallelized? No, because some of them may be waiting on locks due to lock conflicts. If such transactions were assigned to different workers on the standby, the standby would end up inconsistent with the primary.
The core of MariaDB's strategy above is that "all transactions in the commit state" can be parallelized: a transaction in the commit state has already passed the lock-conflict test.
Two-phase commit process diagram:
In fact, you don't have to wait until the commit phase: as long as a transaction has reached the redo log prepare phase, it has already passed the lock-conflict test.
Therefore, the idea of MySQL 5.7 parallel replication strategy is as follows:
- Transactions that are in the prepare state at the same time can be executed in parallel on the standby;
- A transaction in the prepare state and one in the commit state can also be executed in parallel on the standby.
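In MySQL 5.7's binlog, each transaction carries a sequence_number and a last_committed value that encode this logical clock. The following is a simplified toy model of the replica-side rule; the real scheduler compares last_committed against the currently executing transactions' low-water mark, and the function and data shapes here are assumptions:

```python
def can_start(trx, finished_seqnos):
    """trx: (sequence_number, last_committed), as carried in the binlog.
    A transaction may start once every transaction it depends on, i.e.
    everything up to its last_committed, has finished applying."""
    _, last_committed = trx
    return all(s in finished_seqnos for s in range(1, last_committed + 1))

finished = {1, 2}                   # transactions 1 and 2 already applied
ok_a = can_start((4, 2), finished)  # depends only on seq <= 2: can start
ok_b = can_start((5, 3), finished)  # still waiting on transaction 3
```

Transactions with the same last_committed were in the prepare phase together on the primary, so the model lets them start concurrently on the standby.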
When the binlog group is submitted, there are two parameters:
- binlog_group_commit_sync_delay: how many microseconds to delay before calling fsync;
- binlog_group_commit_sync_no_delay_count: how many transactions to accumulate before calling fsync.
These two parameters deliberately lengthen the interval between a binlog's write and its fsync, reducing the number of binlog disk writes. In MySQL 5.7's parallel replication strategy, they can also be used to create more "transactions simultaneously in the prepare phase", which increases the standby's replication parallelism.
In other words, these two parameters both "deliberately" slow down commits on the primary and speed up execution on the standby. When dealing with standby lag on MySQL 5.7, you can consider tuning these two parameters to raise the standby's replication concurrency.
In MySQL 5.7.22, released in April 2018, MySQL added a new parallel replication strategy based on WRITESET.
Accordingly, a new parameter binlog-transaction-dependency-tracking controls whether the new strategy is enabled. It has three possible values:
- COMMIT_ORDER: the strategy described above, which decides parallelism by whether transactions entered prepare and commit at the same time.
- WRITESET: compute a hash for each row the transaction updates, forming a set called the writeset. If two transactions do not operate on the same rows, that is, their writesets do not intersect, they can be parallelized.
- WRITESET_SESSION: an extra constraint on top of WRITESET, namely that two transactions executed in sequence by the same thread on the primary must be applied in the same order on the standby.
Of course, for unique identification this hash is computed from "database name + table name + index name + value". If a table has other unique indexes besides the primary key, then for each unique index, the writeset of an insert statement must include one extra hash value.
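The WRITESET rule reduces to a set-disjointness check, sketched below. Python's built-in hash stands in for MySQL's actual hash function, and the tuple format is an assumption for illustration:

```python
def writeset(rows):
    """rows: iterable of (db, table, index_name, value) tuples touched by
    one transaction. Returns the set of their hashes (the writeset)."""
    return {hash((db, tbl, idx, val)) for db, tbl, idx, val in rows}

# Each row contributes one hash per index that identifies it.
trx_a = writeset([("db1", "t1", "PRIMARY", 1), ("db1", "t1", "a", 5)])
trx_b = writeset([("db1", "t1", "PRIMARY", 2), ("db1", "t1", "a", 6)])
trx_c = writeset([("db1", "t1", "a", 5)])  # same unique value a=5 as trx_a

# Two transactions may run in parallel iff their writesets are disjoint.
can_parallel_ab = trx_a.isdisjoint(trx_b)
can_parallel_ac = trx_a.isdisjoint(trx_c)
```

trx_a and trx_b touch no common key, so they can be parallelized; trx_a and trx_c collide on the unique index value a=5, so they cannot.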
This is similar to the row-based distribution strategy we built on MySQL 5.5 earlier. However, the official MySQL implementation still has big advantages:
- The writeset is generated on the primary and written directly into the binlog, so when the standby executes it does not need to parse the binlog content (the row data in the events), saving a lot of computation;
- It is not necessary to scan the whole transaction's binlog to decide which worker to distribute it to, which saves memory;
- Since the standby's distribution strategy does not depend on the binlog content, statement-format binlog is also possible.
Therefore, the parallel replication strategy of MySQL 5.7.22 is universal.
Of course, for the scenarios of "table without a primary key" and "foreign key constraints", the WRITESET strategy also cannot parallelize and temporarily degenerates into single-thread mode.
Question 1: Under MariaDB's parallel strategy, suppose three transactions in the same group all modify the same field of the same row. Since their commit_id is the same, they could be executed in parallel on the standby; how can the order of the three be guaranteed so that the final value of that field is consistent with the primary?
The answer: recall the sequence of steps during commit:
- redo log prepare: write
- binlog: write
- redo log prepare: fsync
- binlog: fsync
- redo log commit: write

If a row is locked, only one transaction can execute; the other transactions wait for the lock.
In step 4, because of the two parameters mentioned above, other transactions without lock conflicts are fsync'ed together, and the transactions flushed together get the same commit_id.
Only after step 5 is the lock released and the waiting transaction allowed to start.
Therefore, it is impossible for transactions that update the same row to have the same commit_id, and the scenario in the question cannot occur.
Question 2: If the primary runs under a single-threaded workload, what value of binlog-transaction-dependency-tracking should be chosen when the standby needs to catch up with the primary?
- The parameter should be set to WRITESET.
Because the primary is under a single-threaded workload, every transaction has a different commit_id; with COMMIT_ORDER, the standby could only execute single-threaded as well.
Similarly, WRITESET_SESSION requires that when the standby applies the log, transactions from the same thread execute in the same order as on the primary, which under a single-threaded workload also degenerates into single-thread replication.
Therefore, binlog-transaction-dependency-tracking should be set to WRITESET.
From: Geek Time, "45 Lectures on MySQL" by Lin Xiaobin
This work adopts a CC license; reprints must credit the author and link to this article.