Summary after reading MySQL (3)


When the master fails, what happens to the slaves?

The basic one-master, multi-slave structure is as follows.


In the figure, the dashed arrows indicate the master/standby relationship: A and A' are standbys of each other, and slaves B, C, and D all point to master A. A one-master, multi-slave setup is generally used for read-write separation. The master handles all writes and part of the reads, while the remaining read requests are shared among the slaves.

Figure 2 shows the result of a master/standby switchover after the master fails.

Compared with the switchover in a one-master, one-standby setup, after a switchover with multiple slaves, A' becomes the new master and slaves B, C, and D must also be repointed to A'. It is precisely this extra step of repointing B, C, and D that makes the switchover more complex.

Position-based master/standby switchover

When we set node B as a slave of node A', we need to execute a CHANGE MASTER command:

CHANGE MASTER TO
MASTER_HOST=$host_name,
MASTER_PORT=$port,
MASTER_USER=$user_name,
MASTER_PASSWORD=$password,
MASTER_LOG_FILE=$master_log_name,
MASTER_LOG_POS=$master_log_pos;

This command has six parameters:

  • MASTER_HOST, MASTER_PORT, MASTER_USER, and MASTER_PASSWORD are the IP address, port, user name, and password of master A'.
  • The last two parameters, MASTER_LOG_FILE and MASTER_LOG_POS, mean that synchronization should continue from position master_log_pos in the master's file master_log_name. This location is what we call the synchronization position: the binlog file name and log offset on the master.

This raises a problem. To set node B as a slave of A', we must execute the CHANGE MASTER command, which means we must supply these two position parameters. But what values should they have?

Node B was originally a slave of A, so the position it recorded locally is a position in A's binlog. For the same log event, however, the position on A differs from the position on A'. Therefore, before B can switch, it has to go through the step of “finding the synchronization position”.

This position is hard to obtain precisely; we can only take an approximate one.

Because no data may be lost during the switchover, we always pick a position that is “slightly earlier”, and then skip the transactions that have already been executed on slave B.

One way to obtain the synchronization position is as follows:

  1. Wait for the new master A' to finish applying all of its relay logs;
  2. Execute SHOW MASTER STATUS on A' to get the current File and Position on A';
  3. Take the time T at which the original master A failed;
  4. Use the mysqlbinlog tool to parse the File on A' and obtain the position at time T:

mysqlbinlog File --stop-datetime=T --start-datetime=T


In the figure, the value 123 after end_log_pos is the position at which instance A' was writing new binlog at time T. We can then use 123 as the value of $master_log_pos in the CHANGE MASTER command on node B.

Of course, this value is not exact. Why?

Suppose that at time T, master A had just finished executing an INSERT statement that inserted a row R and had already sent the binlog to A' and B, and at that moment the host of master A lost power.

The state of the system is then as follows:

  1. On slave B, the binlog has been applied, so row R already exists;
  2. On the new master A', row R also exists, and its log entry was written after position 123;
  3. When we execute CHANGE MASTER on slave B and point it at position 123 of File on A', the binlog that inserts row R is sent to slave B again and executed.

At this point, the replication thread on slave B reports the error Duplicate entry 'id_of_R' for key 'PRIMARY', indicating a primary key conflict, and stops replicating.

So, when switching, we usually need to actively skip these errors. There are two common methods.

The first is to actively skip one transaction. The skip command is written as follows:

set global sql_slave_skip_counter=1;
start slave;

Because more than one transaction may be re-executed during the switchover, we need to keep watching slave B when it first connects to the new master A'. Each time replication stops with one of these errors, execute the skip command once, until replication no longer stops. That way, all the transactions that might be involved are skipped.

The other method is to set the slave_skip_errors parameter so that the specified errors are skipped directly.

Two types of errors are often encountered during a master/standby switchover:

  • Error 1062: a unique key conflict when inserting data;
  • Error 1032: the row cannot be found when deleting data.

Therefore, we can set slave_skip_errors to “1032,1062”, so that these two errors are skipped directly when they are encountered.

Note that this method of skipping errors applies only to the situation described here: during a master/standby switchover, the exact synchronization position cannot be found, and we are establishing the master-slave relationship between a slave and the new master.

In this context, we know very well that skipping errors 1032 and 1062 during the switchover is harmless, so we can set slave_skip_errors this way. After the replication relationship has been established and has run stably for a while, we need to set this parameter back to empty, so that a real master-slave data inconsistency later on is not skipped as well.
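
A minimal sketch of checking this setting from a client session. It assumes the standard server behavior in which slave_skip_errors is a startup option rather than a dynamic variable, so changing it means editing the configuration and restarting the slave. The configuration line in the comment is only an illustration.

```sql
-- On slave B: show the error codes the replication SQL thread currently skips.
-- OFF means that no errors are skipped.
SHOW GLOBAL VARIABLES LIKE 'slave_skip_errors';

-- The variable is not dynamic. During the switchover it would be set in the
-- server configuration, for example (illustrative my.cnf line):
--   slave_skip_errors = 1032,1062
-- and removed again, followed by a restart, once replication is stable.
```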


Both skipping transactions with sql_slave_skip_counter and ignoring errors with slave_skip_errors can eventually establish the master-slave relationship between slave B and the new master A', but both operations are complex and error-prone. Therefore, MySQL 5.6 introduced GTID, which solves this problem completely.

So what is GTID, and how does it solve the problem of finding the synchronization position?

GTID stands for Global Transaction Identifier. It is generated when a transaction is committed and uniquely identifies that transaction. It consists of two parts, in the following format:

GTID=server_uuid:gno

Where:

  • server_uuid is generated automatically when an instance starts for the first time, and is a globally unique value;
  • gno is an integer with an initial value of 1. Each time a transaction is committed, the current value is assigned to that transaction and then incremented by 1.

In the official MySQL documentation, the GTID format is defined as:

GTID=source_id:transaction_id

Here source_id is the server_uuid. The second part, transaction_id, is easily misunderstood, so this article calls it gno instead. Why is transaction_id easily misunderstood?

Because in MySQL, transaction_id normally refers to the transaction ID, which is allocated while the transaction is executing. Even if the transaction is rolled back, the transaction ID is still incremented, whereas gno is allocated only when the transaction commits.

As a result, GTIDs are usually consecutive, so using gno for this part is easier to understand.

Enabling GTID mode is also simple: start the MySQL instance with the parameters gtid_mode=on and enforce_gtid_consistency=on.
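
Whether an instance is actually running in GTID mode can be checked from any client session. This is a small sketch that uses only standard system variables.

```sql
-- Confirm that GTID mode and GTID consistency enforcement are both ON.
SELECT @@GLOBAL.gtid_mode, @@GLOBAL.enforce_gtid_consistency;

-- The server_uuid part of every GTID generated by this instance.
SELECT @@GLOBAL.server_uuid;
```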

In GTID mode, every transaction corresponds to exactly one GTID. There are two ways this GTID can be generated, and which one is used depends on the value of the session variable gtid_next.

  1. If gtid_next=automatic, the default value is used. MySQL then assigns server_uuid:gno to the transaction.
    a. When writing the binlog, it first records a line SET @@SESSION.GTID_NEXT='server_uuid:gno';
    b. It adds this GTID to the GTID set of this instance.
  2. If gtid_next is set to a specific GTID, for example by SET gtid_next='current_gtid', there are two possibilities:
    a. If current_gtid already exists in the GTID set of the instance, the next transaction is simply ignored by the system;
    b. If current_gtid does not exist in the GTID set of the instance, current_gtid is assigned to the next transaction to be executed. The system does not need to generate a new GTID for this transaction, so gno is not incremented.

Note that a current_gtid can be used by only one transaction. After that transaction commits, to execute another transaction you must run a SET command to set gtid_next to another GTID or to automatic.

In this way, each MySQL instance maintains a GTID set, which corresponds to “all transactions this instance has executed”.
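
A short sketch of inspecting that set and testing membership with the built-in GTID_SUBSET function. The UUID in the literal is a made-up placeholder.

```sql
-- The GTID set of this instance: every transaction it has executed.
SELECT @@GLOBAL.gtid_executed;

-- Returns 1 if the given GTID is already in that set, 0 otherwise.
-- The UUID below is a made-up placeholder.
SELECT GTID_SUBSET('aaaaaaaa-cccc-dddd-eeee-ffffffffffff:10',
                   @@GLOBAL.gtid_executed);
```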

A simple example illustrates the basic usage of GTID.

Create a table t on instance X and insert one row:

CREATE TABLE `t` (
  `id` int(11) NOT NULL,
  `c` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
);

insert into t values(1,1);


As you can see, before the BEGIN of each transaction there is a SET @@SESSION.GTID_NEXT command. If instance X has a slave, then when the binlog of the CREATE TABLE and INSERT statements is replicated and executed there, these two SET commands are executed before the corresponding transactions. The GTIDs added to the slave's GTID set are therefore the same two GTIDs shown in the figure.

Suppose that instance X is itself a slave of another instance Y, and that the following insert statement is executed on instance Y:

insert into t values(1,1);

Suppose also that the GTID of this statement on instance Y is “aaaaaaaa-cccc-dddd-eeee-ffffffffffff:10”.

Instance X, as a slave of Y, will then replicate this transaction and execute it. Obviously this causes a primary key conflict, which stops the replication thread on instance X. What should be done?
You can execute the following sequence of statements:

set gtid_next='aaaaaaaa-cccc-dddd-eeee-ffffffffffff:10';
begin;
commit;
set gtid_next=automatic;
start slave;

The first three statements add this GTID to the GTID set of instance X by committing an empty transaction. Figure 5 shows the result of SHOW MASTER STATUS after this empty transaction has been executed.


You can see that this GTID has been added to the Executed_Gtid_Set of instance X.

When START SLAVE is then executed and the replication thread runs again, instance X still receives the transaction sent by instance Y. But because “aaaaaaaa-cccc-dddd-eeee-ffffffffffff:10” already exists in the GTID set of instance X, instance X skips the transaction directly, and the primary key conflict no longer occurs.

In the sequence above, START SLAVE is preceded by SET gtid_next=automatic. This statement restores the default GTID allocation behavior: if a new transaction is executed later, it is again allocated a GTID in the usual way, continuing with gno=3.

GTID-based master/standby switchover

In GTID mode, the syntax for making slave B a slave of the new master A' is as follows:

CHANGE MASTER TO
MASTER_HOST=$host_name,
MASTER_PORT=$port,
MASTER_USER=$user_name,
MASTER_PASSWORD=$password,
master_auto_position=1;

Here, master_auto_position=1 indicates that this master-slave relationship uses the GTID protocol. As you can see, the MASTER_LOG_FILE and MASTER_LOG_POS parameters no longer need to be specified.

Let set_a denote the GTID set of instance A' at this moment and set_b the GTID set of instance B. When START SLAVE is executed on instance B, the logic for fetching the binlog is as follows:

  1. Instance B specifies master A' and establishes a connection using the master/standby protocol.
  2. Instance B sends set_b to master A'.
  3. Instance A' computes the difference between set_a and set_b, that is, the set of all GTIDs that exist in set_a but not in set_b, and checks whether its local binlog contains all the transactions in this difference.
    a. If it does not, A' has already deleted binlog that instance B needs, and an error is returned directly;
    b. If everything is present, A' finds the first transaction in its own binlog files that is not in set_b and sends it to B;
  4. Starting from this transaction, A' then reads the files onward and sends the binlog to B in order for execution.
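
The set arithmetic in step 3 can be reproduced by hand with MySQL's built-in GTID functions. This is only a sketch of the comparison, not the server's internal code path, and the two literal GTID sets are made-up values standing in for set_a and set_b.

```sql
-- Made-up example values for set_a (master A') and set_b (slave B).
SET @set_a = 'aaaaaaaa-cccc-dddd-eeee-ffffffffffff:1-20';
SET @set_b = 'aaaaaaaa-cccc-dddd-eeee-ffffffffffff:1-15';

-- Transactions that exist on A' but not on B: the ones B still needs.
SELECT GTID_SUBTRACT(@set_a, @set_b) AS needed_by_b;

-- Run on A': 1 means every purged transaction is already on B, so A' can
-- still serve everything B is missing; 0 corresponds to the error in step 3a.
SELECT GTID_SUBSET(@@GLOBAL.gtid_purged, @set_b) AS can_serve_b;
```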

This logic embodies a design idea: in a GTID-based master-slave relationship, the system assumes that once the relationship is established, the log the master sends to the slave must be complete. Therefore, if the log that instance B needs no longer exists, A' refuses to send any log to B.

This differs from the position-based master/standby protocol, in which the standby decides: whatever position the standby specifies, the master sends the log from that position, without checking whether the log is complete.

Let us now look at how the switchover works in a one-master, multi-slave setup once GTID has been introduced.

Since there is no need to find a position, slaves B, C, and D only need to execute the CHANGE MASTER command to point to instance A'.
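
As a rough sketch, the statements run on each of B, C, and D might look like the following. The host name, port, and account are placeholder values, not taken from a real deployment.

```sql
-- Run on each slave (B, C, D). All connection values are placeholders.
STOP SLAVE;

CHANGE MASTER TO
  MASTER_HOST = 'a-prime.example.internal',
  MASTER_PORT = 3306,
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'repl_password',
  MASTER_AUTO_POSITION = 1;

START SLAVE;

-- Auto_Position: 1 in the output confirms the GTID protocol is in use.
SHOW SLAVE STATUS;
```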

Strictly speaking, it is not that the switchover no longer needs to find a position; rather, the work of finding it is done automatically inside instance A'. Because this is automatic, it is very friendly to developers of HA systems.

After that, the system is written through the new master A', and the GTIDs in the binlog that A' generates have the format server_uuid_of_A':1-M.

If the GTID set of slave B previously had the format server_uuid_of_A:1-N, then after the switchover it becomes server_uuid_of_A:1-N, server_uuid_of_A':1-M.

Of course, master A' was previously a standby of A, so the GTID set of A' has the same format as that of slave B.

GTID and online DDL

Earlier, when discussing slow-query performance problems at business peaks, we said that if the problem is caused by a missing index, it can be solved by adding the index online. To avoid the impact of adding the index on the master's performance, however, we can add the index on the standby first and then switch.

In a dual-master (double-M) structure, DDL statements executed on the standby are also sent to the master. To keep them from affecting the master after being sent back, we use SET sql_log_bin=off to turn off binlog for the session.

This way the index is added to the database, but the binlog does not record the change. Does this leave the data and the log inconsistent?

Suppose the two mutually master/standby instances are again instance X and instance Y, the current master is X, and GTID mode is on. The switchover procedure can then be as follows:

  • Execute STOP SLAVE on instance X.
  • Execute the DDL statement on instance Y. Note that binlog does not need to be turned off here.
  • After it completes, find the GTID of this DDL statement and record it as server_uuid_of_Y:gno.
  • Execute the following sequence of statements on instance X:
    set GTID_NEXT="server_uuid_of_Y:gno";
    begin;
    commit;
    set gtid_next=automatic;
    start slave;
    The purpose of this is to let the change on instance Y have a binlog record while ensuring that the change is not executed on instance X.
  • Next, perform the master/standby switchover, and then repeat the above procedure.
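
A sketch of the whole procedure with a concrete DDL statement. The table t, the index name idx_c, and the GTID value are illustrative assumptions; the real GTID has to be read from instance Y after the DDL runs.

```sql
-- Step 1, on instance X (current master): stop applying changes from Y.
STOP SLAVE;

-- Step 2, on instance Y (standby): run the DDL with binlog left on.
ALTER TABLE t ADD INDEX idx_c (c);

-- Step 3, on instance Y: assuming nothing else was written on Y meanwhile,
-- the newest GTID under Y's own server_uuid is the one given to the DDL.
SELECT @@GLOBAL.server_uuid, @@GLOBAL.gtid_executed;

-- Step 4, on instance X: commit an empty transaction under that GTID
-- (placeholder value shown) so X never executes the DDL itself.
SET gtid_next = 'bbbbbbbb-cccc-dddd-eeee-ffffffffffff:7';
BEGIN;
COMMIT;
SET gtid_next = automatic;
START SLAVE;
```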

Question 1: In practice, master/standby replication seems to be the most widely used high-availability solution for MySQL.
However, the master/standby scheme has quite a few problems:

  1. If the master goes down before the binlog has been transmitted, committed transaction data is lost.
  2. With one master and multiple slaves, even with semi-sync, the binlog is only guaranteed to exist on at least two machines. There is no mechanism to choose the slave with the most complete binlog as the new master.
  3. The switchover involves manual operations rather than being fully automated. Even with GTID, the needed binlog may have been deleted, in which case the slave has to be rebuilt.
  4. If both instances of a mutual master/standby pair go down, MySQL becomes unavailable.

MySQL ought to have more powerful and complete high-availability solutions (similar to the Zab or Raft protocols), so why is master/standby still the most widely used in real environments?

  • Point 3: this can be automated.
  • Point 4: the probability is relatively low. In fact, even a three-node scheme cannot survive two instances failing, so this is not a problem specific to MySQL master/standby.

MySQL now provides many options, and many of the choices are the result of business trade-offs.

  • For example, with asynchronous replication, data may be lost when the master loses power unexpectedly.
  • Knowing this, some businesses have switched to semi-sync, but others still keep asynchronous replication because semi-sync has a performance impact (about 35% at the beginning and about 15% now, depending on the environment). Those businesses may decide that losing one or two rows can be compensated for from application-layer logs, so they keep asynchronous replication.

Finally, there are also historical reasons why master/standby is used most. Many years ago, when MySQL was just becoming popular, people found master/standby mode very convenient, so everyone used it.
Solutions based on other protocols all appeared later, and bugs kept turning up in them.
When it comes to online services, enthusiasm for new solutions tends to stay confined to the test environment.

Semi-sync has also only become stable in recent years, and some companies now use it as the default configuration.

Question 2: In GTID mode, if a new slave is connected to the master but the binlog it needs is gone, what should be done?

  1. If the business tolerates inconsistency between master and slave, first execute show global variables like 'gtid_purged' on the master to get the GTID set that the master has already purged, say gtid_purged1. Then, on the slave, execute reset master, followed by set global gtid_purged='gtid_purged1', and finally start slave. Replication then starts from the binlog that still exists on the master. The data in the missing part of the binlog may be lost on the slave, leaving master and slave inconsistent.
  2. If master and slave data must be consistent, it is best to rebuild the slave.
  3. If another slave still retains the full binlog, you can attach the new slave to that slave first. After it has caught up, you can attach it back to the master if necessary.
  4. If the binlog has been backed up, you can first apply the missing binlog on the slave and then execute start slave.
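
Option 1 above, written out as statements. This is a sketch only: the GTID set literal is a made-up placeholder for whatever the master reports, and RESET MASTER also discards the slave's own binlog files, so it should be run only on a slave whose binlog is not needed.

```sql
-- On the master: read the set of GTIDs whose binlog has been purged.
SHOW GLOBAL VARIABLES LIKE 'gtid_purged';

-- On the new slave: clear its own GTID state, then declare the purged
-- set as already applied. The value below is a made-up placeholder for
-- whatever the master reported.
RESET MASTER;
SET GLOBAL gtid_purged = 'aaaaaaaa-cccc-dddd-eeee-ffffffffffff:1-100';
START SLAVE;
```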

What are the pitfalls of read-write separation?

The main goal of read-write separation is to share the load on the master. In the structure in Figure 1, the client performs load balancing itself. In this mode, the database connection information is usually kept in the client's connection layer; in other words, the client chooses which back-end database to query.

Another architecture places an intermediate proxy layer between MySQL and the client. The client connects only to the proxy, and the proxy decides how to route each request based on its type and context.

Let us look at the characteristics of the client direct-connection architecture and the read-write separation architecture with a proxy.

  1. Because there is no proxy forwarding layer, query performance with client direct connection is slightly better, the overall architecture is simple, and troubleshooting is easier. However, because the client needs to know the back-end deployment details, it is aware of operations such as master/standby switchovers and database migrations and has to adjust its connection information when they happen.
    You might think this makes the client too cumbersome, with a lot of redundant information and an ugly architecture. In practice that is not necessarily so: such an architecture usually comes with a component responsible for managing the back end, such as ZooKeeper, so that the business side can focus on business logic.
  2. The architecture with a proxy is friendlier to the client. The client does not need to care about back-end details; connection maintenance and back-end information maintenance are all handled by the proxy. But this places higher demands on the back-end maintenance team, and the proxy itself also needs a high-availability architecture, so the overall system with a proxy is relatively complex.

At present, the trend is toward the architecture with a proxy.

Because of master/standby delay, a client that runs a query on a slave right after an update may read a state from before the update. This phenomenon of “reading an expired state of the system from a slave” is referred to here as an “overdue read”.

With either architecture, the client wants the result of a query on a slave to be the same as the result on the master.

How to deal with overdue reads:

  • the force-to-master scheme;
  • the sleep scheme;
  • the scheme of checking that master and standby have no delay;
  • the scheme of cooperating with semi-sync;
  • the scheme of waiting for the master's position;
  • and so on.

The force-to-master scheme

The force-to-master scheme essentially classifies query requests. They can usually be divided into two categories:

  1. Requests that must get the latest result are forced to go to the master. For example, on a trading platform, after a seller publishes a product, the seller immediately returns to the main page to check whether it was published successfully. This request needs the latest result, so it must go to the master.
  2. Requests that can tolerate old data are sent to a slave. On the same platform, when a buyer browses a shop page, it is acceptable to see the newest product a few seconds late. This kind of request can go to a slave.

You might say this scheme dodges the difficulty and is a bit of a trick, but in practice it is the one used most.

Of course, its biggest problem is that sometimes the requirement is that “no query may be an overdue read”, as in some financial businesses. In that case read-write separation has to be abandoned: all read and write load falls on the master, which amounts to giving up scalability.

Sleep scheme

After the master is updated, sleep for a moment before reading from a slave. Concretely, this is similar to executing a select sleep(1) command.

The assumption behind this scheme is that in most cases the master/standby delay is under 1 second, so sleeping first gives a high probability of reading the latest data.

The first impression of this scheme is probably that it is unreliable and that nobody would use it. You might also say that executing a sleep statement directly when a query is issued makes for a poor user experience.

However, the idea really can solve the problem to some extent.

Take the seller publishing a product as an example. After the product is published, AJAX (Asynchronous JavaScript and XML) is used to display what the client entered directly on the page as a “new product”, instead of actually querying the database.

The seller can confirm from this display that the product has been published successfully. By the time the seller refreshes the page to view the product, some time has passed, the purpose of the sleep has been achieved, and the overdue read is avoided.

In other words, the sleep scheme does solve the overdue read problem in scenarios like this. Strictly speaking, though, its problem is that it is imprecise, in two senses:

  1. If the query could already get the correct result from the slave after 0.5 seconds, it still waits for 1 second;
  2. If the delay exceeds 1 second, an overdue read still occurs.

The scheme of checking that master and standby have no delay

There are usually three ways to make sure the standby has no delay.
The value of Seconds_Behind_Master in the SHOW SLAVE STATUS result can be used to measure the length of the master/standby delay.

The first method: before each query request is executed on a slave, check whether Seconds_Behind_Master is already 0. If it is not, wait until it becomes 0 before executing the query.

The unit of Seconds_Behind_Master is seconds. If that is not precise enough, you can instead compare positions or GTID sets to make sure there is no delay between master and standby; these are the second and third methods discussed next.

Figure 3 shows part of a SHOW SLAVE STATUS result.


The second method: compare positions to make sure there is no delay between master and standby.

  • Master_Log_File and Read_Master_Log_Pos represent the latest position read from the master;
  • Relay_Master_Log_File and Exec_Master_Log_Pos represent the latest position executed by the standby.

If Master_Log_File equals Relay_Master_Log_File and Read_Master_Log_Pos equals Exec_Master_Log_Pos, the log that has been received has been fully applied.
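
For orientation, a sketch of where these fields come from. The statement is standard; the comments simply restate the comparison described above.

```sql
-- On the slave. With the mysql command-line client, ending the statement
-- with \G instead of ; prints one field per line, which is easier to read.
SHOW SLAVE STATUS;

-- Fields to compare in the output:
--   Master_Log_File        vs  Relay_Master_Log_File
--   Read_Master_Log_Pos    vs  Exec_Master_Log_Pos
-- Both pairs equal: everything received has been applied.
-- Seconds_Behind_Master gives the coarser, second-level view of method one.
```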

The third method: compare GTID sets to make sure there is no delay between master and standby.

  • Auto_Position=1 indicates that the master-slave relationship uses the GTID protocol;
  • Retrieved_Gtid_Set is the set of GTIDs of all the log the standby has received;
  • Executed_Gtid_Set is the set of all GTIDs the standby has executed.

If the two sets are the same, the log received by the standby has been fully applied.
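
Assuming MySQL 5.7 or later, where the received set is also exposed in performance_schema, the comparison can be written as a single query. This is a sketch; on older versions the two sets would be read from the SHOW SLAVE STATUS output instead. The query checks containment rather than strict equality, because the executed set can also include transactions that did not arrive through this channel.

```sql
-- On the slave: 1 means every received transaction has been executed.
SELECT GTID_SUBSET(RECEIVED_TRANSACTION_SET, @@GLOBAL.gtid_executed)
         AS received_all_applied
FROM performance_schema.replication_connection_status;
```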


As you can see, comparing positions or GTID sets is more accurate than checking whether Seconds_Behind_Master is 0.

Compared with the sleep scheme, this approach is much more accurate, but it is still not fully “precise”. Why?

Consider the states a transaction's binlog goes through between master and standby:

  1. The master finishes executing, writes the binlog, and returns to the client;
  2. The binlog is sent from the master to the standby, and the standby receives it;
  3. The standby executes the binlog.

The logic above for judging that there is no master/standby delay is “all log received by the standby has been executed”. But from the states just listed, it is easy to see that part of the log can be in a state where the client has already received the commit confirmation while the standby has not yet received the log.

Figure 4 shows such a state.


At this point, three transactions, trx1, trx2, and trx3, have been executed on the master:

  1. trx1 and trx2 have been sent to the slave and executed there;
  2. trx3 has finished executing on the master and the result has been returned to the client, but it has not yet been sent to the slave.

If a query is executed on slave B at this moment, then according to the logic above the slave considers itself to have no replication delay, yet trx3 cannot be found. Strictly speaking, this is an overdue read.

How can this problem be solved?

Cooperating with semi-sync

To solve it, semi-synchronous replication (semi-sync) is introduced.

Semi-sync is designed as follows:

  1. When a transaction is committed, the master sends the binlog to the slave;
  2. After receiving the binlog, the slave sends an ack back to the master to indicate receipt;
  3. Only after receiving this ack does the master return the “transaction completed” confirmation to the client.

In other words, with semi-sync enabled, every transaction whose confirmation has been sent to the client is guaranteed to have had its log received by the standby.
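
For reference, a minimal sketch of turning semi-sync on with the plugin names used in MySQL 5.5 through 5.7 on Linux. The shared-library file names and the timeout value are assumptions to adapt to the actual installation; newer releases also ship equivalents under source/replica naming.

```sql
-- On the master:
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
-- Milliseconds to wait for an ack before falling back to asynchronous
-- replication (10000 is the default).
SET GLOBAL rpl_semi_sync_master_timeout = 10000;

-- On each slave:
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
-- Restart the I/O thread so the slave registers as semi-sync.
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;
```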

If the master loses power while some binlog has not yet been sent to the slave, will the system lose data?

  • The answer: with ordinary asynchronous replication, data may be lost, but semi-sync solves this problem.

In this way, semi-sync combined with the earlier position check ensures that a query executed on the slave avoids overdue reads.

However, the semi-sync plus position check scheme only holds for a one-master, one-standby setup. With one master and multiple slaves, the master returns the confirmation to the client as soon as it receives an ack from any one slave. A query executed on a slave then falls into one of two cases:

  1. If the query lands on the slave that sent the ack, it is guaranteed to read the latest data;
  2. If the query lands on another slave, that slave may not yet have received the latest log, and an overdue read results.

The position-check scheme has another potential problem: at business peaks, when the master's position or GTID set advances very quickly, the equality checks on the two positions may never be satisfied, and the slave may be unable to respond to query requests for a long time.

Going back to the original business logic, though, after a query request is issued we do not actually need to wait until master and standby are fully in sync to get a correct result.

Take a look at this sequence diagram.

Figure 5 shows a bad case for the wait-for-position scheme. In the figure, the two dashed boxes under standby B represent the transactions in the relay log and the binlog, respectively. As you can see, from state 1 to state 4, the standby is always lagging by one transaction.

Standby B keeps lagging behind master A until state 4. With the scheme above, the select statement cannot be executed until state 4.

In fact, however, the client issues the select statement after sending the trx1 update, so it only needs to make sure that trx1 has been executed. In other words, if the query is executed in state 3, the expected result is already returned.

To summarize, the semi-sync scheme has two problems:

  1. With one master and multiple slaves, overdue reads still occur on some slaves;
  2. With continuous delay, queries may wait too long.

The scheme of waiting for the master's position, described next, solves both problems.

Equal main library site scheme

Let’s start with a command:

select master_pos_wait(file, pos[, timeout]);

The logic of this command is as follows:

  1. It is executed on the slave;
  2. The parameters file and pos refer to a binlog file name and position on the master;
  3. timeout is optional. If it is set to a positive integer N, the function waits at most N seconds.

The normal return value of this command is a positive integer M, which indicates how many transactions were executed from the moment the command started until the binlog position given by file and pos had been applied. (The MySQL manual words this as the number of log events the replica had to wait for.)

Besides the normal positive integer M, the command can return other results:

  1. NULL if the replication thread on the standby hits an exception during execution;
  2. -1 if the wait exceeds N seconds;
  3. 0 if the position had already been executed when the command started.

For the sequence in Figure 5, where trx1 is executed first and a query follows, the following logic ensures that the query reads correct data:

  1. Right after the trx1 update finishes, execute show master status on the master to get the File and Position it has currently executed to;
  2. Pick a slave to run the query;
  3. On that slave, execute select master_pos_wait(File, Position, 1);
  4. If the return value is an integer >= 0, run the query on this slave;
  5. Otherwise, run the query on the master.
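
Steps 4 and 5 can be sketched as a small decision function. This is a minimal illustration in Python, not code from any MySQL client library: the function name is hypothetical, and it assumes the caller has already run master_pos_wait on the chosen slave and passes in the value it returned, with None standing for SQL NULL.

```python
from typing import Optional


def route_after_master_pos_wait(result: Optional[int]) -> str:
    """Decide where to run the read, given the value returned by
    select master_pos_wait(File, Position, 1) on the chosen slave.

    None models SQL NULL (replication thread error), -1 is a timeout,
    and any integer >= 0 means the slave has applied the position.
    """
    if result is not None and result >= 0:
        return "slave"
    # NULL or -1: do not risk an overdue read, fall back to the master.
    return "master"
```

Treating NULL and -1 the same way keeps the caller simple: anything other than a confirmed catch-up is handled by the degradation path.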

The process is as follows:

Summary after reading MySQL (3)

Here we assume that the select query waits at most 1 second on the slave. If master_pos_wait returns an integer greater than or equal to 0 within that second, the result of the query on this slave is guaranteed to include trx1.

Step 5, falling back to the master, is the usual degradation mechanism of this kind of scheme. The delay of a slave cannot be controlled and the query cannot wait indefinitely, so on timeout it should give up and query the master.

If every slave is delayed by more than 1 second, does all the query pressure land on the master? Yes, it does.

However, since overdue reads are not allowed, there are only two options: give up after the timeout, or query the master. Which one to choose depends on the business, and the business side needs a rate-limiting strategy in place.

GTID scheme

If GTID mode is enabled on your database, there is a corresponding wait-for-GTID scheme.

MySQL provides a similar command:

 select wait_for_executed_gtid_set(gtid_set, 1);

The logic of this command is:

  1. Wait until the transactions executed on this instance include the given gtid_set, then return 0;
  2. Return 1 on timeout.

In the previous scheme, show master status has to be run on the master after the transaction completes. Since MySQL 5.7.6, the GTID of an update transaction can be returned to the client when the transaction completes, so the wait-for-GTID scheme saves one query.

The flow of the wait-for-GTID scheme then becomes:

  1. After the trx1 update finishes, take the GTID of the transaction directly from the return packet and call it gtid1;
  2. Pick a slave to run the query;
  3. On that slave, execute select wait_for_executed_gtid_set(gtid1, 1);
  4. If the return value is 0, run the query on this slave;
  5. Otherwise, run the query on the master.
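
The whole flow can be sketched with the database calls injected as plain callables, so the routing logic is visible without a real connection. Everything named here (read_after_write, wait_on_slave, query_slave, query_master) is a hypothetical stand-in for your own client code; only the 0 and 1 return values come from wait_for_executed_gtid_set as described above.

```python
from typing import Callable


def read_after_write(gtid1: str,
                     wait_on_slave: Callable[[str, int], int],
                     query_slave: Callable[[], object],
                     query_master: Callable[[], object],
                     timeout_s: int = 1):
    """Run a read that must see the transaction identified by gtid1.

    wait_on_slave(gtid, timeout) stands in for executing
    select wait_for_executed_gtid_set(gtid, timeout) on the slave:
    it returns 0 once the GTID is applied and 1 on timeout.
    """
    if wait_on_slave(gtid1, timeout_s) == 0:
        return query_slave()
    # Timed out: degrade to the master rather than risk an overdue read.
    return query_master()
```

A caller that prefers to give up after the timeout, instead of adding load to the master, could pass a query_master callable that raises an error.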

As with the wait-for-master-site scheme, the business side has to decide whether to query the master directly after a timeout.

The process is as follows:

Summary after reading MySQL (3)

In step 1 above, the GTID of the transaction is taken directly from the return packet after the trx1 update finishes. The question is, how does MySQL put the GTID into the return packet after executing the transaction?

  • Set the parameter session_track_gtids to OWN_GTID, then use the API function mysql_session_track_get_first to parse the GTID value out of the return packet.

Question 1:
About MySQL_ reset_ How to use this kind of interface?

For example, to have the client display the GTID returned after a transaction commits, the MySQL client code can be modified as follows:

Summary after reading MySQL (3)

With this change, the GTID value is displayed once the statement has finished executing.

Summary after reading MySQL (3)

This is just an example. To use this scheme, call the mysql_session_track_get_first function in your own client code.

Question 2:
If we use a wait-for-site scheme such as GTID for read-write separation, what happens when we run DDL on a large table?

Suppose the statement takes 10 minutes to execute on the master and, after it commits, another 10 minutes to be applied on the standby (a typical large transaction). Then the GTIDs of transactions committed on the master after the DDL only appear on the standby after a 10-minute wait.

So for those 10 minutes, reads in the read-write separation mechanism time out and then go to the master.

A planned operation like this should be done during off-peak hours, when the master can carry all business queries: switch all read requests to the master, then run the DDL on the master. Once the standby has caught up, switch read requests back to the standby.

The point of this question is the impact of large transactions on wait-for-site schemes.

Of course, using gh-ost to solve this problem is also a good choice.

Question 3:
If a single server decides whether the database has a problem, the update method is used: a master-slave architecture needs one update statement, and a dual-master architecture needs two. The big drawback is that a single judging process is itself a single point of failure. Therefore an odd number of monitoring processes is deployed, and a switch is triggered only if more than half of them consider the database faulty.
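
The majority rule can be written down in a few lines. This is only a sketch under the assumption that each monitoring process reports a boolean verdict; the function name should_switch is hypothetical.

```python
def should_switch(verdicts):
    """verdicts: one boolean per monitoring process, True = 'database faulty'.

    Switch only if strictly more than half of the monitors report a fault.
    """
    if len(verdicts) % 2 == 0:
        raise ValueError("use an odd number of monitors to avoid ties")
    return sum(verdicts) > len(verdicts) // 2
```

With three monitors, two fault reports trigger a switch and one does not.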
1. Should innodb_thread_concurrency be set in proportion to the number of CPU cores, generally about 1.5 to 2 times?

  • In theory about twice the number of cores is best, but many people now run MySQL on virtual machines that are given only 1 or 2 cores.

2. When the disk is full, why can nobody log in to the database or connect at all, let alone execute a select statement?

  • A full disk does not in itself break connections. But because the disk is full, transactions cannot commit, so the business side may retry them. The retried transactions also block in the commit phase, and as they pile up they may use up all the connections.

Question 4:
During external detection, the master and standby use the same update statement, which causes a row conflict and stops replication.

External detection only looks at how long the update statement takes to return, and the health_check table exists on both the master and the standby. Why would that cause a row conflict? Why would replication stop?

  • For example, suppose both tables are empty at the beginning,
    and each side then executes its first statement:
    insert into mysql.health_check(id, t_modified) values (1, now()) on duplicate key update t_modified=now();
    Each side writes the binlog of an insert statement, and replication stops when that log reaches the other side.

How to judge whether a database has a problem

There are two kinds of master/standby switch: active and passive. A passive switch is usually initiated by the HA system because the master has a problem.

Select 1 judgment

A successful select 1 only shows that the database process is still alive; it does not show that the master is healthy. Consider the following scenario.

set global innodb_thread_concurrency=3;

CREATE TABLE `t` (
  `id` int(11) NOT NULL,
  `c` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

insert into t values(1,1);

Summary after reading MySQL (3)

The innodb_thread_concurrency parameter sets the upper limit on concurrent threads inside InnoDB. Once the number of concurrent threads reaches this value, new requests that InnoDB receives wait until a thread exits.

Here innodb_thread_concurrency is set to 3, so InnoDB allows only three threads to execute in parallel. In our example, sleep(100) in the first three sessions keeps those three statements in the “executing” state to simulate large queries.

In session D, select 1 succeeds, but the statement that queries table t is blocked. So if select 1 is used to check whether the instance is healthy at this point, it detects nothing wrong.

The default value of innodb_thread_concurrency is 0, which means there is no limit on the number of concurrent threads. Leaving it unlimited is definitely not a good idea, though: a machine has a limited number of CPU cores, and if all threads rush in, the cost of context switching becomes too high.

Therefore it is generally recommended to set innodb_thread_concurrency to a value between 64 and 128. You may then ask what use a limit of 128 concurrent threads is, when production systems routinely have thousands of concurrent connections.

Concurrent connections and concurrent queries are not the same concept. The thousands of connections you see in the result of show processlist are concurrent connections; the statements that are “currently executing” are the concurrent queries.

Thousands of concurrent connections have little impact beyond using more memory. What we should watch is concurrent queries, because too many of them are a CPU killer. That is why innodb_thread_concurrency needs to be set.

If innodb_thread_concurrency is set to 128, will the 128 slots be used up quickly when many threads update the same hot row? Will the whole system hang?

  • In fact, after a thread enters lock wait, the count of concurrent threads is decremented by one. That is, threads waiting for a row lock (including gap locks) are not counted in the 128.

This design makes good sense. Threads waiting for a lock no longer consume CPU and, more importantly, the design is needed to keep the whole system from locking up.

Why? Suppose threads waiting for a lock did count towards the number of concurrent threads:

  1. Thread 1 executes begin; update t set c=c+1 where id=1; which starts transaction trx1, and then stays in that state. The thread is now idle and is not counted as a concurrent thread.
  2. Threads 2 to 129 each execute update t set c=c+1 where id=1; and enter the waiting state because of the row lock. Now 128 threads are waiting;
  3. If the count were not decremented for threads in lock wait, InnoDB would consider the thread limit reached and stop other statements from entering the engine, so thread 1 could not commit its transaction. The other 128 threads would stay in lock wait, and the whole system would be blocked.
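
The argument in these three steps can be checked with a toy admission model. It is a simplified sketch, not InnoDB's actual implementation: can_enter_engine and its count_lock_waiters switch are hypothetical names, and the model only compares a slot count against the limit.

```python
def can_enter_engine(limit, running, lock_waiting, count_lock_waiters):
    """Return True if one more statement may enter InnoDB.

    limit              -- innodb_thread_concurrency (0 = unlimited)
    running            -- threads actively executing inside the engine
    lock_waiting       -- threads blocked waiting for a row lock
    count_lock_waiters -- hypothetical switch: whether lock waiters
                          occupy concurrency slots
    """
    if limit == 0:
        return True
    occupied = running + (lock_waiting if count_lock_waiters else 0)
    return occupied < limit


# Thread 1 is idle inside its transaction; threads 2..129 wait for its row lock.
# Can the commit issued by thread 1 enter the engine?
commit_blocked = not can_enter_engine(128, 0, 128, count_lock_waiters=True)
commit_allowed = can_enter_engine(128, 0, 128, count_lock_waiters=False)
```

In this model, counting the lock waiters blocks the commit that would release them, while excluding them lets the commit through.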

Figure 2 shows this state.

Summary after reading MySQL (3)
At this point InnoDB cannot respond to any request, and the whole system is locked. Moreover, since all threads are waiting, CPU usage is 0, which is obviously unreasonable. So it is both reasonable and necessary for InnoDB to decrement the concurrent thread count when a thread enters lock wait.

Although threads waiting for a lock are not counted, a thread that is actually executing a query is counted, such as the select sleep(100) from t statements in the first three sessions of the example above.

In this example, the number of statements executing at the same time exceeds the configured innodb_thread_concurrency, and the system is effectively no longer usable. Yet a select 1 probe would still report the system as normal.

Therefore, the select 1 judgment logic needs to be changed.

Table lookup judgment

To detect unavailability caused by too many concurrent threads in InnoDB, we need a probe that actually accesses InnoDB. The usual approach is to create a table in the system database (the mysql database), for example health_check, put a single row in it, and run the following statement periodically:

mysql> select * from mysql.health_check; 

With this method, we can detect the case where the database is unavailable because of too many concurrent threads.

However, we immediately run into the next problem: when the disk is full, this method stops working.

An update transaction has to write binlog, and once the disk holding the binlog is 100% full, all update statements and transaction commit statements are blocked. The system can still read data normally, however.

So the monitoring statement needs to be improved. Next, let's see what happens if the query statement is changed into an update statement.

Update judgment

Since we are going to update, we need a meaningful field to update. A common practice is a timestamp field that records when the last detection ran. The update statement looks like this:

mysql> update mysql.health_check set t_modified=now();

Node availability detection should cover both the master and the standby. If the master is probed with an update, the standby should be probed with an update as well.

However, the detection update on the standby also writes binlog. Since the master-standby relationship between databases A and B is usually set up as a dual-M structure, the detection command executed on standby B is also sent back to master A.

If master A and standby B both use the same update command, a row conflict may occur, which may stop master-standby replication. So the mysql.health_check table cannot contain just one row.

To keep the master and standby updates from conflicting, we can store multiple rows in the mysql.health_check table and use the server_id of A and B as the primary key.

mysql> CREATE TABLE `health_check` (
  `id` int(11) NOT NULL,
  `t_modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

/*Test command*/
insert into mysql.health_check(id, t_modified) values (@@server_id, now()) on duplicate key update t_modified=now();

Because MySQL requires the server_id of the master and the standby to be different (otherwise an error is reported when the master-standby relationship is created), the detection commands of the master and the standby do not conflict.

Update judgment is a fairly common scheme, but it still has problems. Among them, “slow judgment” has long been a headache for DBAs.

You may wonder: if the update statement fails or times out, a master/standby switch can be initiated, so why would the judgment be slow?

This comes down to how the server allocates IO resources.

First, every detection needs a timeout N. If an update statement has not returned after N seconds, the system is considered unavailable.

Imagine a scenario where the IO utilization of the log disk is already 100%. The whole system responds very slowly, and a master/standby switch is needed.

But 100% IO utilization means the IO system is working, and every request still gets a chance to obtain IO resources and do its work. The detection update needs very few resources, so it may well commit as soon as it gets its turn and return to the detection system before the N-second timeout.

The detection system sees that the update did not time out and concludes that the system is normal.

In other words, normal SQL statements of the business are already running very slowly, but the DBA sees that the HA system is still working and believes the master is available.

The root cause is that all the methods above are based on external detection, and external detection is inherently random.

Because external detection polls periodically, a problem that has already occurred is not noticed until the next probe runs. And with bad luck, the first probe after the problem may also miss it, which makes the switch slow.

Internal statistics

For the disk utilization problem, if MySQL could tell us how long each internal IO request takes, our judgment of whether the database has a problem would be much more reliable.

In fact, since version 5.6 MySQL provides the performance_schema database, whose file_summary_by_event_name table records the time of each IO request.

The file_summary_by_event_name table has many rows. Let's first look at the row with event_name='wait/io/file/innodb/innodb_log_file'.

Summary after reading MySQL (3)

This row in the figure holds the statistics of redo log writes. The first column, EVENT_NAME, indicates the type of statistics.

The next three groups of data show the time statistics of redo log operations.

The first group of five columns covers all IO types. COUNT_STAR is the total number of IO operations, and the next four columns are the specific statistics, in picoseconds. The prefixes SUM, MIN, AVG and MAX mean, as the names suggest, sum, minimum, average and maximum.

The second group of six columns covers read operations. Its last column, SUM_NUMBER_OF_BYTES_READ, is the total number of bytes read from the redo log.

The third group of six columns covers write operations.

The last group covers other types of operations. For the redo log, you can think of them as the statistics of fsync.

In the file_summary_by_event_name table of performance_schema, binlog corresponds to the row with event_name = 'wait/io/file/sql/binlog'. The statistical logic of each field is exactly the same as for the redo log, so it is not repeated here.

Because performance_schema has to collect this information on every database operation, turning the statistics on costs some performance.

According to test results, turning on all performance_schema items lowers performance by about 10%. It is therefore recommended to turn on only the items you need. A specific item can be turned on or off in the following way.

If you want to turn on time monitoring of redo log, you can execute this statement:

mysql> update setup_instruments set ENABLED='YES', Timed='YES' where name like '%wait/io/file/innodb/innodb_log_file%';

Suppose the redo log and binlog statistics have both been turned on. How can this information be used to diagnose the state of the instance?

It is simple. You can use the value of MAX_TIMER_WAIT to judge whether the database has a problem. For example, set a threshold: a single IO request that takes more than 200 milliseconds counts as an exception. Because the timer columns are in picoseconds, 200 milliseconds is written as 200*1000000000. A statement like the following can then serve as the detection logic.

mysql> select event_name,MAX_TIMER_WAIT  FROM performance_schema.file_summary_by_event_name where event_name in ('wait/io/file/innodb/innodb_log_file','wait/io/file/sql/binlog') and MAX_TIMER_WAIT>200*1000000000;
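
The unit conversion behind the constant is worth spelling out: one millisecond is 10^9 picoseconds. The sketch below applies the same threshold on the client side; slow_io_events is a hypothetical helper, and it assumes the rows were fetched as (event_name, MAX_TIMER_WAIT) pairs without the MAX_TIMER_WAIT filter in the SQL.

```python
PICOSECONDS_PER_MILLISECOND = 10**9


def slow_io_events(rows, threshold_ms=200):
    """rows: iterable of (event_name, max_timer_wait_ps) pairs.

    Returns the event names whose slowest single IO request exceeded
    the threshold, mirroring the MAX_TIMER_WAIT > 200*1000000000 filter.
    """
    threshold_ps = threshold_ms * PICOSECONDS_PER_MILLISECOND
    return [name for name, max_wait_ps in rows if max_wait_ps > threshold_ps]
```

Keeping the threshold in milliseconds at the call site and converting in one place makes it harder to drop or add a zero by accident.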

After finding an exception and collecting the information you need, run the following statement:

mysql> truncate table performance_schema.file_summary_by_event_name;

This clears the previous statistics. If the exception occurs again later, it then shows up in the values accumulated from that point on.

Looking at locking from a dynamic point of view

Locking rules: two “principles”, two “optimizations” and one “bug”:

  • Principle 1: the basic unit of locking is the next-key lock. Remember that a next-key lock is an interval open on the left and closed on the right.
  • Principle 2: only objects accessed during the search are locked.
  • Optimization 1: for an equality query on an index, when the lock is taken on a unique index, the next-key lock degenerates into a row lock.
  • Optimization 2: for an equality query on an index, when traversing to the right and the last value does not satisfy the equality condition, the next-key lock degenerates into a gap lock.
  • One bug: a range query on a unique index accesses up to the first value that does not satisfy the condition.

Based on table t:

CREATE TABLE `t` (
  `id` int(11) NOT NULL,
  `c` int(11) DEFAULT NULL,
  `d` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `c` (`c`)
) ENGINE=InnoDB;

insert into t values(0,0,0),(5,5,5),

Equality queries inside inequality conditions

What is the difference between an equality query and a “traversal”?

Take an example and analyze the locking range of this query statement:

select * from t where id>9 and id<12 order by id desc for update;

Applying the locking rules above, the locking range of this statement on the primary key index is (0,5], (5,10] and (10,15). In other words, the row id=15 is not locked. Why?

The unit of locking is the next-key lock, an interval open on the left and closed on the right. Here, however, optimization 2 applies: for an equality query on the index, traversing to the right reaches id=15, which does not satisfy the condition, so the next-key lock degenerates into the gap lock (10,15).

But the where condition in this statement uses greater-than and less-than signs, so where does the “equality query” come from?

Remember that locking happens while the statement executes, so to analyze locking behavior we should start from the data structure of the index.

Diagram of the id index of table t:

Summary after reading MySQL (3)

  1. First, the statement is order by id desc, so to find all rows that satisfy the condition the optimizer must first locate “the first value with id<12”.
  2. This is done by searching the index tree. Inside the engine, the search actually looks for the value id=12; it does not find it, but it does find the gap (10,15).
  3. The scan then traverses to the left. This traversal is no longer an equality query, and it scans the row id=5, so the next-key lock (0,5] is added.

In other words, during execution, locating a record by tree search uses the “equality query” method.

The process of an equality query

What is the locking range of this statement?

select id from t where c in(5,20,10) lock in share mode;

This query uses in. Let's first look at the explain result of the statement.

Summary after reading MySQL (3)

As you can see, the in statement uses index c with rows=3, which indicates that the three values are each located by a B+ tree search.

When searching for c=5, the next-key lock (0,5] is taken first. Because c is not a unique index, the scan has to traverse to the right to confirm there are no other records with c=5, and it stops at c=10. This satisfies optimization 2, so the gap lock (5,10) is added.

Similarly, for c=10 the locking ranges are (5,10] and (10,15); for c=20 they are (15,20] and (20,25).
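
The pattern in these three cases is mechanical, so it can be written as a small model. This is a simplified sketch of principle 1 plus optimization 2 for an equality search on a non-unique index, not InnoDB code; equality_locks is a hypothetical name, and it assumes the searched value exists and has both a smaller and a larger neighbour in the index.

```python
from bisect import bisect_left, bisect_right


def equality_locks(index_values, v):
    """Locks taken by an equality search for v on a non-unique index.

    index_values -- sorted list of existing index values, containing v
    Returns (next_key_lock, gap_lock) as (low, high) pairs: the
    next-key lock is (low, high], the gap lock is (low, high).
    Assumes v is neither the smallest nor the largest value.
    """
    prev_value = index_values[bisect_left(index_values, v) - 1]
    next_value = index_values[bisect_right(index_values, v)]
    return (prev_value, v), (v, next_value)
```

For index values 0, 5, 10, 15, 20, 25 (assumed here from the lock ranges quoted above), searching for 5 gives the next-key lock (0,5] and the gap lock (5,10).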

This statement takes three record locks on index c, in this order: first the record lock on c=5, then the record lock on c=10, and finally the record lock on c=20.

Isn't this locking range just (5,25) with the row lock on c=15 removed? Why go to the trouble of describing it piece by piece?

  • Because these locks are “added one by one during execution”, not all at once.

Analyze the deadlock problem in the following example:

select id from t where c in(5,20,10) order by c desc for update;

What is the locking range this time?

Gap locks do not conflict with each other, but both statements take record locks on the three rows c=5, 10 and 20 on index c.

Because this statement is order by c desc, it takes the three record locks in the order c=20, then c=10, and finally c=5.

In other words, the two statements lock the same resources but in opposite order. When they run concurrently, a deadlock may occur.
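
Whether two lock sequences can deadlock comes down to whether some pair of shared locks is taken in opposite order. The check below is a minimal sketch with a hypothetical function name, and it assumes the locks of the two sessions conflict with each other, as a shared and an exclusive record lock on the same row do.

```python
def can_deadlock(order_a, order_b):
    """True if two sessions taking locks in these orders can deadlock.

    order_a, order_b -- the keys each session locks, in acquisition order.
    Assumes that locks on the same key conflict between the sessions.
    """
    pos_b = {key: i for i, key in enumerate(order_b)}
    shared = [key for key in order_a if key in pos_b]
    # Any pair taken in opposite order allows a wait-for cycle.
    return any(pos_b[x] > pos_b[y]
               for i, x in enumerate(shared) for y in shared[i + 1:])
```

For the two statements here, the orders are 5, 10, 20 and 20, 10, 5, so the check reports a possible deadlock; if both sessions locked in the same order it would not.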

MySQL keeps only the scene of the most recent deadlock, and even that scene is incomplete.

How do we read deadlock information?

Figure 3 shows part of the output of the show engine innodb status command after the deadlock. The command outputs a lot of information, including a section named LATEST DETECTED DEADLOCK, which records the most recent deadlock.

Summary after reading MySQL (3)

Let's look at the key information in this figure.

  1. The result is divided into three parts:
    • (1) TRANSACTION is the information of the first transaction;
    • (2) TRANSACTION is the information of the second transaction;
    • WE ROLL BACK TRANSACTION (1) is the final outcome: the first transaction was rolled back.
  2. In the information of the first transaction:
    • WAITING FOR THIS LOCK TO BE GRANTED shows the lock this transaction is waiting for;
    • index c of table test.t means it is waiting for a lock on index c of table t;
    • lock mode S waiting means the statement wants to take a read lock and is currently waiting;
    • Record lock means this is a record lock;
    • n_fields 2 means the record has two columns, field c and the primary key field id;
    • 0: len 4; hex 0000000a; asc ;; is the first field, c. The value is hexadecimal a, which is 10;
    • 1: len 4; hex 0000000a; asc ;; is the second field, the primary key id, and its value is also 10;
    • The asc in these two lines means the “printable characters” in the value are printed next; 10 is not a printable character, so a space is shown.
    • The first transaction's information shows only the lock it is waiting for, the lock on the row (c=10, id=10).
    • Since there is a deadlock, this transaction must also hold other locks, but they are not shown. We can infer them from the information of the second transaction.
  3. The second transaction shows more information:
    • HOLDS THE LOCK(S) shows which locks the transaction holds;
    • index c of table test.t means the lock is on index c of table t;
    • hex 0000000a and hex 00000014 mean the transaction holds two record locks, c=10 and c=20;
    • WAITING FOR THIS LOCK TO BE GRANTED means it is waiting for the record lock on (c=5, id=5).
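
Reading the hex values by eye is error-prone, so a one-line helper can do it. This sketch takes the hex digits at face value, exactly as the walkthrough above does; decode_int_field is a hypothetical name and the helper assumes small non-negative values like the ones in this example.

```python
def decode_int_field(hex_text):
    """Decode the hex digits printed for an integer field in the deadlock
    log, e.g. '0000000a' taken from '0: len 4; hex 0000000a; asc ;;'.

    Assumes a small non-negative value, as in this example.
    """
    return int(hex_text, 16)
```

Applied to the two values held by the second transaction, 0000000a decodes to 10 and 00000014 decodes to 20.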

From the above information, we know:

  1. The lock in share mode statement holds the record lock on c=5 and is waiting for the lock on c=10;
  2. The for update statement holds the record locks on c=20 and c=10 and is waiting for the record lock on c=5.

Hence the deadlock. Two conclusions follow:

  1. Since locks are taken one at a time, to avoid deadlocks, access the same group of resources in the same order wherever possible;
  2. At the moment of deadlock, the for update statement holds more resources and would cost more to roll back, so InnoDB chooses to roll back the cheaper lock in share mode statement.

How to read lock waits?

Summary after reading MySQL (3)

As you can see, session A does not lock the record c=10, so session B can delete the row id=10. But afterwards, session B cannot insert a row with id=10 back.

Let's look at the result of show engine innodb status to see whether it gives any hints. Lock information is in the TRANSACTIONS section of the command output.

Summary after reading MySQL (3)

Here are the key pieces of information.

  • index PRIMARY of table test.t means the statement is blocked by a lock on the primary key of table t.
  • lock_mode X locks gap before rec insert intention waiting
    • insert intention means the current thread is about to insert a record; this is an insert intention lock. For ease of understanding, you can think of it as the insert itself.
    • gap before rec means this is a gap lock, not a record lock.
  • So which record is this gap in front of? The five lines numbered 0 to 4 that follow describe that record.
  • n_fields 5 indicates that the record has five columns:
    • 0: len 4; hex 0000000f; asc ;; the first column is the primary key id, and hexadecimal f is id=15. So the gap is the one before id=15, and since id=10 no longer exists, it is (5,15).
    • 1: len 6; hex 000000000513; asc ;; the second column is the 6-byte transaction id, showing that the row was last modified by the transaction with trx id 1299.
    • 2: len 7; hex b0000001250134; asc % 4;; the third column is 7 bytes of rollback segment information. Here asc is followed by visible content (% and 4) because those bytes happen to be printable characters.
    • The last two columns are the values of c and d, both 15.

Because the delete removed the row id=10, the two gaps (5,10) and (10,15) merged into a single gap (5,15).
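
The merge is easy to see if gaps are derived from the keys rather than stored. A minimal sketch, with a hypothetical helper name and the id values assumed from the ranges discussed in this section:

```python
def gaps(keys):
    """Return the open gaps between consecutive index keys."""
    keys = sorted(keys)
    return list(zip(keys, keys[1:]))


before = gaps([0, 5, 10, 15, 20, 25])
after = gaps([0, 5, 15, 20, 25])  # the row id=10 has been deleted
```

Before the delete the list contains (5, 10) and (10, 15); afterwards both are gone and (5, 15) takes their place, although nobody took a new lock.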

Here we can consider how two phenomena are related:

  1. After session A executes its select statement it does nothing further, yet its locking range suddenly “becomes larger”;
  2. When select * from t where c>=15 and c<=20 order by c desc lock in share mode; scans left to c=10, it takes the lock (5,10].

In other words, a “gap” is actually defined by the record on its right.

An example of update

Let’s take another example of an update statement

Summary after reading MySQL (3)

You can analyze it yourself: the locking range of session A on index c is (5,10], (10,15], (15,20], (20,25] and (25,supremum].

Note: under the condition c>5, the first record found is c=10, so the next-key lock (0,5] is not added.

After that, the first update statement of session B, which changes c=5 to c=1, can be understood as two steps:

  1. Insert the record (c=1, id=5);
  2. Delete the record (c=5, id=5).

The gap (5,10) on index c is defined by the record on its right, c=10. So after this operation the gap held by session A becomes (1,10), and its locking range becomes what Figure 7 shows:

Summary after reading MySQL (3)

Next, Session B needs to execute the statement update t set C = 5 where C = 1. Similarly, it can be divided into two steps:

  1. Insert the record (C = 5, id = 5);
  2. Delete (C = 1, id = 5) this record.

The first step is trying to insert data into (1,10) which has been added clearance lock, so it is blocked.
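The idea that a gap is defined by the record on its right can be illustrated with a toy model. This Python sketch is not how InnoDB is implemented: the class IndexC and its methods are hypothetical, it models only gap locks on a single index, and it ignores record locks and the id part of the index entry.

```python
import bisect

SUPREMUM = float("inf")  # stand-in for InnoDB's supremum pseudo-record


class IndexC:
    """Toy model of index c: sorted keys plus gap locks keyed by the right-hand record."""

    def __init__(self, keys):
        self.keys = sorted(keys) + [SUPREMUM]
        self.gap_locked_before = set()  # records whose left-hand gap is locked

    def lock_gap_before(self, right_key):
        self.gap_locked_before.add(right_key)

    def gap_before(self, right_key):
        """Return the current open interval to the left of right_key."""
        pos = bisect.bisect_left(self.keys, right_key)
        left = self.keys[pos - 1] if pos > 0 else float("-inf")
        return (left, right_key)

    def insert(self, key):
        """Insert key unless the gap it would land in is locked."""
        pos = bisect.bisect_right(self.keys, key)
        if self.keys[pos] in self.gap_locked_before:
            return False  # blocked by the gap lock
        bisect.insort(self.keys, key)
        return True

    def delete(self, key):
        self.keys.remove(key)


idx = IndexC([0, 5, 10, 15, 20, 25])
# Session A: gaps before 10, 15, 20, 25 and supremum are locked.
for right in (10, 15, 20, 25, SUPREMUM):
    idx.lock_gap_before(right)

# Session B, first update (c=5 -> c=1): insert c=1, then delete c=5.
first_update_ok = idx.insert(1)
idx.delete(5)
gap_after_first_update = idx.gap_before(10)

# Session B, second update (c=1 -> c=5): the insert of c=5 is blocked.
second_insert_ok = idx.insert(5)
print(first_update_ok, gap_after_first_update, second_insert_ok)
```

Because the lock is attached to the record c=10 and not to fixed endpoints, deleting c=5 widens the locked gap from (5,10) to (1,10) without session A doing anything.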

Question 1:
In the deadlock log, lock_mode X waiting is gap lock plus row lock. Is it only lock_mode X locks rec but not gap, with the “but not gap” part, that is a pure row lock?

  • Yes. lock_mode X waiting means next-key lock;
    lock_mode X locks rec but not gap means row lock only;
    there is also “locks gap before rec”, which means gap lock only.

Question 2:
An empty table has only one gap. For example, on an empty table:

select * from t where id>1 for update;

The locking range of this query statement is the next-key lock (-∞, supremum].

You can verify this with the following sequence of operations; the result is shown in Figure 4.

[Figure: verification sequence on the empty table]

[Figure 4: result of the verification]

What else can we do after deleting data by mistake?

Mistaken data deletion related to MySQL falls into several categories:

  1. Using a delete statement to delete data rows by mistake;
  2. Using a drop table or truncate table statement to delete a table by mistake;
  3. Using a drop database statement to delete a database by mistake;
  4. Using the rm command to delete an entire MySQL instance by mistake.

Deleting rows by mistake

If data rows are deleted by mistake with a delete statement, a flashback tool can be used to recover the data.

The principle of flashback is to modify the binlog content and replay it against the original database. The prerequisite for this scheme is binlog_format=row and binlog_row_image=FULL.

When recovering data, a single transaction is processed as follows:

  1. For an insert statement, the corresponding binlog event type is Write_rows event; change it to a Delete_rows event;
  2. Similarly, for a delete statement, change the Delete_rows event to a Write_rows event;
  3. For Update_rows, the binlog records the row values before and after the modification, so it is enough to swap these two row images.

What if the misoperation involves more than one transaction? For example, the following three transactions:

(A)delete ...
(B)insert ...
(C)update ...

Now we need to restore the database to its state before these three transactions ran. After the flashback tool parses the binlog, the commands written back to the main library are:

(reverse C)update ...
(reverse B)delete ...
(reverse A)insert ...

That is, if the mistaken deletion involves multiple transactions, the transaction order must be reversed before execution.
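The inversion rules and the reversed order can be sketched in a few lines of Python. This is only a toy model of the idea, not the code of any real flashback tool: RowEvent, invert_event, flashback and apply are hypothetical names, a table is modeled as a dict keyed by id, and reversing the events inside each transaction is an assumption added here for completeness.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowEvent:
    kind: str                      # "write", "delete" or "update"
    before: Optional[dict] = None  # row image before the change
    after: Optional[dict] = None   # row image after the change


def invert_event(ev: RowEvent) -> RowEvent:
    """Write <-> delete; for update, swap the before and after images."""
    if ev.kind == "write":
        return RowEvent("delete", before=ev.after)
    if ev.kind == "delete":
        return RowEvent("write", after=ev.before)
    if ev.kind == "update":
        return RowEvent("update", before=ev.after, after=ev.before)
    raise ValueError(f"unknown event kind: {ev.kind}")


def flashback(transactions):
    """Invert every event and reverse both transaction order and event order."""
    return [[invert_event(ev) for ev in reversed(trx)]
            for trx in reversed(transactions)]


def apply(table, transactions):
    """Apply row events to a table modeled as {id: row}."""
    for trx in transactions:
        for ev in trx:
            if ev.kind == "write":
                table[ev.after["id"]] = dict(ev.after)
            elif ev.kind == "delete":
                del table[ev.before["id"]]
            else:  # update
                del table[ev.before["id"]]
                table[ev.after["id"]] = dict(ev.after)


original = {1: {"id": 1, "c": 1}, 2: {"id": 2, "c": 2}}
table = copy.deepcopy(original)

misoperations = [
    [RowEvent("delete", before={"id": 1, "c": 1})],                          # (A)
    [RowEvent("write", after={"id": 3, "c": 3})],                            # (B)
    [RowEvent("update", before={"id": 2, "c": 2}, after={"id": 2, "c": 20})],  # (C)
]
apply(table, misoperations)

undo = flashback(misoperations)
undo_kinds = [trx[0].kind for trx in undo]
apply(table, undo)
print(undo_kinds, table == original)
```

The undo list comes out as update, delete, insert, matching the (reverse C), (reverse B), (reverse A) order above.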

It is not recommended to perform these operations directly on the main library.

A safer way to recover data is to restore a backup, or to take a slave library as a temporary library, perform these operations on the temporary library, and then copy the confirmed data from the temporary library back to the main library.

The reason is that on a main library running online logic, changes in data state are often interrelated. The data problem may have been discovered a little late, so the business code may already have modified other data on top of the earlier misoperation. If these rows are then restored in isolation without confirmation, the data may suffer secondary damage.

Prevention in advance matters even more. There are two suggestions:

  1. Set the sql_safe_updates parameter to on. Then, if we forget the where condition in a delete or update statement, or the where condition does not contain an indexed field, the statement reports an error.
  2. Before code goes online, it must pass a SQL audit.

With sql_safe_updates=on set, what should you do if you really want to delete all the data in a small table?

If you are sure the deletion is fine, add a where condition to the delete statement, such as where id>=0.

However, deleting a whole table with delete is slow: it has to generate rollback logs, write redo logs and write binlog. So from a performance perspective, prefer the truncate table or drop table command.

Data deleted with the delete command can still be recovered by flashback. Data deleted with truncate table, drop table or drop database cannot. Why?

This is because even with binlog_format=row configured, the binlog records these three commands in statement format. The binlog contains only a single truncate/drop statement, from which the data cannot be recovered.

Deleting a database or table by mistake

In this case, recovering the data requires a full backup plus incremental logs. This approach requires the online system to take regular full backups and to back up the binlog in real time.

With both conditions met, if someone accidentally deletes a database at 12:00 noon, the recovery process is as follows:

  1. Take the latest full backup; assuming the database is backed up once a day, the last backup was taken at 0:00 the same day;
  2. Restore a temporary library from the backup;
  3. From the log backup, take the logs after 0:00;
  4. Apply all of these logs to the temporary library, except for the statement that deleted data by mistake.

The flow chart is as follows:

[Figure: flow chart of recovery from a full backup plus binlog]

Some notes on this process:

  1. To speed up data recovery, if the temporary library contains multiple databases, you can add the --database parameter to the mysqlbinlog command to specify the database that contains the mistakenly deleted table. This avoids applying the logs of other databases during recovery.
  2. When applying the logs, you need to skip the binlog of the misoperation statement at 12:00:
    • If the original instance does not use GTID mode, then when applying the binlog file that contains the 12:00 operation, first use the --stop-position parameter to execute up to the log just before the misoperation, then use --start-position to continue from the log just after the misoperation;
    • If the instance uses GTID mode, it is much more convenient. Assuming the GTID of the misoperation is gtid1, you only need to execute set gtid_next=gtid1;begin;commit; to add this GTID to the GTID set of the temporary instance first. Then, when the binlog is applied in order, the misoperation statement is skipped automatically.
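Both ways of skipping the misoperation can be sketched as a toy replay loop in Python. The event tuples, GTID strings, positions and function names below are made up for illustration; real binlog events and mysqlbinlog behave in far more detail than this.

```python
def replay_with_gtid(binlog, executed_gtids):
    """Apply events in order, skipping any GTID already in the executed set."""
    applied = []
    for gtid, _pos, statement in binlog:
        if gtid in executed_gtids:
            continue  # counts as already executed, so it is skipped
        executed_gtids.add(gtid)
        applied.append(statement)
    return applied


def replay_by_position(binlog, stop_position, start_position):
    """Two passes: everything before stop_position, then from start_position on."""
    first_pass = [s for _g, pos, s in binlog if pos < stop_position]
    second_pass = [s for _g, pos, s in binlog if pos >= start_position]
    return first_pass + second_pass


# Hypothetical binlog: (gtid, start position, statement).
binlog = [
    ("uuid:1", 100, "insert into t values (1)"),
    ("uuid:2", 200, "drop table t"),            # the misoperation
    ("uuid:3", 300, "insert into t2 values (1)"),
]

# GTID mode: an empty transaction for uuid:2 was committed beforehand,
# so uuid:2 is already in the executed set of the temporary instance.
by_gtid = replay_with_gtid(binlog, {"uuid:2"})

# Non-GTID mode: stop just before position 200, resume at position 300.
by_position = replay_by_position(binlog, stop_position=200, start_position=300)
print(by_gtid)
print(by_position)
```

In both cases the drop statement is never applied to the temporary library; the GTID variant simply needs no position bookkeeping.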

Even so, recovering data with mysqlbinlog is still not fast enough, for two main reasons:

  1. If a table was deleted by mistake, it is best to restore only that table, that is, to replay only the operations on that table. However, the mysqlbinlog tool cannot be told to parse the log of a single table only;
  2. When logs parsed by mysqlbinlog are applied, the apply process can only run in a single thread.

One way to speed this up is to set the temporary instance, once it has been restored from backup, as a slave of the online standby library:

  1. Before start slave, first execute the change replication filter replicate_do_table = (tbl_name) command, so that the temporary library synchronizes only the misoperated table;
  2. This way, parallel replication can also be used to speed up the whole data recovery process.

The schematic diagram of this process is as follows:

[Figure: recovery with the temporary instance attached as a slave of the online standby library]

In the figure, there is a dotted line from the binlog backup system to the online standby library. It means that if the binlog needed by the temporary instance has already been deleted from the standby library because too much time has passed, we can fetch the needed binlog from the binlog backup system and put it back into the standby library.

Suppose the binlog needed by the temporary instance starts from master.000005, but the smallest binlog file listed by show binary logs on the standby library is master.000007. This means two binlog files are missing, and you need to get them from the binlog backup system.

The steps for putting the deleted binlog files back into the standby library are as follows:

  1. Download master.000005 and master.000006 from the backup system and put them in the log directory of the standby library;
  2. Open the master.index file and add two lines at the beginning of the file, with the contents “./master.000005” and “./master.000006”;
  3. Restart the standby library so that it recognizes these two log files again;
  4. Now the standby library has all the binlog files needed by the temporary library. After the master-slave relationship is established, synchronization can proceed normally.

Whether the binlog parsed by the mysqlbinlog tool is applied to the temporary library, or the temporary library is attached to the standby library, the two schemes have one thing in common: after a database or table is deleted by mistake, data is restored mainly from a backup plus the application of binlog.

In other words, both schemes require the backup system to back up the full logs regularly, and to ensure that a binlog has been backed up before it is deleted locally.

However, no system can keep logs indefinitely. The number of days to keep logs has to be set according to cost and disk space. If your DBA team tells you that an instance can be restored to any point within the last half month, it means the backup system keeps logs for at least half a month.

In addition, whichever method is used, it is recommended to build the data recovery function into an automated tool and to rehearse it frequently. Why?

There are two reasons:

  1. Although nobody wants this kind of thing to happen, if a mistaken deletion does occur, being able to recover the data quickly minimizes the loss, and nobody has to run away.
  2. If the operation is instead done by hand in a hurry and another mistake is made, causing secondary damage to the business, that is indefensible.

Delayed replication standby

Parallel replication speeds up data recovery, but this scheme still has the problem that the recovery time is uncontrollable.

If the backup of a library is particularly large, or the misoperation happens long after the previous full backup (for example, on an instance that is backed up once a week, the misoperation happens on the 6th day after the backup), then 6 days of logs have to be replayed, and the recovery time may be measured in days.

So, what can we do to shorten the time required to recover data?

If you have a very core business that cannot tolerate a long recovery time, consider building a standby library with delayed replication. This feature was introduced in MySQL 5.6.

The problem with the ordinary primary/standby replication structure is that if a table in the primary library is deleted by mistake, the command is passed to all the standby libraries very quickly, and the table is then deleted by mistake on all of them as well.

A delayed replication standby is a special kind of standby library: with the CHANGE MASTER TO MASTER_DELAY = N command, you can specify that the standby stays N seconds behind the primary.

For example, if N is set to 3600 and data is deleted by mistake on the primary, then as long as the misoperation is discovered within 1 hour, the command has not yet been executed on the delayed standby. At that point, execute stop slave on the standby, skip the misoperation command with the method described earlier, and the required data can be recovered.

In this way, you always have a temporary instance on hand that needs at most one more hour of catching up to recover the data, which shortens the whole data recovery time.

Methods to prevent a database or table from being deleted by mistake

The first suggestion is to separate accounts. The purpose is to avoid typing the wrong command. For example:

  • Give business developers only DML permission, without truncate/drop permission. If they have DDL requirements, they can get support through the development management system.
  • Even members of the DBA team should use read-only accounts in daily work, and use accounts with update permission only when necessary.

The second suggestion is to establish operating rules. The purpose is to avoid deleting the wrong table because of a mistyped table name. For example:

  • Before deleting a table, rename it first. Then observe for a period of time to make sure the business is not affected, and only then delete the table.
  • When renaming, a fixed suffix must be added to the table name (for example, _to_be_deleted), and the deletion itself must be performed through the management system. Moreover, the management system may only delete tables with this fixed suffix.

Deleting data with rm

In fact, for a MySQL cluster with a high availability mechanism, deleting data with rm is the thing to fear least. As long as the whole cluster is not deleted maliciously and only the data of one node is deleted, the HA system kicks in and selects a new main library, keeping the whole cluster working normally.

What we then need to do is recover the data on that node and attach it to the cluster again.

Of course, nowadays it is not only DBAs that have automation systems. SAs (system administrators) have them too, so a batch operation that takes machines offline might wipe out all the nodes of your entire MySQL cluster.

To deal with this situation, the only suggestion is to keep your backups across machine rooms, or preferably across cities.

Why are there statements that can’t be killed?

There are two kill commands in MySQL. One is kill query + thread id, which terminates the statement being executed in that thread. The other is kill connection + thread id, where connection can be omitted, which disconnects that thread's connection. Of course, if the thread has a statement in progress, that statement has to be stopped first.

There is a phenomenon you may have seen: the kill command is used, but the connection is not broken. When you then execute show processlist, the Command column of this statement displays Killed.

What does it mean to display Killed? Shouldn’t this thread simply be invisible in the result of show processlist?

  • In fact, in most cases the kill query / connection command is effective. For example, if a query runs too long and we want to give it up, we can use the kill query command to terminate it.
  • In another case, when the statement is waiting for a lock, the kill command is also effective. Let’s look at this example:

[Figure 1: kill query taking effect on a statement that is waiting for a lock]

You can see that after session C executes kill query, session B reports almost immediately that the statement was interrupted, which is the expected result.

After receiving the kill, what does the thread do?

Does session B terminate the thread outright and exit regardless of everything else? Obviously, that would not work.

When a table is inserted into, deleted from, updated or queried, an MDL read lock is added to the table. So although session B is blocked, it still holds an MDL read lock. If the thread were terminated outright when killed, this MDL read lock would never get a chance to be released.

So kill does not mean stopping immediately. Instead, it tells the executing thread that this statement no longer needs to run and that it can start its “stop execution” logic.

This is similar to the Linux kill command: kill -N pid does not stop the process directly, but sends a signal to the process, and the process then handles the signal and enters its termination logic. The difference is that for MySQL’s kill command you do not pass a signal number; there is only the one “stop” instruction.

In the implementation, when the user executes kill query thread_id_B, the thread that handles the kill command in MySQL does two things:

  1. Change the running state of session B to THD::KILL_QUERY (assign THD::KILL_QUERY to the variable killed);
  2. Send a signal to the execution thread of session B.

Why send a signal?

  • Because in the example in Figure 1, session B is waiting for a lock. If the thread state of session B were merely set to THD::KILL_QUERY, thread B would not know about the state change and would keep waiting. The purpose of the signal is to make session B leave the wait so that it can handle the THD::KILL_QUERY state.

The analysis above implies three things:

  1. A statement has many “buried points” during execution, places where the thread state is checked. If the thread state is found to be THD::KILL_QUERY at one of them, the statement termination logic starts;
  2. If the thread is waiting, the wait must be one that can be awakened; otherwise execution never reaches a “buried point”;
  3. Going from the start of the termination logic to its completion is itself a process that takes time.
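The combination of “set a flag” and “wake the waiter” can be sketched with Python threads. This is an analogy only, not MySQL's implementation: the Session class, its attributes and kill_query are hypothetical names, and a threading.Condition stands in for the wake-able lock wait.

```python
import threading


class Session:
    """Toy session whose statement waits for a lock that is never granted."""

    def __init__(self):
        self.killed = False            # stand-in for the thread's killed state
        self.lock_granted = False
        self.cond = threading.Condition()
        self.result = None

    def wait_for_row_lock(self):
        with self.cond:
            while not self.lock_granted:
                if self.killed:        # a "buried point": check the state
                    self.result = "interrupted"
                    return
                # A wait that can be awakened; 50 s is far longer than the test.
                self.cond.wait(timeout=50)
            self.result = "got lock"


def kill_query(session):
    with session.cond:
        session.killed = True          # step 1: change the state
        session.cond.notify_all()      # step 2: wake the waiting thread


session = Session()
worker = threading.Thread(target=session.wait_for_row_lock)
worker.start()
kill_query(session)
worker.join(timeout=5)
print(session.result, worker.is_alive())
```

Without the notify_all call, the worker would only notice the flag when its wait timed out, which is the reason the signal is needed in addition to the state change.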

Let’s look at another example, where innodb_thread_concurrency is not enough:

First, execute set global innodb_thread_concurrency=2 to set the maximum number of concurrent InnoDB threads to 2, and then execute the following sequence:

[Figure: statement sequence of sessions A to E with innodb_thread_concurrency=2]

We can see that:

  1. Session C is blocked during execution;
  2. The kill query C command executed by session D has no effect;
  3. Only when session E executes the kill connection command is session C disconnected, with the message “Lost connection to MySQL server during query”;
  4. But if you execute show processlist in session E at this point, you see the following figure.

[Figure: show processlist result after kill connection]

At this point, the Command column of the thread with id=12 shows Killed. In other words, although the client has been disconnected, the statement is in fact still executing on the server.

Why does this statement not exit when kill query is executed, the way the update statement in the first example did?

In the implementation, waiting for a row lock uses the pthread_cond_timedwait function, and this wait can be awakened. In this example, however, the waiting logic of thread 12 is: every 10 milliseconds, check whether it can enter InnoDB for execution; if not, call the nanosleep function to go to sleep.

That is, although the state of thread 12 has been set to KILL_QUERY, the loop that waits to enter InnoDB never checks the thread state, so the thread never enters the termination logic.

When session E executes the kill connection command, it does the following:

  1. Set the state of thread 12 to KILL_CONNECTION;
  2. Close the network connection of thread 12. This is why session C receives the disconnection message.

Why is the Command column shown as Killed when you execute show processlist? Because show processlist has a special piece of logic:

If the state of a thread is KILL_CONNECTION, the Command column is displayed as Killed.

So even though the client has exited, the thread is still waiting. When will this thread exit?

  • Only when the conditions for entering InnoDB are met does the query statement of session C continue to execute. Only then can it detect that the thread state has become KILL_QUERY or KILL_CONNECTION and enter the termination logic.

This example is the first kind of case where kill does not take effect: the thread has not reached the logic that checks the thread state. A similar case is when IO pressure is so high that the IO read/write functions keep failing to return, so the thread state cannot be checked in time.

The other kind is that the termination logic takes a long time. In this case show processlist also shows Command=Killed, and the statement is really finished only when the termination logic completes. Common scenarios are:

  1. A very large transaction is killed during execution. The rollback has to reclaim all the new data versions generated during the transaction, which takes a long time.
  2. Rollback of a large query. If the query generated large temporary files and the file system is under pressure, deleting the temporary files may have to wait for IO resources, which takes a long time.
  3. A DDL command is killed in its final stage. The temporary files of the intermediate process need to be deleted, which may also take a long time because of IO resources.

Also, pressing Ctrl+C on the client cannot terminate the thread directly.

There is a misunderstanding here. An operation on the client can only affect the client's own thread. The client and the server interact only through the network, and the client cannot operate on a server thread directly.

Since MySQL uses a stop-and-wait protocol, sending further commands on this connection is useless until the statement being executed by this thread returns. What actually happens when you press Ctrl+C is that the MySQL client opens another connection and sends a kill query command.

So do not assume that Ctrl+C settles everything: killing a thread still involves a lot of work on the server side.

Two other misconceptions about clients

The first misconception is that if a library has too many tables, connecting becomes slow.

Some online libraries contain many tables. In that case you will find that every time you connect with the client, it hangs at the screen below.

[Figure: client output while connecting to the db1 library with default parameters]

If the db1 library has few tables, the connection is fast and you get to the command prompt almost immediately. So some people conclude that the number of tables affects connection performance.

When a client establishes a connection with the server, all that has to be done is the TCP handshake, user verification and obtaining permissions. These operations obviously have nothing to do with the number of tables in the library.

But as the text prompt in the figure says, when connecting with the default parameters, the MySQL client provides local auto-completion of database names and table names. To implement this feature, the client does some extra work after the connection succeeds:

  1. Execute show databases;
  2. Switch to the db1 library and execute show tables;
  3. Use the results of these two commands to build a local hash table.

Among these operations, the most time-consuming is the third step, building the local hash table. So when a library has many tables, this step takes a long time.
In other words, the slowness we perceive is not a slow connection and not a slow server, but a slow client.

The prompt in the figure also says that if you add -A to the connection command, auto-completion is turned off and the client returns quickly.

The effect of auto-completion is that when you type a database name or table name, you can press the Tab key to complete the name or display suggestions.

In actual use, if you do not rely on auto-completion much, it is recommended to add -A by default every time.

What the prompt does not say is that, besides -A, adding the --quick (or -q) parameter also skips this stage. However, --quick is a parameter that easily causes misunderstanding, and it is another common misconception about the client.

Seeing the name, you might think this parameter speeds up the server. In fact it is the opposite: setting it may reduce the performance of the server. Why?

After the MySQL client sends a request, there are two ways to receive the result returned by the server:

  1. One is to cache locally, that is, to allocate a piece of local memory and store the result there first. If you develop against the API, this corresponds to the mysql_store_result method.
  2. The other is not to cache, and to process each row as it is read. If you develop against the API, this corresponds to the mysql_use_result method.

The MySQL client uses the first method by default. If the --quick parameter is added, the second, uncached method is used.

With the uncached method, if local processing is slow, the server is blocked while sending results, so the server slows down.

In that case, why is the parameter called quick? Because using it achieves the following three effects:

  • First, it skips the auto-completion of table names.
  • Second, mysql_store_result has to allocate local memory to cache query results. If the result is very large, this consumes a lot of local memory and may affect the performance of the client machine;
  • Third, executed commands are not recorded in the local command history file.

So the --quick parameter means making the client faster.
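The trade-off between the two receiving modes can be shown with a toy timing model in Python. Everything here is an assumption made for illustration: the function fetch_all, the tick costs and the row count are invented, and the model ignores network buffers; it only shows the ordering of fetching and processing.

```python
def fetch_all(row_count, process_ticks, buffered):
    """Toy timing model of receiving a result set.

    buffered=True  mimics caching every row first, then processing.
    buffered=False mimics processing each row before fetching the next.
    Returns (tick at which the server has handed over its last row,
             tick at which the client has finished processing,
             peak number of rows cached on the client).
    """
    clock = 0
    server_done_at = 0
    cached = []
    peak_cached_rows = 0
    for row in range(row_count):
        clock += 1                      # one tick to receive one row
        server_done_at = clock          # server is busy until its last row is taken
        if buffered:
            cached.append(row)
            peak_cached_rows = max(peak_cached_rows, len(cached))
        else:
            clock += process_ticks      # slow client logic delays the next fetch
    if buffered:
        clock += process_ticks * len(cached)
    return server_done_at, clock, peak_cached_rows


buffered = fetch_all(row_count=3, process_ticks=5, buffered=True)
unbuffered = fetch_all(row_count=3, process_ticks=5, buffered=False)
print(buffered, unbuffered)
```

In this model the client finishes at the same time either way, but in the uncached mode the server is kept waiting much longer, while in the cached mode the client holds the whole result in memory.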

Question 1: If a killed transaction stays in the rollback state for a long time, then from the perspective of recovery speed, should you let it run to completion on its own, or forcibly restart the entire MySQL process?

Because the rollback still has to be done after a restart, from the perspective of recovery speed you should let it finish on its own.

Of course, if this statement may be holding other locks, or consumes so much IO that it affects the execution of other statements, you need to do a primary/standby switch first and let the new primary library provide service.

After the switch, the other threads are disconnected and their execution stops automatically. Then just wait for the rollback to finish on its own.

When querying a large amount of data, will the database run out of memory?

The host has only 100 GB of memory. If you scan a 200 GB table, will the memory of the database host be used up?

This is really worth worrying about; being OOM-killed (out of memory) by the system is no joke. But on the other hand, a logical backup scans the whole database, doesn’t it? If that ate up all the memory, logical backups would have stopped working long ago.

So a full table scan on a large table should be fine. But what does the process actually look like?

The influence of a full table scan on the server layer

Suppose you want to perform a full table scan on a 200 GB InnoDB table db1.t. If you want to save the scan result on the client, you would use a command like this:

mysql -h$host -P$port -u$user -p$pwd -e "select * from db1.t" > $target_file

InnoDB data is stored in the primary key index, so a full table scan is actually a direct scan of the primary key index of table t. This query has no other conditions, so every row found can be put directly into the result set and returned to the client.

So, where is this “result set” stored?

In fact, the server does not need to keep a complete result set. The process of fetching and sending data is as follows:

  1. Get a row and write it into net_buffer. The size of this memory is defined by the parameter net_buffer_length, and the default is 16 KB.
  2. Keep getting rows until net_buffer is full, then call the network interface to send it out.
  3. If the send succeeds, clear net_buffer, then continue to take the next row and write it into net_buffer.
  4. If the send function returns EAGAIN or WSAEWOULDBLOCK, the local network stack (socket send buffer) is full, so wait; continue sending once the network stack is writable again.

The corresponding flow chart is as follows:

[Figure 1: flow chart of fetching rows into net_buffer and sending them to the client]

From this process, you can see that:

  1. While sending a query result, the most memory MySQL occupies internally is net_buffer_length, which never reaches 200 GB;
  2. The socket send buffer cannot reach 200 GB either (its default size is defined in /proc/sys/net/core/wmem_default). If the socket send buffer is full, the process of reading data is paused.
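The fetch-and-send loop can be sketched in Python to show why memory stays bounded. This is a simplified model under stated assumptions, not the server's code: FakeSocket, stream_table and flush are hypothetical names, rows are represented only by their sizes, and “waiting until the network stack is writable” is simulated by letting the client drain the socket.

```python
class FakeSocket:
    """Toy socket send buffer with a fixed capacity in bytes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = 0             # bytes sitting in the send buffer
        self.delivered = 0           # bytes the client has read
        self.would_block_count = 0   # how often a send hit a full buffer

    def try_send(self, nbytes):
        if self.pending + nbytes > self.capacity:
            self.would_block_count += 1   # analogous to EAGAIN
            return False
        self.pending += nbytes
        return True

    def client_reads(self):
        self.delivered += self.pending
        self.pending = 0


def flush(nbytes, sock):
    # Retry until the send fits; draining stands in for waiting on the client.
    while not sock.try_send(nbytes):
        sock.client_reads()


def stream_table(row_sizes, net_buffer_length, sock):
    """Fill a bounded buffer row by row, sending whenever it is full.

    Returns the peak number of bytes ever held in the buffer.
    """
    used = 0
    peak = 0
    for size in row_sizes:
        if used and used + size > net_buffer_length:
            flush(used, sock)
            used = 0
        used += size
        peak = max(peak, used)
    if used:
        flush(used, sock)
    return peak


NET_BUFFER_LENGTH = 16 * 1024           # 16 KB, the default named in the text
sock = FakeSocket(capacity=32 * 1024)   # capacity chosen arbitrarily for the sketch
row_sizes = [100] * 1000                # 1000 rows of 100 bytes each
peak = stream_table(row_sizes, NET_BUFFER_LENGTH, sock)
total_sent = sock.delivered + sock.pending
print(peak, total_sent, sock.would_block_count)
```

However large the table is, the buffer on the server side never holds more than net_buffer_length bytes, and the scan simply pauses whenever the socket cannot accept more data.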

In other words, MySQL is “sending while reading”. This concept is very important. It means that if the client receives slowly, the MySQL server cannot send the results out, and the transaction takes longer to execute.

For example, in the state below, the client deliberately does not read the contents of the socket receive buffer, and this is what show processlist displays on the server.

[Figure 2: show processlist result with State “Sending to client”]

If you see that State stays at “Sending to client”, it means the network stack on the server side is full.

If the client uses the --quick parameter, the mysql_use_result method is used, which reads one row and processes one row. Imagine that the business logic is fairly complex: if the processing after each row is slow, the client takes a long time to fetch the next row, and the situation shown in Figure 2 may occur.

Therefore, for normal online business, if a query does not return many results, it is recommended to use the mysql_store_result interface and save the query result directly in local memory.

Of course, the premise is that the query does not return many results. If a large query is executed and the client ends up occupying nearly 20 GB of memory, you need to switch to the mysql_use_result interface.

On the other hand, if many threads on a MySQL instance you maintain are in the “Sending to client” state, you need to ask the business developers to optimize the query results and evaluate whether returning so many results is reasonable.

If you want to quickly reduce the number of threads in this state, setting the net_buffer_length parameter to a larger value is an option.

A state that looks very similar to “Sending to client” is “Sending data”, which is often misunderstood. A typical case: on an instance under maintenance, many query statements are in the “Sending data” state, yet the network checks out fine. Why does sending data take so long?

In fact, the state of a query statement changes as follows (note: other irrelevant states are omitted here):

  • After a MySQL query statement enters the execution phase, the state is first set to “Sending data”;
  • Then the information about the result columns (metadata) is sent to the client;
  • Then the statement continues to execute;
  • After execution completes, the state is set to an empty string.

In other words, “Sending data” does not necessarily mean that data is being sent; the statement may be at any stage of the executor. For example, you can construct a lock wait scenario and still see the Sending data state.

[Figure: statement sequence that makes session B wait for a lock]

[Figure: show processlist result with session B in the Sending data state]

As you can see, session B is clearly waiting for a lock, yet its state is Sending data.

That is, a thread displays “Sending to client” only when it is waiting for the client to receive results. If it displays “Sending data”, that only means “executing”.

Now we can see that query results are sent to the client in segments, so scanning a whole table and returning a large amount of data does not blow up the memory.

Influence of full table scan on InnoDB

One of the functions of InnoDB memory is to save the updated results. Combined with redo log, random disk writing is avoided.

The data pages in memory are managed in the buffer pool (BP). Under the WAL mechanism, the buffer pool speeds up updates. In fact, a more important function of the buffer pool is to speed up queries.

Because of the WAL mechanism, when a transaction commits, the data page on disk is still old. If a query then needs to read that data page, does the redo log have to be applied to the data page immediately?

  • The answer is no. The data page in memory already holds the latest result, so the query can read the memory page directly. The query does not need to read the disk at all; it takes the result straight from memory, which is very fast. Therefore, the buffer pool can also speed up queries.

However, how much the buffer pool speeds up queries depends on an important indicator: the memory hit rate.

You can view the current BP hit rate of a system in the result of show engine innodb status. In general, for a stable online service to meet its response time requirements, the memory hit rate should be above 99%.

When you execute show engine innodb status, you can see the words “Buffer pool hit rate”, which show the current hit rate. For example, the hit rate in Figure 5 is 99.0%.

If all the data pages needed by the query can be directly obtained from the memory, it is the best, and the corresponding hit rate is 100%. However, this is very difficult to achieve in actual production.

The size of the InnoDB buffer pool is determined by the parameter innodb_buffer_pool_size. Generally, it is recommended to set it to 60% ~ 80% of the available physical memory.

About ten years ago, the data volume of a single machine was hundreds of gigabytes, while the physical memory was a few gigabytes. Now, although many servers have 128 GB of memory or more, the data volume of a single machine has reached the terabyte level.

So it is common for innodb_buffer_pool_size to be smaller than the amount of data on disk. If the buffer pool is full and a data page has to be read from disk, an old data page must be evicted.

InnoDB memory management uses the least recently used (LRU) algorithm. The core of this algorithm is to evict the data that has gone unused the longest.

The figure below is a basic model of LRU algorithm.

InnoDB implements the LRU algorithm that manages the buffer pool with a linked list.

  1. In state 1 of Figure 6, the head of the linked list is P1, which means that P1 is the most recently accessed data page. Suppose memory can hold only this many data pages;
  2. A read request then accesses P3, so the list changes to state 2, with P3 moved to the head;
  3. State 3 shows an access to a data page that is not in the linked list, so a new data page Px must be allocated in the buffer pool and added to the head of the list. Because memory is full, no new memory can be allocated. Instead, the memory of data page Pm at the tail of the list is emptied, the contents of Px are stored in it, and the page is placed at the head of the list.
  4. In effect, data page Pm, which has gone unaccessed the longest, is evicted.
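
The three states above can be sketched in a few lines of Python. This is only a toy model, not InnoDB code: the class name SimpleLRU, the capacity of 4 pages, and the page names P1, P2, P3, Pm, and Px are assumptions chosen to mirror the description.

```python
from collections import OrderedDict


class SimpleLRU:
    """Toy model of the basic LRU list: most recently used page at the head."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # first key = head, last key = tail

    def access(self, page):
        """Touch a page; return the evicted page, or None if nothing was evicted."""
        evicted = None
        if page not in self.pages:
            if len(self.pages) >= self.capacity:
                evicted, _ = self.pages.popitem(last=True)  # evict the tail page
            self.pages[page] = True
        self.pages.move_to_end(page, last=False)  # hit or new page: move to the head
        return evicted

    def order(self):
        return list(self.pages)


lru = SimpleLRU(capacity=4)
for p in ["Pm", "P3", "P2", "P1"]:  # state 1: head is P1, tail is Pm
    lru.access(p)

lru.access("P3")                    # state 2: P3 moves to the head
state2 = lru.order()

evicted = lru.access("Px")          # state 3: Px is new, so the tail page is evicted
state3 = lru.order()

print(state2, evicted, state3)
```

Running a long loop of never-before-seen pages through this model pushes every existing page out of the list, which is the weakness a full table scan exposes.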

This algorithm looks fine, but what happens if you run a full table scan?

Suppose we use this algorithm to scan a 200 GB table, which is a historical data table that no business normally accesses.

Scanning with this algorithm evicts all the data currently in the buffer pool and replaces it with the data pages accessed during the scan. In other words, the buffer pool ends up holding mainly the data of the historical table.

That is bad for a database that is serving business traffic. You will see the buffer pool memory hit rate drop sharply, disk pressure increase, and SQL statement responses slow down.

Therefore, InnoDB cannot use this LRU algorithm directly. In fact, InnoDB improves on the LRU algorithm.

In the InnoDB implementation, the whole LRU linked list is divided into a young area and an old area at a ratio of 5:3. LRU_old in the figure points to the first position of the old area, which is 5/8 of the way down the list. In other words, the 5/8 of the list near the head is the young area, and the 3/8 near the tail is the old area.

The execution flow of the improved LRU algorithm is as follows.

  1. In Figure 7, state 1 accesses data page P3. Since P3 is in the young area, it is moved to the head of the linked list, just as in the LRU algorithm before optimization, giving state 2.
  2. Next, a new data page that is not in the current linked list is accessed. Data page Pm is still evicted, but the newly inserted data page Px is placed at LRU_old.
  3. Every time a data page in the old area is accessed, the following check is made:
    • If the data page has been in the LRU linked list for more than 1 second, it is moved to the head of the list;
    • If the data page has been in the LRU linked list for less than 1 second, its position stays unchanged. This 1 second is controlled by the parameter innodb_old_blocks_time, whose default value is 1000, in milliseconds.

This strategy is tailored to operations like full table scans. Taking the scan of the 200 GB historical data table again, the improved LRU algorithm works as follows:

  1. During the scan, every newly inserted data page is placed in the old area;
  2. A data page holds multiple records, so the page is accessed many times. However, because the scan is sequential, the interval between the first and last access to the page does not exceed 1 second, so the page stays in the old area;
  3. As the scan moves on to later data, the earlier data page is never accessed again, so it has no chance to move to the head of the linked list (that is, the young area) and is evicted soon.

As you can see, the biggest benefit of this strategy is that although scanning the large table also uses the buffer pool, it has no impact on the young area, which protects the buffer pool hit rate for normal business queries.
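
To see why the scan leaves the young area alone, here is a minimal simulation of the improved list. It is a sketch under stated assumptions, not InnoDB's implementation: MidpointLRU is a hypothetical class, the pool holds only 8 pages, time is passed in explicitly instead of read from a clock, and the page names (H for hot, S for scanned) are invented.

```python
class MidpointLRU:
    """Toy model of the improved LRU: young area (5/8) followed by old area (3/8)."""

    def __init__(self, capacity, old_blocks_time=1.0):
        self.capacity = capacity
        self.young_size = capacity * 5 // 8   # list index that LRU_old points to
        self.old_blocks_time = old_blocks_time
        self.pages = []                        # index 0 is the head of the list
        self.loaded_at = {}                    # page -> time it entered the list

    def access(self, page, now):
        if page not in self.loaded_at:
            if len(self.pages) >= self.capacity:
                victim = self.pages.pop()      # evict from the tail
                del self.loaded_at[victim]
            # New pages are inserted at LRU_old, not at the head.
            self.pages.insert(min(self.young_size, len(self.pages)), page)
            self.loaded_at[page] = now
            return
        idx = self.pages.index(page)
        in_old_area = idx >= self.young_size
        if in_old_area and now - self.loaded_at[page] < self.old_blocks_time:
            return                             # accessed again too soon: stay put
        self.pages.insert(0, self.pages.pop(idx))  # promote to the head


lru = MidpointLRU(capacity=8)
for i in range(5):
    lru.access(f"H{i}", now=0.0)          # hot business pages fill the young area

t = 10.0
for i in range(100):                       # sequential scan of a cold table
    lru.access(f"S{i}", now=t)             # first read: inserted at LRU_old
    lru.access(f"S{i}", now=t + 0.001)     # more rows of the same page, well under 1 second
    t += 0.01

young = lru.pages[:lru.young_size]
old = lru.pages[lru.young_size:]
print(young, old)
```

After 100 scanned pages the young area still holds the five hot pages, and only the old area has been churned by the scan.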

Question 1:
If you see that the value of State is always “sending to client”, it means that the network stack on the server side is full. How should this be handled?

  1. Use the mysql_store_result interface to save the query results directly in local memory.
  2. Optimize the query results and evaluate whether returning so many results is reasonable.
  3. If you want to quickly reduce the number of threads in this state, setting the net_buffer_length parameter to a larger value is an option.
    “Sending to client” means that the network stack on the server side is full, so should the socket send buffer not be increased instead? What does a larger net_buffer_length have to do with it? If net_buffer_length is larger but the socket send buffer is small, is the network stack not still full?
  • The maximum value of net_buffer_length is 1 GB, which is larger than the socket send buffer (generally a few MB).
  • For example, suppose the average query result of a business is 10 MB (of course, if the business itself has a problem, it should ultimately be solved on the business side).
  • If you change net_buffer_length to 10 MB, “sending to client” no longer appears. The network stack is still slow to send, but the data not yet sent is cached in net_buffer, so from the executor’s point of view it has already been “written”.

Question 2:
If the client can not receive data due to too much pressure, what serious impact will it have on the server?

The core of this problem is that it creates “long transactions”.

To understand the impact of long transactions, we need to combine what we know about locks and MVCC.

  • If the earlier statements performed updates, they are holding row locks, which blocks updates from other statements;
  • Read-only transactions are a problem too: the undo log cannot be reclaimed, so the rollback segment space keeps growing.

Can I use join?

In actual production, the problems about the use of join statements are generally concentrated in the following two categories:

  1. Our DBA does not allow the use of join. What’s the problem with using join?
  2. If there are two tables with different sizes to join, which table should be used as the driver table?

Create two tables, t1 and t2:

CREATE TABLE `t2` (
  `id` int(11) NOT NULL,
  `a` int(11) DEFAULT NULL,
  `b` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `a` (`a`)
) ENGINE=InnoDB;

drop procedure if exists idata;
delimiter ;;
create procedure idata()
begin
  declare i int;
  set i=1;
  while(i<=1000)do
    insert into t2 values(i, i, i);
    set i=i+1;
  end while;
end;;
delimiter ;
call idata();

create table t1 like t2;
insert into t1 (select * from t2 where id<=100);

Both tables have a primary key index on id and an index a on field a; field b has no index. The stored procedure idata() inserts 1000 rows into table t2, and the first 100 of those rows are then inserted into table t1.

Index Nested-Loop Join

select * from t1 straight_join t2 on (t1.a=t2.a);

If the join statement is used directly, the MySQL optimizer may choose either table t1 or table t2 as the driver table, which would interfere with our analysis of how the SQL statement executes. Therefore, to make it easier to analyze performance issues during execution, we use straight_join so that MySQL joins the tables in the fixed order written, and the optimizer joins only in the way we specify. In this statement, t1 is the driver table and t2 is the driven table.

The explain result of this statement is as follows:

You can see that in this statement, there is an index in field a of driven table t2, which is used by the join process. Therefore, the execution process of this statement is as follows:

  1. Read a row of data r from table t1;
  2. From the data row R, take out field a and look it up in table t2;
  3. Take out the rows that meet the conditions in table t2 and form a row with R as a part of the result set;
  4. Repeat steps 1 to 3 until the loop reaches the end of table t1.

This process traverses table t1 first and then, using the value of a in each row taken from t1, looks up the matching records in table t2. In form, this is similar to the nested loops we write in programs, and it can use the index of the driven table, so we call it “Index Nested-Loop Join”, or NLJ for short.

The corresponding flow chart is as follows:

In this process:

  1. The driver table t1 is scanned in full, 100 rows;
  2. For each row R, table t2 is searched by field a, which is a tree search. Because the data we constructed corresponds one to one, each search scans only one row, for a total of 100 rows;
  3. Therefore, the whole execution process scans 200 rows in total.
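
The row counts above can be checked with a small Python sketch of the same loop. It is an illustration only: rows are plain (id, a, b) tuples, and a dict stands in for index a on t2, which in InnoDB is a B+ tree rather than a hash table.

```python
from collections import defaultdict

# Rows are (id, a, b) tuples mirroring t2 (1000 rows) and t1 (its first 100 rows).
t2 = [(i, i, i) for i in range(1, 1001)]
t1 = [row for row in t2 if row[0] <= 100]

# Stand-in for index a on t2: value of a -> matching rows.
t2_index_a = defaultdict(list)
for row in t2:
    t2_index_a[row[1]].append(row)


def index_nested_loop_join(driver, driven_index):
    result, scanned = [], 0
    for r in driver:                                # step 1: read a row R from the driver table
        scanned += 1
        for match in driven_index.get(r[1], []):    # step 2: look up R.a through the index
            scanned += 1
            result.append(r + match)                # step 3: combine R with the matching row
    return result, scanned


rows, scanned = index_nested_loop_join(t1, t2_index_a)
print(len(rows), scanned)
```

The sketch returns 100 joined rows and counts 200 rows scanned: 100 in the driver table plus one index hit per driver row.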

Can I use join?

If we do not use join, we can only use single-table queries. Let’s see how to meet the requirement of the statement above with single-table queries only.

  1. Execute select * from t1 to retrieve all the data in table t1, which is 100 rows;
  2. Loop through the 100 rows of data:
    • Take the value of field a, $R.a, from each row R;
    • Execute select * from t2 where a=$R.a;
    • The returned result and R form a row of the result set.

As you can see, this approach also scans 200 rows, but it executes 101 statements in total, 100 more interactions than the direct join. In addition, the client has to assemble the SQL statements and the results itself.

Obviously, this is not as good as a direct join.

How to select driver table?

During the execution of this join statement, the driver table is scanned in full, while the driven table is searched through the tree.

Suppose the number of rows in the driven table is M. Every lookup of a row in the driven table has to search index a first and then the primary key index. Each tree search has an approximate complexity of the base-2 logarithm of M, written log2(M). So the time complexity of looking up one row in the driven table is 2 * log2(M).

Assuming the number of rows in the driver table is N, the execution process scans N rows of the driver table and then, for each row, performs one lookup on the driven table.

Therefore, the approximate complexity of the whole execution process is N + N * 2 * log2(M).

Obviously, N has the greater impact on the number of rows scanned, so the small table should be used as the driver table.

If the impact does not seem so “obvious”, think of it this way: when N grows 1000 times, the number of rows scanned grows 1000 times; when M grows 1000 times, the number of rows scanned grows by less than 10 times.
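
The claim about growth factors is easy to verify numerically. The snippet below simply evaluates the approximate formula N + N * 2 * log2(M); the function name nlj_cost and the starting sizes (N = 100, M = 1000, taken from the example tables) are choices made for this illustration.

```python
import math


def nlj_cost(n, m):
    """Approximate cost N + N * 2 * log2(M) for N driver rows and M driven rows."""
    return n + n * 2 * math.log2(m)


base = nlj_cost(100, 1000)
n_growth = nlj_cost(100 * 1000, 1000) / base   # N grows 1000 times
m_growth = nlj_cost(100, 1000 * 1000) / base   # M grows 1000 times

print(round(n_growth), round(m_growth, 2))
```

Growing N by 1000 times scales the cost by 1000, while growing M by 1000 times only about doubles it, because M appears inside the logarithm.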

To sum up, two conclusions can be drawn from the analysis above:

  1. A join statement performs better than forcibly splitting it into multiple single-table SQL statements;
  2. If you use a join statement, the small table should be the driver table.

However, note that these conclusions rest on the premise that “the index of the driven table can be used”.

Simple Nested-Loop Join

The SQL statement is as follows:

select * from t1 straight_join t2 on (t1.a=t2.b);

Since field b of table t2 has no index, when we follow the execution process in Figure 2 again, every lookup in t2 has to be a full table scan.

Think about this first: if we keep using the algorithm in Figure 2, do we get the correct result? Judging by the result alone, the algorithm is correct, and it even has a name: “Simple Nested-Loop Join”.

However, this way the SQL request scans table t2 up to 100 times, 100 * 1000 = 100000 rows in total.

And these are just two small tables. If t1 and t2 each have 100000 rows (which still counts as small), we would have to scan 10 billion rows. This algorithm looks far too “clumsy”.

Of course, MySQL does not use this Simple Nested-Loop Join algorithm. Instead it uses another algorithm called “Block Nested-Loop Join”, BNL for short.

Block Nested-Loop Join

This time there is no usable index on the driven table. The flow of the algorithm is as follows:

  1. Read the data of table t1 into the thread memory join_buffer. Because the statement says select *, the whole of table t1 is put into memory;
  2. Scan table t2, take out each row of t2, and compare it with the data in join_buffer. Rows that satisfy the join condition are returned as part of the result set.

The flow chart of this process is as follows:

The explain result of this SQL statement is as follows:

As you can see, in this process both t1 and t2 are scanned in full once, so the total number of rows scanned is 1100. Because join_buffer is organized as an unordered array, each row of table t2 requires 100 comparisons, so the total number of comparisons made in memory is 100 * 1000 = 100000.

If the Simple Nested-Loop Join algorithm were used for this query, the number of rows scanned would also be 100000. So in terms of time complexity the two algorithms are the same. However, the 100000 comparisons in Block Nested-Loop Join are memory operations, so it is much faster and performs better.

In this case, which table should be selected as the driver table?

Suppose the small table has N rows and the large table has M rows. Then in this algorithm:

  1. Both tables are scanned in full once, so the total number of rows scanned is M + N;
  2. The number of comparisons in memory is M * N.

As you can see, swapping M and N makes no difference to either formula. Therefore, at this point it takes the same time whether the large table or the small table is chosen as the driver table.

In this example, table t1 has only 100 rows. What happens if t1 is a large table that does not fit in join_buffer?

The size of join_buffer is set by the parameter join_buffer_size, whose default value is 256K. If all the data of table t1 cannot fit, the strategy is simple: put it in segment by segment. I changed join_buffer_size to 1200 and executed again:

select * from t1 straight_join t2 on (t1.a=t2.b);

The execution process becomes:

  1. Scan table t1 and read data rows in order into join_buffer. After row 88, join_buffer is full, so continue with step 2;
  2. Scan table t2, take out each row of t2, and compare it with the data in join_buffer. Rows that satisfy the join condition are returned as part of the result set;
  3. Empty join_buffer;
  4. Continue scanning table t1, read the last 12 rows in order into join_buffer, and continue with step 2.

The execution flow chart is as follows:

Steps 4 and 5 in the figure show join_buffer being emptied and reused.

This process is where the “Block” in the algorithm’s name comes from: the join is done block by block.

As you can see, because table t1 is put into join_buffer in two segments, table t2 is scanned twice. Although the data goes into join_buffer in two batches, the number of equality comparisons stays the same: (88 + 12) * 1000 = 100000.
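
The one-segment and two-segment cases can be compared with a short Python sketch. It models join_buffer as a list that holds a fixed number of rows, which is an assumption made for simplicity: the real buffer is sized in bytes by join_buffer_size, and the 88-row capacity is just the figure observed in the example above.

```python
def block_nested_loop_join(driver, driven, buffer_rows):
    """Join on driver.a == driven.b; the buffer holds at most buffer_rows driver rows."""
    result = []
    stats = {"rows_scanned": 0, "comparisons": 0, "driven_scans": 0}
    for start in range(0, len(driver), buffer_rows):
        # Taking a fresh slice models emptying the buffer and refilling it.
        join_buffer = driver[start:start + buffer_rows]
        stats["rows_scanned"] += len(join_buffer)
        stats["driven_scans"] += 1
        for d in driven:                       # full scan of the driven table
            stats["rows_scanned"] += 1
            for r in join_buffer:              # unordered array: check every buffered row
                stats["comparisons"] += 1
                if r[1] == d[2]:
                    result.append(r + d)
    return result, stats


t2 = [(i, i, i) for i in range(1, 1001)]       # rows are (id, a, b)
t1 = t2[:100]

whole_rows, whole = block_nested_loop_join(t1, t2, buffer_rows=100)
split_rows, split = block_nested_loop_join(t1, t2, buffer_rows=88)
print(whole)
print(split)
```

With room for all 100 rows, t2 is scanned once and 1100 rows are read; with room for 88 rows, t2 is scanned twice and 2100 rows are read, while the number of comparisons is 100000 in both runs.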

Let’s look at the selection of the driver table in this case.

Suppose the driver table has N rows and the algorithm needs K segments to complete, and the driven table has M rows.

Note that K is not a constant here. The larger N is, the larger K will be, so K is expressed as λ * N. Obviously, λ lies in the range (0, 1).

Therefore, during the execution of this algorithm:

  1. The number of rows scanned is N + λ * N * M;
  2. The number of in-memory comparisons is N * M.

Obviously, the number of in-memory comparisons is not affected by which table is chosen as the driver table. As for the number of rows scanned, once the sizes of M and N are fixed, the smaller N is, the smaller the result of the whole formula.

So the conclusion is that the small table should be used as the driver table.

You will also notice that in the formula N + λ * N * M, λ is the key factor affecting the number of rows scanned: the smaller it is, the better.

I just said that the larger N is, the larger the number of segments K. So when N is fixed, which parameter affects K (that is, λ)? The answer is join_buffer_size. The larger join_buffer_size is, the more rows fit at a time, the fewer segments are needed, and the fewer full scans of the driven table are performed.

That is why you may have seen the advice: if your join statement is slow, make join_buffer_size larger.
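
As a rough numeric check of the formula, the snippet below computes N + K * M with K = ceil(N / buffer capacity). The helper name bnl_rows_scanned is hypothetical, and the buffer capacity is expressed in rows (88, as in the earlier example) on the assumption that both tables have the same row width.

```python
import math


def bnl_rows_scanned(driver_rows, driven_rows, buffer_rows):
    """N + K * M, where K = ceil(N / buffer_rows) is the number of segments."""
    segments = math.ceil(driver_rows / buffer_rows)
    return driver_rows + segments * driven_rows


small, large = 100, 1000

small_drives = bnl_rows_scanned(small, large, buffer_rows=88)    # 100 + 2 * 1000
large_drives = bnl_rows_scanned(large, small, buffer_rows=88)    # 1000 + 12 * 100
bigger_buffer = bnl_rows_scanned(small, large, buffer_rows=100)  # 100 + 1 * 1000

print(small_drives, large_drives, bigger_buffer)
```

Using the small table as the driver scans fewer rows than the reverse order, and a buffer large enough to hold the whole driver table brings the count down to a single pass over each table.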

The first question: can I use the join statement?

  1. If the Index Nested-Loop Join algorithm can be used, that is, the index on the driven table can be used, there is actually no problem;
  2. If the Block Nested-Loop Join algorithm is used, too many rows are scanned. In particular, a join on large tables may scan the driven table many times and consume a lot of system resources. So try not to use this kind of join.

Therefore, to judge whether a join statement is acceptable, check whether the words “Block Nested Loop” appear in the Extra field of the explain result.

The second question: if you use join, should the large table or the small table be the driver table?

  1. With the Index Nested-Loop Join algorithm, the small table should be the driver table;
  2. With the Block Nested-Loop Join algorithm:
    • When join_buffer_size is large enough, it makes no difference;
    • When join_buffer_size is not large enough (the more common case), the small table should be the driver table.

So the conclusion for this question is that the small table should always be the driver table.

The previous example has no filter conditions. Suppose we add the condition t2.id<=50 to the where clause, and look at the following two statements:

select * from t1 straight_join t2 on (t1.b=t2.b) where t2.id<=50;
select * from t2 straight_join t1 on (t1.b=t2.b) where t2.id<=50;

Note that the join uses field b, which has no index, so that the driven table in both statements cannot use an index.

With the second statement, join_buffer only needs to hold the first 50 rows of t2, which is obviously better. So here, “the first 50 rows of t2” is the relatively smaller table, that is, the “small table”.

Let’s look at another pair of examples:

select t1.b,t2.* from  t1  straight_join t2 on (t1.b=t2.b) where t2.id<=100;
select t1.b,t2.* from  t2  straight_join t1 on (t1.b=t2.b) where t2.id<=100;

In this example, tables t1 and t2 each have only 100 rows taking part in the join. However, the data the two statements put into join_buffer on each query is different:

  • Table t1 only needs field b, so if t1 is put into join_buffer, only the value of b has to be stored;
  • Table t2 needs all its fields, so if t2 is put into join_buffer, the three fields id, a, and b have to be stored.

Here, table t1 should be chosen as the driver table. In other words, in this example, “table t1 with only one column taking part in the join” is the relatively smaller table.

So, to be more precise: when deciding which table should be the driver table, filter each of the two tables by its own conditions, then calculate the total data volume of the fields taking part in the join. The table with the smaller data volume is the “small table” and should be the driver table.

How to optimize join statement

Create two tables, t1 and t2:

create table t1(id int primary key, a int, b int, index(a));
create table t2 like t1;
drop procedure if exists idata;
delimiter ;;
create procedure idata()
begin
  declare i int;
  set i=1;
  while(i<=1000)do
    insert into t1 values(i, 1001-i, i);
    set i=i+1;
  end while;

  set i=1;
  while(i<=1000000)do
    insert into t2 values(i, i, i);
    set i=i+1;
  end while;
end;;
delimiter ;
call idata();

Table t1 gets 1000 rows, with a = 1001 - id in each row, so field a in table t1 is in reverse order. At the same time, 1 million rows are inserted into table t2.

Multi range read optimization

Multi-Range Read optimization (MRR). The main purpose of this optimization is to use sequential disk reads as much as possible.

First, a recap of “returning to the table”: InnoDB finds the value of primary key id in the secondary index a, and then uses that id value to look up the whole row in the primary key index.

Does returning to the table look up rows one by one, or in batches?

Suppose you execute this statement:

select * from t1 where a>=1 and a<=100;

The primary key index is a B+ tree, in which each lookup by a primary key id can find only one row. Therefore, returning to the table has to search the primary key index row by row. The basic process is shown in Figure 1.

If the query proceeds in increasing order of a, the id values become random, which leads to random access and relatively poor performance. Although the row-by-row lookup mechanism cannot be changed, the lookups can be sped up by adjusting their order.

Because most data is inserted in increasing primary key order, we can assume that querying in increasing primary key order makes the disk reads closer to sequential reads, which improves read performance.

This is the design idea of MRR optimization. At this point, the execution process of the statement becomes as follows:

  1. Using index a, locate the records that satisfy the condition and put their id values into read_rnd_buffer;
  2. Sort the ids in read_rnd_buffer in increasing order;
  3. Using the sorted id array, look up the records in the primary key index one by one and return them as the result.

Here, the size of read_rnd_buffer is controlled by the read_rnd_buffer_size parameter. If read_rnd_buffer fills up during step 1, steps 2 and 3 are executed first and read_rnd_buffer is then emptied. After that, the scan continues with the next record in index a, and the cycle repeats.

In addition, if you want MRR optimization to be used reliably, you need to set set optimizer_switch="mrr_cost_based=off". (According to the official documentation, when estimating cost the current optimizer strategy tends not to use MRR; setting mrr_cost_based to off makes it always use MRR.)
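
The three MRR steps can be sketched as follows. This is a simplified model, not the storage engine's code: a dict plays the primary key index, a sorted list of (a, id) pairs plays index a, and the buffer capacity is counted in ids rather than bytes. The function name range_query_with_mrr is invented for the illustration.

```python
# t1 from the setup above: id runs from 1 to 1000 and a = 1001 - id.
t1_primary = {i: (i, 1001 - i, i) for i in range(1, 1001)}             # id -> full row
t1_index_a = sorted((row[1], row[0]) for row in t1_primary.values())   # (a, id) in a order


def range_query_with_mrr(low, high, buffer_ids):
    result, lookup_order = [], []
    read_rnd_buffer = []

    def flush():
        read_rnd_buffer.sort()                      # step 2: sort the buffered ids
        for row_id in read_rnd_buffer:              # step 3: primary key lookups in id order
            lookup_order.append(row_id)
            result.append(t1_primary[row_id])
        read_rnd_buffer.clear()                     # empty the buffer for reuse

    # Step 1: walk index a. A real range scan would seek to low and stop after high.
    for a, row_id in t1_index_a:
        if low <= a <= high:
            read_rnd_buffer.append(row_id)
            if len(read_rnd_buffer) == buffer_ids:  # buffer full: run steps 2 and 3 first
                flush()
    flush()
    return result, lookup_order


rows, order = range_query_with_mrr(1, 100, buffer_ids=1000)
print(order[:3], order[-3:])
```

Index a yields the ids in descending order (1000 down to 901), but after sorting, the primary key lookups run from 901 up to 1000, which is the access pattern MRR is after.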

The following two figures are the execution process and explain results after MRR optimization.

From the explain result in Figure 3, we can see that the Extra field now contains Using MRR, which means MRR optimization is used. Moreover, because read_rnd_buffer is sorted by id, the final result set is also in increasing order of primary key id, the opposite of the row order in the result set of Figure 1.

The core reason MRR can improve performance is that this query does a range query on index a (that is, a multi-value query) and can collect enough primary key ids. Only then does sorting them before looking up the primary key index show the advantage of “sequential” access.

Batched Key Access

Once you understand how MRR improves performance, you can understand the Batched Key Access (BKA) algorithm that MySQL introduced starting with version 5.6. BKA is essentially an optimization of the NLJ algorithm.

Flow chart of NLJ algorithm:

The logic of the NLJ algorithm is: take the value of a from driver table t1 row by row, then join against driven table t2. In other words, table t2 is matched against one value at a time, so the advantage of MRR cannot be used.

So how can we send more values to table t2 at one time? The method is to take more rows from table t1 and pass them to table t2 together.

To do that, part of the data of table t1 is taken out and placed in temporary memory first. This temporary memory is none other than join_buffer.

In the BNL algorithm, the role of join_buffer is to temporarily store the data of the driver table, but it is not used in the NLJ algorithm. So join_buffer can simply be reused for the BKA algorithm.

Figure 5 shows the flow of the BKA algorithm, obtained by optimizing the NLJ algorithm above.

In the figure, the data in join_buffer is P1 to P100, meaning that only the fields needed by the query are stored. Of course, if join_buffer cannot hold all of P1 to P100, the 100 rows are split into multiple segments and the process above is run for each.

So, how to enable the BKA algorithm?

To use the BKA optimization algorithm, you need to set the following before executing the SQL statement:

set optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';

The first two parameters are used to enable MRR. The reason is that the optimization of BKA algorithm depends on MRR.
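
The batching idea behind BKA can be illustrated with a small Python sketch. It is a loose model under several assumptions: join_buffer holds a fixed number of rows (40 here, an arbitrary figure), a dict stands in for index a on the driven table, and sorting each batch of keys stands in for the ordering that MRR provides. The function name batched_key_access_join is hypothetical.

```python
from collections import defaultdict


def batched_key_access_join(driver, driven_index, buffer_rows):
    """Join driver.a to the driven table through its index on a, one batch at a time."""
    result, batches = [], []
    for start in range(0, len(driver), buffer_rows):
        join_buffer = driver[start:start + buffer_rows]       # a block of driver rows
        keys = sorted({r[1] for r in join_buffer})            # values of a, passed as one batch
        batches.append(keys)
        matches = {k: driven_index.get(k, []) for k in keys}  # batched index lookups
        for r in join_buffer:
            for d in matches[r[1]]:
                result.append(r + d)
    return result, batches


t1 = [(i, 1001 - i, i) for i in range(1, 101)]   # first 100 rows of t1: (id, a, b)
t2_index_a = defaultdict(list)
for i in range(1, 1001):                          # a 1000-row stand-in for t2
    t2_index_a[i].append((i, i, i))

rows, batches = batched_key_access_join(t1, t2_index_a, buffer_rows=40)
print(len(rows), [len(b) for b in batches])
```

Compared with NLJ, the driven table receives whole batches of keys (40, 40, and 20 here) instead of one value per lookup, which is what gives MRR something to sort.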

Performance of BNL algorithm

Question: When the Block Nested-Loop Join (BNL) algorithm is used, the driven table may be scanned several times. If the driven table is a large cold data table, what else does this do to the system besides causing heavy IO pressure?

InnoDB optimizes the LRU algorithm of the buffer pool: a data page read from disk into memory for the first time is placed in the old area first. If the page is not accessed again after 1 second, it is not moved to the head of the LRU linked list, so the impact on the buffer pool hit rate is small.

However, if a join statement using the BNL algorithm scans a cold table multiple times and the statement runs for more than 1 second, the data pages of the cold table are moved to the head of the LRU linked list when the cold table is scanned again.

This corresponds to the case where the cold table is smaller than 3/8 of the whole buffer pool and fits entirely in the old area.

If the cold table is very large, a different situation arises: data pages that the business accesses normally have no chance to enter the young area.

Because of the optimization mechanism, a normally accessed data page must be accessed again after an interval of 1 second before it can enter the young area. But since our join statement keeps reading from disk and evicting memory pages, a page that enters the old area is likely to be evicted within 1 second. As a result, the data pages in the young area of this MySQL instance’s buffer pool are not evicted in a reasonable way during this period.

In other words, both situations affect the normal operation of the buffer pool.

Although a large table join affects IO, that impact ends when the statement finishes. The impact on the buffer pool, however, persists, and the memory hit rate only recovers slowly as later queries come in.

To reduce this impact, consider increasing join_buffer_size to reduce the number of scans of the driven table.

In other words, the BNL algorithm affects the system in three main ways:

  1. It may scan the driven table several times and occupy disk IO resources;
  2. Checking the join condition requires M * N comparisons (M and N are the numbers of rows of the two tables). For large tables this consumes a lot of CPU resources;
  3. It may cause hot data in the buffer pool to be evicted, which hurts the memory hit rate.

Before executing a statement, confirm through theoretical analysis and the explain result whether it will use the BNL algorithm. If the optimizer will use BNL, the statement needs optimizing. The common approach is to add an index on the join field of the driven table, turning the BNL algorithm into the BKA algorithm.


In some cases, the index can be built directly on the driven table, and then it can be directly converted to BKA algorithm.

However, sometimes we do encounter some situations that are not suitable for building an index on the driven table. For example, the following statement:

select * from t1 join t2 on (t1.b=t2.b) where t2.b>=1 and t2.b<=2000;

One million rows of data are inserted into table t2, but after filtering by where condition, only 2000 rows of data need to participate in the join. If this statement is also a low-frequency SQL statement, it will be a waste to create an index for this statement in field B of table t2.

However, if the BNL algorithm is used for the join, the statement executes as follows:

  1. Take out all the fields of table t1 and store them in join_buffer. This table has only 1000 rows, and the default value of join_buffer_size is 256K, so the whole table fits.
  2. Scan table t2, take out each row, and compare it with the data in join_buffer:
    • If t1.b=t2.b is not satisfied, skip the row;
    • If t1.b=t2.b is satisfied, check the other condition, that is, whether t2.b is in [1, 2000]. If it is, the row is returned as part of the result set; otherwise it is skipped.

For each row of table t2, judging whether the join condition is satisfied requires traversing all rows in join_buffer. Therefore, the number of equality comparisons is 1,000 * 1 million = 1 billion, which is a heavy workload.
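As a quick check of this arithmetic, the sketch below counts comparisons in a simplified Block Nested-Loop join. The function name bnl_join_count and the scaled-down sample data are illustrative assumptions, not MySQL internals.

```python
def bnl_join_count(join_buffer, driven_rows):
    """Simplified BNL: compare every driven row with every buffered row."""
    comparisons = 0
    matches = []
    for driven in driven_rows:          # one pass over the driven table
        for buffered in join_buffer:    # full traversal of join_buffer
            comparisons += 1
            if buffered["b"] == driven["b"]:
                matches.append((buffered, driven))
    return comparisons, matches


# Scaled-down data: 10 rows in the buffer, 100 rows in the driven table.
t1 = [{"id": i, "b": i} for i in range(1, 11)]
t2 = [{"id": i, "b": i} for i in range(1, 101)]

comparisons, matches = bnl_join_count(t1, t2)
print(comparisons)        # 10 * 100 = 1000
print(len(matches))       # 10

# The same formula at the scale described in the text:
# 1,000 buffered rows of t1 times 1 million rows of t2.
print(1_000 * 1_000_000)  # 1000000000
```

The comparison count depends only on the product of the two row counts, which is why filtering t2 down to 2000 rows before joining pays off so much.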


As you can see, the Extra field in the explain result shows that the BNL algorithm is used. In the test environment, this statement takes 1 minute and 11 seconds to execute.

Creating an index on field b of table t2 wastes resources, but without an index, the equality condition of this statement has to be judged one billion times, which is also a waste. So, is there a way to have the best of both worlds?

In this situation, you can consider using a temporary table. The general idea is as follows:

  1. Put the rows of table t2 that satisfy the condition into the temporary table temp_t;
  2. To let the join use the BKA algorithm, add an index on field b of the temporary table temp_t;
  3. Join table t1 with temp_t.

At this time, the corresponding SQL statement is written as follows:

create temporary table temp_t(id int primary key, a int, b int, index(b))engine=innodb;
insert into temp_t select * from t2 where b>=1 and b<=2000;
select * from t1 join temp_t on (t1.b=temp_t.b);

Figure 8 shows the execution effect of this statement sequence.


It can be seen that the three statements together take less than 1 second. Compared with the previous 1 minute and 11 seconds, performance has improved greatly. Next, let’s look at the cost of this process:

  1. The insert statement that builds temp_t and fills it with data scans the whole of table t2, so the number of rows scanned here is 1 million.
  2. The join statement then scans table t1, where the number of rows scanned is 1000; during the join comparison, 1000 indexed lookups are made. Compared with the join statement before optimization, which needed 1 billion comparisons, the improvement is obvious.

In general, whether we add an index to the original table or use a temporary table with an index, the idea is to let the join statement use an index on the driven table, which triggers the BKA algorithm and improves query performance.

Extension: hash join

In fact, performing 1 billion comparisons seems a bit silly. If join_buffer maintained a hash table instead of an unordered array, it would take 1 million hash lookups rather than 1 billion comparisons. The whole statement would then execute much faster, right?

This is one of the reasons why the MySQL optimizer and executor have been criticized: they did not support hash join, and for a long time MySQL’s official roadmap did not put this optimization on the agenda. (Hash join was eventually added in MySQL 8.0.18.)

In fact, we can implement this optimization on the business side ourselves. The process is as follows:

  1. select * from t1; Fetch all 1000 rows of table t1 and store them in a hash structure on the business side, such as a set in C++ or an associative array in PHP.
  2. select * from t2 where b>=1 and b<=2000; Fetch the 2000 rows of table t2 that meet the condition.
  3. On the business side, take these 2000 rows one by one and look up the matching data in the hash structure. Each row that satisfies the matching condition becomes a row of the result set.
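The three steps above can be sketched on the business side roughly as follows. This is a minimal sketch, not the author's code: the function name hash_join is hypothetical, and the two in-memory lists stand in for the result sets that real code would fetch with the two select statements (the sample data assumes b equals id).

```python
from collections import defaultdict


def hash_join(t1_rows, t2_rows, key="b"):
    """Join two lists of row dicts on `key` using an in-memory hash table."""
    # Build phase: index the smaller table (t1) by the join key.
    index = defaultdict(list)
    for row in t1_rows:
        index[row[key]].append(row)

    # Probe phase: one hash lookup per row fetched from t2.
    result = []
    for row in t2_rows:
        for match in index.get(row[key], []):
            result.append((match, row))
    return result


# Stand-ins for "select * from t1" (1000 rows) and
# "select * from t2 where b>=1 and b<=2000" (2000 rows).
t1_rows = [{"id": i, "b": i} for i in range(1, 1001)]
t2_rows = [{"id": i, "b": i} for i in range(1, 2001)]

joined = hash_join(t1_rows, t2_rows)
print(len(joined))  # 1000
```

With this layout the probe phase performs 2000 hash lookups instead of 1000 * 2000 pairwise comparisons.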

Consider the following three-table join statement:

select * from t1 join t2 on(t1.a=t2.a) join t3 on (t2.b=t3.b) where t1.c>=X and t2.c>=Y and t3.c>=Z;

If you rewrite it with straight_join, how should you specify the join order, and which indexes should you create on the three tables?

The first principle is to use the BKA algorithm as much as possible. Note that with BKA, MySQL does not “first compute the result of joining two tables, and then join it with the third table”; it nests the queries directly.

Concretely: among the three conditions t1.c>=X, t2.c>=Y and t3.c>=Z, select the table with the least data after filtering as the first driving table. Two situations may then occur.

In the first case, if the selected table is t1 or t3, the rest is fixed.

  1. If the driving table is t1, the join order is t1 -> t2 -> t3, and indexes should be created on the driven-table fields, that is, t2.a and t3.b;
  2. If the driving table is t3, the join order is t3 -> t2 -> t1, and indexes need to be created on t2.b and t1.a.

At the same time, you need to create an index on field c of the first driving table.

In the second case, if the first driving table selected is t2, the filtering effect of the other two conditions needs to be evaluated.

In short, the overall idea is to make the data set of the driving table as small as possible at every join step.

Why can temporary tables have duplicate names

Earlier, we used a temporary table to optimize a join query:

create temporary table temp_t like t1;
alter table temp_t add index(b);
insert into temp_t select * from t2 where b>=1 and b<=2000;
select * from t1 join temp_t on (t1.b=temp_t.b);

Difference between temporary table and memory table:

  • A memory table is a table that uses the Memory engine. The syntax for creating one is create table ... engine=memory. The data of such a table is stored in memory and is cleared when the system restarts, but the table structure still exists. Apart from these two features, it behaves like a normal table.
  • A temporary table can use various engine types. If a temporary table uses the InnoDB or MyISAM engine, its data is written to disk. Of course, temporary tables can also use the Memory engine.

Characteristics of temporary tables

Take a look at the following sequence of operations:


It can be seen that the temporary table has the following characteristics in use:

  1. The syntax for creating a table is create temporary table.
  2. A temporary table can only be accessed by the session that created it and is not visible to other threads. Therefore, the temporary table t created by session A is invisible to session B.
  3. A temporary table can have the same name as a normal table.
  4. When session A has a temporary table and an ordinary table with the same name, the show create statement and the insert, delete, update and select statements access the temporary table.
  5. The show tables command does not display temporary tables.

Since a temporary table can only be accessed by the session that created it, it is automatically deleted at the end of the session. Because of this characteristic, the temporary table is especially suitable for the join optimization scenario described earlier. Why?

There are two main reasons:

  1. Temporary tables of different sessions can have duplicate names. If more than one session performs the join optimization at the same time, there is no need to worry about table creation failing because of duplicate table names.
  2. There is no need to worry about cleaning up data. With ordinary tables, if the client disconnects unexpectedly or the database restarts abnormally in the middle of the process, the tables generated along the way have to be cleaned up separately. Temporary tables are reclaimed automatically, so this extra operation is not needed.

Application of temporary table

Because there is no need to worry about name conflicts between threads, temporary tables are often used when optimizing complex queries. Cross-database queries in systems that split databases and tables are a typical scenario.

The usual scenario for splitting databases and tables is to distribute one large logical table across different database instances. For example, a large table ht is divided into 1024 sub-tables according to field f, which are then distributed across 32 database instances. As shown in the figure below:


In general, this kind of split-database, split-table system has a middle proxy layer. However, some solutions let clients connect to the databases directly, that is, there is no proxy layer.

In this architecture, the partition key is chosen so as to “reduce cross-database and cross-table queries”. If most statements contain an equality condition on f, then f is used as the partition key. In this way, after the proxy layer parses the SQL statement, it can determine which sub-table to route the statement to.

For example, the following statement:

select v from ht where f=N;

Here, the sub-table rule (for example, N % 1024) tells us which sub-table holds the required data. This kind of statement only needs to access one sub-table, and it is the form of statement that split-database, split-table schemes welcome most.
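A minimal sketch of such a routing rule is shown below. The modulo rule comes from the text; the function name route, the ht_<number> naming of sub-tables, and the assignment of sub-tables to instances in contiguous blocks of 32 are assumptions made only for illustration.

```python
SUB_TABLE_COUNT = 1024
INSTANCE_COUNT = 32
TABLES_PER_INSTANCE = SUB_TABLE_COUNT // INSTANCE_COUNT  # 32 sub-tables each


def route(f_value):
    """Return (instance index, sub-table name) for a given value of f."""
    sub_table = f_value % SUB_TABLE_COUNT
    # Assumption: sub-tables are assigned to instances in contiguous blocks.
    instance = sub_table // TABLES_PER_INSTANCE
    return instance, f"ht_{sub_table}"


print(route(5))     # (0, 'ht_5')
print(route(2050))  # (0, 'ht_2')
print(route(1023))  # (31, 'ht_1023')
```

A statement with an equality condition on f can therefore be sent to exactly one sub-table on exactly one instance.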

However, if there is another index k on this table, and the query statement is as follows:

select v from ht where k >= M order by t_modified desc limit 100;

Here, because the partition field f is not used in the query condition, the only option is to find all the rows that meet the condition in all the partitions and then perform the order by on the combined result. There are two common approaches.

The first approach is to implement the sorting in the code of the proxy layer.

The advantage of this approach is speed: once the data from each sub-database arrives, the computation happens directly in memory. However, its disadvantages are also obvious:

  1. It needs a lot of development work. The example statement is relatively simple. If complex operations are involved, such as group by or even join, the middle layer demands considerable development capability;
  2. It puts heavy pressure on the proxy side, which can easily run short of memory or hit a CPU bottleneck.
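For the simple order by ... limit case, the proxy-side sort could look roughly like the sketch below. It assumes each sub-database returns its rows already sorted by t_modified in descending order; the function name merge_top_n and the sample rows are hypothetical.

```python
import heapq
from itertools import islice


def merge_top_n(shard_results, n=100):
    """Merge per-shard results (each sorted by t_modified desc) and keep the first n."""
    merged = heapq.merge(
        *shard_results,
        key=lambda row: row["t_modified"],
        reverse=True,  # inputs are sorted from largest to smallest
    )
    return list(islice(merged, n))


# Hypothetical results from three sub-databases.
shard_a = [{"v": "a1", "t_modified": 90}, {"v": "a2", "t_modified": 40}]
shard_b = [{"v": "b1", "t_modified": 70}, {"v": "b2", "t_modified": 10}]
shard_c = [{"v": "c1", "t_modified": 80}]

top = merge_top_n([shard_a, shard_b, shard_c], n=3)
print([row["v"] for row in top])  # ['a1', 'c1', 'b1']
```

Even in this simple form, the proxy has to hold and merge rows from every sub-database, which illustrates where the memory and CPU pressure comes from.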

The other approach is to gather the data obtained from each sub-database into a table on one MySQL instance, and then perform the logical operations on that summary instance.

For example, the execution process of the statement above can look like this:

  • Create a temporary table temp_ht on the summary library, containing three fields: v, k and t_modified;
  • Execute on each sub-database
    select v,k,t_modified from ht_x where k >= M order by t_modified desc limit 100;
  • Insert the results from the sub-databases into table temp_ht;
  • Execute
    select v from temp_ht order by t_modified desc limit 100;

The corresponding flow chart of this process is as follows:


In practice, the computing capacity of each sub-database is often not saturated, so the temporary table temp_ht is placed directly on one of the 32 sub-databases. The query logic is similar to that in Figure 3.

Why can temporary tables have duplicate names?

Different threads can create temporary tables with the same name. How can this be done?


create temporary table temp_t(id int primary key)engine=innodb;

When this statement is executed, MySQL needs to create a frm file for the InnoDB table to save the table structure definition, as well as a place to store the table data.

The frm file is placed in the temporary file directory. The suffix of the file name is .frm, and the prefix is “#sql{process ID}_{thread ID}_{serial number}”. You can use the select @@tmpdir command to display the temporary file directory of the instance.

The table data is stored differently in different versions of MySQL:

  • In version 5.6 and earlier, MySQL creates a file with the same prefix and the suffix .ibd in the temporary file directory to store the data;
  • Since version 5.7, MySQL has a temporary tablespace dedicated to storing the data of temporary tables, so no ibd file needs to be created.

From the prefix rule of the file names, we can see that when we create an InnoDB temporary table called t1, MySQL treats the table we create as having a different name from the ordinary table t1. Therefore, even when an ordinary table t1 exists in the same library, we can still create a temporary table t1.

For instance, suppose the process ID of the MySQL process is 1234, the thread ID of session A is 4, and the thread ID of session B is 5. The temporary tables created by session A and session B then have different file names on disk.

Besides the physical files, MySQL also maintains an in-memory mechanism to distinguish different tables. Each table corresponds to a table_def_key.

  • The table_def_key of an ordinary table is derived from “library name + table name”. If you try to create two ordinary tables with the same name in the same library, MySQL finds during the creation of the second table that the table_def_key already exists.
  • For temporary tables, the table_def_key adds “server_id + thread_id” on top of “library name + table name”.

That is to say, the two temporary tables t1 created by session A and session B have different table_def_key values and different disk file names, so they can coexist.

In the implementation, each thread maintains its own linked list of temporary tables. Whenever a table is operated on within a session, MySQL first traverses this linked list to check whether a temporary table with that name exists. If one does, the temporary table takes priority; if not, the ordinary table is used. At the end of the session, “drop temporary table + table name” is executed for each temporary table in the linked list.
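The lookup order and the cleanup at session end can be modeled with a toy sketch like the one below. It is only an illustration of the behavior described above, not MySQL's implementation: the Session class, its method names and the use of a Python list in place of the linked list are all assumptions.

```python
class Session:
    """Toy model of a session that keeps its own list of temporary tables."""

    def __init__(self, normal_tables):
        self.normal_tables = normal_tables  # shared by all sessions
        self.temp_tables = []               # private to this session

    def create_temporary_table(self, name):
        self.temp_tables.append(name)

    def resolve(self, name):
        """Check the temporary-table list first, then fall back to normal tables."""
        if name in self.temp_tables:
            return ("temporary", name)
        if name in self.normal_tables:
            return ("normal", name)
        raise LookupError(f"table {name} does not exist")

    def close(self):
        """At session end, emit one drop statement per remaining temporary table."""
        statements = [f"DROP TEMPORARY TABLE {name}" for name in self.temp_tables]
        self.temp_tables.clear()
        return statements


normal = {"t1"}
session_a = Session(normal)
session_b = Session(normal)
session_a.create_temporary_table("t1")

print(session_a.resolve("t1"))  # ('temporary', 't1')
print(session_b.resolve("t1"))  # ('normal', 't1')
print(session_a.close())        # ['DROP TEMPORARY TABLE t1']
print(session_a.resolve("t1"))  # ('normal', 't1')
```

Session B never sees session A's temporary table because the list is private, and after close() session A falls back to the ordinary table.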

At this point, you will notice that the drop temporary table command is also recorded in binlog. This may seem strange: temporary tables can only be accessed within their own thread, so why write them to binlog?

This is where primary/standby replication comes in.

Temporary tables and primary/standby replication

Since the command is written to binlog, the standby database must need it.

Imagine executing the following sequence of statements on the main library:

create table t_normal(id int primary key, c int)engine=innodb;/*Q1*/
create temporary table temp_t like t_normal;/*Q2*/
insert into temp_t values(1,1);/*Q3*/
insert into t_normal select * from temp_t;/*Q4*/

If operations on the temporary table were not recorded, the standby database would only receive the binlog entries for two statements: create table t_normal and insert into t_normal select * from temp_t. When the standby database executed insert into t_normal, it would report the error “table temp_t does not exist”.

You may ask: what if binlog is set to row format? In row format, the binlog entry for insert into t_normal records the data of the operation; that is, the write_row event records the logic “insert a row of data (1,1)”.

That’s true. If binlog_format=row, statements related to temporary tables are not recorded in binlog. In other words, operations on temporary tables are recorded in binlog only when binlog_format=statement/mixed.

In this case, the statement that creates the temporary table is passed to the standby database for execution, so the synchronization thread of the standby database creates the temporary table. The main database deletes the temporary table automatically when the thread exits, but the synchronization thread on the standby database keeps running. Therefore, a drop temporary table must be written on the main library and sent to the standby library for execution.

Question: when MySQL records binlog, create table and alter table statements are recorded as is, without even changing the spaces. But if you execute drop table t_normal, the statement recorded in binlog is:

DROP TABLE `t_normal` /* generated by server */

That is, it has been rewritten into a standard format. Why is this done?

  • The drop table command can delete multiple tables at once. For example, in the example above with binlog_format=row, if “drop table t_normal, temp_t” is executed on the main library, only the following can be recorded in binlog:
    DROP TABLE `t_normal` /* generated by server */

Because table temp_t does not exist on the standby database, the command must be rewritten before it is transferred to the standby database for execution; otherwise the standby synchronization thread would stop.

Therefore, when the drop table command is recorded in binlog, the statement must be rewritten. “/* generated by server */” indicates that the command was rewritten by the server.

Primary/standby replication raises one more problem: different threads on the main library can create temporary tables with the same name without trouble, but what happens when these statements are transferred to the standby library for execution?

For example, in the following sequence, instance S is the standby library of M.


The two sessions on the main library M each create a temporary table named t1, and both create temporary table t1 statements are transferred to the standby library S.

However, the log-applying thread on the standby database is shared; that is, the create statement is executed twice in the same applying thread. (Even with multithreaded replication enabled, both statements may be assigned to the same worker on the slave library.) So, will this cause the synchronization thread to report an error?

Obviously not, otherwise temporary tables would be a bug. In other words, when the standby thread executes these statements, it must treat the two t1 tables as two different temporary tables. How is this achieved?

When MySQL records binlog, it writes the thread ID of the main-library thread that executed the statement into binlog. In this way, the applying thread on the standby database knows the main-library thread ID of each statement, and uses this thread ID to construct the table_def_key of the temporary table:

  1. The table_def_key of session A’s temporary table t1 on the standby database is: library name + t1 + “server_id of M” + “thread_id of session A”;
  2. The table_def_key of session B’s temporary table t1 on the standby database is: library name + t1 + “server_id of M” + “thread_id of session B”.

Because the table_def_key values differ, the two tables do not conflict in the applying thread of the standby database.
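The sketch below models this idea: a single applying thread on the standby keeps both temporary tables apart because the key includes the thread ID taken from binlog. The tuple layout of the key, the ApplierThread class, the library name "test" and server_id 1 are illustrative assumptions; the thread IDs 4 and 5 reuse the earlier example.

```python
def temp_table_def_key(db, table, server_id, thread_id):
    """Hypothetical key layout: library name + table name + server_id + thread_id."""
    return (db, table, server_id, thread_id)


class ApplierThread:
    """Toy model of one standby thread applying binlog events from the main library."""

    def __init__(self):
        self.temp_tables = {}

    def apply_create_temporary(self, db, table, master_server_id, master_thread_id):
        # The main library's thread ID is read from the binlog event.
        key = temp_table_def_key(db, table, master_server_id, master_thread_id)
        if key in self.temp_tables:
            raise ValueError("temporary table already exists")
        self.temp_tables[key] = []
        return key


applier = ApplierThread()
key_a = applier.apply_create_temporary("test", "t1", master_server_id=1, master_thread_id=4)
key_b = applier.apply_create_temporary("test", "t1", master_server_id=1, master_thread_id=5)

print(key_a != key_b)            # True
print(len(applier.temp_tables))  # 2
```

If the key were built from the library name and table name alone, the second create would collide with the first in the same applying thread.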

Why can’t a temporary table be renamed with rename table?

  • In the implementation, the rename table statement looks for the file on disk according to the rule “library name/table name.frm”. However, the frm file of a temporary table is placed in the tmpdir directory, and its file name follows the rule “#sql{process ID}_{thread ID}_{serial number}.frm”, so the error “file not found” is reported.

This work is released under a CC agreement. Reprints must indicate the author and the link of this article.
