Background of the problem
The order table in MySQL (InnoDB) must be paged through in chronological order, but its primary key does not increase with time. The table holds more than a million rows. How can this requirement be met efficiently?
Note: This article is not mainly about how to build indexes. The analysis below assumes that an appropriate index already exists.
Preliminary scheme 1
As is well known, MySQL supports paging queries with limit offset, pageSize:
select * from `order` where user_id = XXX and [other business conditions] order by created_time, id limit offset, pageSize
Because created_time may contain duplicates, id is added to the order by clause to make the ordering deterministic.
This scheme shows no problems while the table is small. Once the order table grows to the hundred-thousand level and later pages are queried, execution slows noticeably, to around 100 ms. With millions of rows, a query takes seconds. At tens of millions of rows, the time becomes completely unacceptable (I have investigated slow queries of this kind in production).
Why does scheme 1 perform so poorly on large tables? We can reason about how MySQL executes this query.
Suppose we have built a composite index on user_id, created_time, and [other business conditions]. To fetch records 100000 through 100049, MySQL cannot jump straight to the Nth record, because its index is a B+ tree rather than an array that supports random access. It has to do a lot of work to locate position N, and the larger N is, the greater the cost.
Leaving the details of the B+ tree aside, we can also see this from the SQL that counts the total number of rows in a table:
select count(1) from `order`
If the Nth record could be located efficiently, this count could also be computed efficiently. In practice, counting the rows of a large table is very slow (this article only considers InnoDB).
The root cause of the inefficiency of scheme 1 is that locating the offset is too expensive and the ordering of the index is not fully exploited.
The defining feature of an index (B+ tree) is that its data is ordered. Finding the Nth record is relatively slow, but finding the position of a given value in the index is very fast (that is exactly the problem an index is designed to solve).
Let's change our approach. We still fetch 50 records at a time, but after the first fetch, each subsequent fetch continues from where the previous one ended. This way we take advantage of the ordering of the index described above.
Let's first look at an example of paging in id order:
select * from `order` where id > 'pre max id' order by id limit 50
The first query needs no condition, and each subsequent query passes in the maximum id returned by the previous query. A simple analysis shows that MySQL first locates the position of 'pre max id' (ids are ordered, so this lookup is very fast) and then reads 50 records from there. The whole process is very efficient.
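The loop described above can be sketched as follows. This is a minimal illustration, not the production query: it uses Python's sqlite3 with an in-memory database as a stand-in for MySQL, and the table name orders (chosen to avoid the reserved word order), the payload column, and the function names are all assumptions made for the example.

```python
import sqlite3

def iter_pages_by_id(conn, page_size=50):
    """Yield pages of (id, payload) rows in id order, seeking from the previous max id."""
    last_id = None
    while True:
        if last_id is None:
            # First query: no seek condition is needed.
            rows = conn.execute(
                "SELECT id, payload FROM orders ORDER BY id LIMIT ?",
                (page_size,),
            ).fetchall()
        else:
            # Later queries: continue right after the previous page's last id.
            rows = conn.execute(
                "SELECT id, payload FROM orders WHERE id > ? ORDER BY id LIMIT ?",
                (last_id, page_size),
            ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]

def demo():
    # Hypothetical data: 120 rows with non-contiguous ids.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(i * 3, f"order-{i}") for i in range(1, 121)],
    )
    return list(iter_pages_by_id(conn, page_size=50))
```

Each query only needs the last id of the previous page, so the caller keeps no offset and the cost of a page does not depend on how far into the table it is.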
Let's go back to the original problem: paging in chronological order when the primary key does not increase with time. Here we cannot use id as the paging condition, because paging by id is not chronological. We also cannot simply replace id with the time, because times may be repeated. Let's analyze it.
Suppose the last record on the previous page is id = DDD with created_time = 2019-02-03, and the record id = YYY has the same created_time but sorts just after it. If the next query uses created_time > '2019-02-03' as its condition, the record with id = YYY is skipped. Replacing the condition with created_time >= '2019-02-03' does not work either, because the record with id = DDD is returned again.
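The boundary problem can be reproduced with a few rows in plain Python. The rows below are hypothetical (only the ids DDD and YYY and the date 2019-02-03 come from the text), and list filtering stands in for the SQL where clause.

```python
from datetime import date

# Hypothetical rows (id, created_time), already sorted by (created_time, id).
rows = [
    ("AAA", date(2019, 2, 2)),
    ("DDD", date(2019, 2, 3)),  # last row of the previous page
    ("YYY", date(2019, 2, 3)),  # same created_time, should open the next page
    ("ZZZ", date(2019, 2, 4)),
]
last_id, last_time = rows[1]

# created_time > last_time: YYY is skipped.
strictly_after = [r[0] for r in rows if r[1] > last_time]

# created_time >= last_time: DDD is returned a second time.
at_or_after = [r[0] for r in rows if r[1] >= last_time]

print(strictly_after)  # ['ZZZ']
print(at_or_after)     # ['DDD', 'YYY', 'ZZZ']
```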
For the problem of missing or duplicate data, I have seen the following solution (scheme 2):
The query is divided into three cases
- First query: use created_time >= 'xxxx-xx-xx'; if there is no required start time, use no condition
select * from `order` where user_id = XXX and [other business conditions] and created_time >= 'xxxx-xx-xx' order by created_time, id limit pageSize
- If the previous query returned exactly pageSize records, query with both created_time and id, so that no data is missed when created_time is duplicated at the page boundary
select * from `order` where user_id = XXX and [other business conditions] and created_time = 'created_time of latest record' and id > 'id of latest record' order by created_time, id limit pageSize
- If the previous query returned fewer than pageSize records and it was a query of the second kind, use only created_time
select * from `order` where user_id = XXX and [other business conditions] and created_time > 'created_time of latest record' order by created_time, id limit pageSize
created_time must not be null. Comparing null with = or > yields null, so the matching rows would not be found. If null values exist, = and > need to be changed to is null and is not null respectively.
This method solves the problem of missing or duplicate data and performs well, but its disadvantages are obvious. The logic is too complex and is split across several different SQL statements, and page sizes are unstable: a page in the middle may contain fewer than pageSize records (if there are no duplicates at all, the number of queries doubles, and the extra ones return empty results). In fact, some problems still remain.
Further in-depth analysis
I searched the Internet, but only found material where the pagination order is id, using the id > 'pre max id' approach. Here the pagination order is created_time, which may be repeated. How can we write concise and efficient SQL for that?
If you want to be an excellent programmer, I think the ability to analyze and solve new problems is essential. Even when a solution can be found on the Internet, strong analytical ability helps you learn from it and adapt it to your own scenario to arrive at a better, tailored solution.
We have built an index on (user_id, created_time). We know that an InnoDB secondary index also contains the primary key, and the primary key is never repeated. This means the order of every record in the index is fully determined, with no ties.
The order we want to paginate in is the same as the order of this index. We only need to fetch data along the index in batches, which is a very direct use of the index. Why can't we do that?
If I were a designer of MySQL, how would I support this very common and direct requirement? Or would I leave it unsupported?
Let me give an example: the sorted TreeSet in Java. I would guess that it must have floor and ceiling methods (which return the nearest element in the set that is not greater than, or not less than, the given element). Any data structure based on sorting should include them. If it did not, it would have been criticized until they were added.
Back to the index: such a direct requirement ought to be supported. So the question becomes: what syntax can be used to traverse in the order of the composite index (user_id, created_time, id)?
At this point I recalled having used the combined form (a, b) in ((1,2), (3,4), (7,4)) before, and guessed that it might also support greater-than and less-than comparisons. I ran it in MySQL to verify:
select (3,7)>(3,7), (3,6)>(3,7), (3,8)>(3,7), (4,7)>(3,7), (4,2)>(3,7);
Returns: 0 0 1 1 1
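For readers without a MySQL instance at hand, the same ordering rule can be tried with Python tuples, which also compare element by element from left to right. This is only an analogy for non-null values, not MySQL itself.

```python
# The same five comparisons as in the MySQL statement, using Python tuples.
pairs = [(3, 7), (3, 6), (3, 8), (4, 7), (4, 2)]
results = [int(p > (3, 7)) for p in pairs]
print(results)  # [0, 0, 1, 1, 1]
```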
With this, the problem becomes as simple as id > 'pre max id'.
This syntax is also described in the official documentation, where it is called "row comparisons".
Seeing this, you may be as happy and excited as I was at the time: a perfect solution right in front of you. However, the MySQL optimizer is not as smart as we might think. Faced with a "row comparison", it turns foolish and fails to use the index well.
So I went back and tested the equivalent form of the "row comparison":
(a,b)>(x,y) is equivalent to a>x or (a=x and b>y)
It turns out that this form, which looks more complex and even contains or, uses the index very well and is highly efficient. Even when (a, b, c) > (x, y, z) is expanded into a much more complex equivalent form:
a>x or (a=x and (b>y or (b=y and c>z)))
it can still use the index very well. At this point I really don't know whether to praise it or scold it. Alas.
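The equivalence of the expanded forms can be checked by brute force. The sketch below compares both expansions against Python's tuple comparison over a small range of integers; the function names are made up for the example, and null values are deliberately left out, since they behave differently in SQL.

```python
from itertools import product

def expanded_gt2(a, b, x, y):
    # Expanded form of (a, b) > (x, y).
    return a > x or (a == x and b > y)

def expanded_gt3(a, b, c, x, y, z):
    # Expanded form of (a, b, c) > (x, y, z).
    return a > x or (a == x and (b > y or (b == y and c > z)))

def check_equivalence(values=range(3)):
    """Return True if both expansions agree with tuple comparison for all inputs."""
    for a, b, x, y in product(values, repeat=4):
        if expanded_gt2(a, b, x, y) != ((a, b) > (x, y)):
            return False
    for a, b, c, x, y, z in product(values, repeat=6):
        if expanded_gt3(a, b, c, x, y, z) != ((a, b, c) > (x, y, z)):
            return False
    return True

print(check_equivalence())  # True
```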
As for index use with "row comparisons", the official site does cover it: when the row constructor does not cover the index well enough, it suggests rewriting the comparison in the ordinary, non-constructor form. So it seems they have their own difficulties.
With the a > x or (a = x and b > y) form, which is equivalent to the row comparison and does use the index correctly, we can write efficient and concise SQL (scheme 3):
select * from `order` where user_id = XXX and [other business conditions] and (created_time > 'created_time of latest record' or (created_time = 'created_time of latest record' and id > 'id of latest record')) order by created_time, id limit pageSize
This method works the same way as paging in id order: the first query simply drops the combined condition. The code is slightly more complex, but fortunately it can use the composite index. It is very efficient, its running time is stable, and performance does not degrade as the traversal approaches the end.
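A runnable sketch of this paging loop is shown below. It again uses Python's sqlite3 as a stand-in for MySQL, so it demonstrates correctness of the traversal (no gaps, no repeats), not MySQL's index behavior. The table name orders, the index name, the sample data, and the function names are assumptions for the example. The id is added to the index explicitly here, whereas an InnoDB secondary index carries the primary key implicitly.

```python
import sqlite3

PAGE_SQL_FIRST = """
    SELECT id, created_time FROM orders
    WHERE user_id = ?
    ORDER BY created_time, id LIMIT ?
"""
PAGE_SQL_NEXT = """
    SELECT id, created_time FROM orders
    WHERE user_id = ?
      AND (created_time > ? OR (created_time = ? AND id > ?))
    ORDER BY created_time, id LIMIT ?
"""

def fetch_all_pages(conn, user_id, page_size):
    """Walk all rows of one user in (created_time, id) order, page by page."""
    pages, last = [], None
    while True:
        if last is None:
            rows = conn.execute(PAGE_SQL_FIRST, (user_id, page_size)).fetchall()
        else:
            last_id, last_time = last
            rows = conn.execute(
                PAGE_SQL_NEXT,
                (user_id, last_time, last_time, last_id, page_size),
            ).fetchall()
        if not rows:
            return pages
        pages.append(rows)
        last = rows[-1]  # (id, created_time) of the last row on this page

def build_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id TEXT PRIMARY KEY, user_id INTEGER, "
        "created_time TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX idx_user_time ON orders (user_id, created_time, id)")
    # Hypothetical data: ids are not in time order, five rows share each date.
    rows = [(f"id{(i * 7) % 23:02d}", 1, f"2019-02-{1 + i // 5:02d}") for i in range(23)]
    rows.append(("other", 2, "2019-02-01"))  # a row for a different user
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return conn
```

With a page size of 4 and five rows per date, every page boundary in this data falls inside or next to a run of equal created_time values, which is exactly the case that created_time > or >= alone handles incorrectly.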
Unfortunately, the most elegant form ran into a dim-witted optimizer. Expressing the query precisely with the dedicated syntax the database supports (narrower in scope, more fixed in pattern) should make it easier for the optimizer to recognize the intent and execute it with the optimal plan. The result, however, is disappointing.
I hope MySQL will support "row comparisons" better in the future (8.0.19 still has the problem).
created_time must not be null here, because when a null value takes part in a > or = comparison the result is null, so the condition does not hold and the matching rows are not returned.
If null values exist, some adjustment is needed. If the created_time of the last record in the previous batch is null (null is treated as the smallest value in the index), the condition can be changed as follows:
(created_time is not null or (created_time is null and id > 'id of latest record'))
The index can still be used, so the paging query remains efficient.
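The null-aware variant can be sketched as below, once more with sqlite3 standing in for MySQL. Both sort nulls first in ascending order, but everything else here (the table orders, the sample rows, the function names) is an assumption for the example. The condition is chosen per page, depending on whether the last row of the previous page had a null created_time.

```python
import sqlite3

def next_page(conn, last, page_size):
    """last is None for the first page, else (id, created_time) of the previous last row."""
    base = "SELECT id, created_time FROM orders"
    tail = " ORDER BY created_time, id LIMIT ?"
    if last is None:
        sql, args = base + tail, (page_size,)
    elif last[1] is None:
        # Previous page ended inside the null block: take the remaining nulls
        # (by id), followed by every non-null row.
        sql = (base + " WHERE (created_time IS NOT NULL"
               " OR (created_time IS NULL AND id > ?))" + tail)
        args = (last[0], page_size)
    else:
        # Regular case: null rows sort first and were already returned.
        sql = base + " WHERE (created_time > ? OR (created_time = ? AND id > ?))" + tail
        args = (last[1], last[1], last[0], page_size)
    return conn.execute(sql, args).fetchall()

def demo_null_paging(page_size=2):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_time TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(5, None), (2, None), (9, None),
         (4, "2019-02-03"), (1, "2019-02-03"), (7, "2019-02-04")],
    )
    seen, last = [], None
    while True:
        rows = next_page(conn, last, page_size)
        if not rows:
            return seen
        seen.extend(rows)
        last = rows[-1]
```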
For small tables, scheme 1 is simple and convenient. It only needs the page number and page size, and it can jump to any page at random.
Scheme 2 and scheme 3 offer excellent performance and stability on large tables. However, they cannot jump to an arbitrary page, and they require the sorting fields of the previous page's last record to be passed in. This drawback can often be lived with: many pages are simply scrolled down one page at a time, such as microblog feeds and friend-circle updates, and batch processing of a whole table needs no random jumps either.
Attentive readers may notice that the where clause contains "other business conditions". Can the index still be used normally? Will a full table scan occur? In fact, this problem can be avoided. If I have time, I will write another article about a case where the execution plan is not entirely reliable.
Note: the execution plan sometimes does not reflect actual execution behavior, so I have not posted execution plans. The MySQL versions I used were 5.7.23 and 8.0.19.
I worked out the form used in scheme 3 myself and did not find similar material on the Internet, so it is something of a unique secret technique. I also think the thinking process in "Further in-depth analysis" is valuable. Developing this habit of thought helps with innovation, with solving problems others have not encountered, with knowing where to look for answers in unfamiliar territory, and with finding better ways to solve old problems.
If this article is helpful or valuable, please give me a like, so I will be more motivated to share my experience
When reprinting, please credit the source and author (https://www.cnblogs.com/trytocatch/p/mysql-page-query.html by trytocatch)