Why is paging slow when using limit and offset

Time:2020-8-17

Let’s start with a question

When I was in Tencent five years ago, I found that MySQL requests were very slow in paging scenarios. When the data volume is only 10W,select xx fromIt’s about two or three seconds for a single machine.

I asked my master why. He asked, “in the index scenario, what is the time complexity of getting the nth largest number in MySQL?”

The search for answers

Confirm scenario

Suppose there is an index on status.select * from table where status = xx limit 10 offset 10000

It’s going to be very slow. There is a delay of a few seconds when the amount of data is small.

Xiaobai answers

At that time, I felt very secure. My master took care of everything. Anyway, the technology was the worst in the group. I guessed a log (n), and I wanted to find a node that was not log (n). Naturally, the master asked me to study it myself.

This stage took 10 minutes.

Continue to answer

If you analyze it carefully, you will find it difficult to find it through the index. Because you don’t know the distribution of the first 100 numbers in the left subtree and the right subtree, it can’t take advantage of the binary tree’s search feature.

Through learning, we learned that the index of MySQL is B + tree.

When you look at this picture, you will suddenly see the light. The 100th largest tree can be found directly through the linked list of leaf nodes with O (n) complexity. But even o (n), it’s not so slow. Is there any reason.

This stage, mainly through the online search of information, intermittent use of 10 days.

Systematic learning

Here we recommend two books, one is “MySQL technology inside InnoDB storage engine”, through which you can have a deeper understanding of InnoDB implementation mechanism, such as mvcc, index implementation and file storage.

The second book is “high performance mysql”. This book starts from the level of use, but it is more in-depth, and mentions a lot of design ideas.

The combination of the two books, repeated understanding, MySQL is barely able to enter the house.

Here are two key concepts:

  • Cluster index: contains the primary key index and the corresponding actual data. The leaf node of the index is the data node
  • Secondary index: it can be understood as a secondary node. Its leaf node is also an index node, which contains the primary key ID.

Even if the first 10000 items are thrown away, MySQL will check the data in the cluster index through the primary key ID on the secondary index. This is 10000 random IO, which naturally slows down to husky.

Here, you may ask why this behavior occurs. This is related to the stratification of MySQL. Limit offset can only be applied to the result set returned by the engine layer. In other words, the engine layer was innocent, and he didn’t know the 10000 were going to be thrown away.

The following is a hierarchical diagram of MySQL. You can see that the engine layer and the server layer are actually separated.

Until this point, we probably understood the reason for the slow. This stage took a year.

grasp a typical example and you will grasp the whole category

At this time, I have been working for 3 years, and I have started to look at some source codes. After watching etcd, I read the source code of tidb. No matter what kind of database, in fact, the query of a statement is composed of logical operators.

Introduction to logic operators

Before writing the specific optimization rules, we will briefly introduce some logic operators in the query plan.

  • Datasource this is the data source, that is, the T in select * from t.
  • Selection, for exampleselect xxx from t where xx = 5Where filter conditions inside.
  • Projection projection,select c from tThe C column is a projection operation.
  • Join connection,select xx from t1, t2 where t1.c = t2.cIt is to join the two tables T1 and T2.

Selection, projection and join (SPJ) are the most basic operators. Among them, join has internal connection, left outer connection and right outer connection.

select b from t1, t2 where t1.c = t2.c and t1.a > 5After becoming a logical query plan, the datasource corresponding to T1 and T2 is responsible for retrieving the data.

Join the results of the two tables according to T1. C = T2. C, then make a selection filter according to T1. A > 5, and finally project column B.

The figure below shows the unoptimized representation:

Therefore, it is not that MySQL does not want to pass the limit and offset to the engine layer, but because of the division of logical operators, it is impossible to know how much qualified data the specific operator contains.

How to solve it

Two solutions are mentioned in high performance mysql

Scheme 1

According to the actual business needs, see if it can be replaced by the functions of the next page and the previous page, especially in IOS and Android. The previous complete paging is not common.

This is to say, replace limit, offset with > auxiliary index (i.e. search condition) ID. When the ID is called again, it needs to be returned to the front end.

Scheme 2

The front is just. This paper introduces a concept: index coverage: when the auxiliary index queries only the ID and the secondary index itself, then there is no need to query the clustered index.

The idea is as follows: select XXX, XXX from in (select id from table where second)_ Index = XXX limit 10 offset 10000) is to find the unique ID value of the database corresponding to the data from the conditional query. Because the primary key is on the secondary index, it is not necessary to return to the disk of the cluster index to pull it. Then query the cluster index through the 10 primary key IDs that have been limited. It’s only 10 random IOS.

When the business needs paging, the performance can be greatly improved by using this scheme. Generally, it can meet the performance requirements.

Write it at the end

Thank you very much for my master’s guidance in the three years before I graduated, and gave me a lot of patience. I was assigned reading tasks on holidays, inspected my learning progress during lunch break, and guided me to explore problems by asking questions. After I graduated from Tencent, every meeting gave me a lot of ideas, taught me to teach and solve doubts, and none of them failed to achieve the ultimate goal.

The above is the whole content of this article, I hope to help you in your study, and I hope you can support developeppaer more.