Hello, I’m here again. It’s almost the end of the year. As an ambitious farmer, I want to save a year-end summary for myself. Simply this time, we will share the core and the most difficult to understand content in the database, that is, index.
In this blog, I will talk about my own views on index structure, and share how to understand index structure from scratch, layer by layer, and finally from the top.
Multi page mode
In the multi page mode, MySQL can finally complete the storage of multi data, which is to open up a new page, put multiple pieces of data in different pages, and then use the data structure of linked list to connect each page. Then consider the fourth question:Does multi page query affect query efficiency?
The effect of multi page mode on query efficiency
In response to this question, since it is asked, the answer is yes. Multi page will have a certain impact on the query efficiency. The impact is mainly reflected in that multi page is also a linked list structure in nature. As long as it is a linked list structure, the query efficiency will not be high.
Assuming that there are many more data, the database will open up many new pages, and these new pages will be linked together like a linked list. When we want to query a piece of data in so many pages, it will traverse from the node to the page where the data we want to find exists. We have managed to optimize the query efficiency of the data in the page through the page directory. Now There are also linked lists with pages as the unit. Isn’t that a waste of all previous efforts?
How to optimize multi page mode
As the multi page mode will affect the efficiency of query, there must be a way to optimize the query in the multi page mode. I believe that some students have guessed out that since we can use page directory to optimize the data area within the page, we can also take a similar way to optimize this multi page situation.
Yes, the intra page data area and the multi page mode are essentially linked lists, so they can be optimized in the same way. They are directory pages.
So we compare the data area in the page to analyze how to optimize the multi page structure. On a single page, weThe directory entry of the page directory is used to point to a row of data, which is the minimum data existing in the directory entry. Then you can find the required data through the page directory.
So for multi page structure, you can also use this method, using a directory item to point to a page, and this directory item stores the index value of the minimum data stored in this page. What’s different from the page catalog is that,The level of this directory management is page, and the level of page directory management is row.
In this case, the structure of our multi page mode is as shown in the following figure:
There is a directory page to manage the page directory. The data in the directory page is the smallest data in the page to which it points.
One thing to note here is: ActuallyThe essence of a directory page is also a page. The data stored in a normal page is project data, while the data stored in a directory page is the address of a normal page.
If we want to find the data with id = 19, then according to the previous search method, we need to start from the first page. If we find that it doesn’t exist, we need to go to the second page to find the data with id = 19 until we find the fourth page. However, if there is a directory page, we can use id = 19 to compare it with the data stored in the directory page. It is found that 19 is larger than any one of the data, so it’s better Enter the page with id = 16 to search, and then directly search through the page directory row level data within the page. Soon, you can find the data with ID 19. With more and more data, the efficiency of this structure is more and more obvious compared with the common multi page mode.
Returning to the main topic, I believe that the students who have a better understanding of MySQL have found out that the final picture we drew is an index structure in MySQL – B + tree.
Introduction of B + tree
We can form the following picture by macroscoping the multi page pattern map of the existing catalog page that we have drawn:
This is a B + tree formed from simplicity to complexity. This is a MySQL B + tree, an index structure of MySQL. Each node can be understood as a page, and the leaf node is the data page, and the node other than the leaf node is the directory page.
It can also be seen in the figure that non leaf nodes only store index, while only leaf nodes store real data, which also conforms to the characteristics of B + tree.
Advantages of B + tree
Because the leaf node stores all the data and has pointer connection, each leaf node is logically connected, so it is more friendly to range search.
All the data of B + tree is on the leaf node, so the query efficiency of B + tree is stable, generally three queries.
B + tree is good for database scanning.
B + tree is good for disk IO, because its layer height will not increase due to data expansion (the three-layer tree structure can store about 20 million data).
The complete structure of the page
After finishing the concept of page and how to combine the structure called B + tree step by step, I believe that everyone has a clear understanding of page, so here we will start to talk about the official concept. Based on what we said above, giving a complete page structure is also a supplement to our understanding of page structure.
The above figure shows the page data structure. The file header field is used to record the header information of the page. The more important fields are fil? Page? Prev and fil? Page? Next fields. Through these two fields, we can find the previous page and the next page of the page. In fact, all pages can form a two-way linked list through two fields.
The page header field is used to record the status information of the page. The next two pseudo line records, the infimum record and the supremum record, are smaller than any primary key value in the page, and the supremum record is larger than any primary key value in the page. The pseudo record forms the boundary of the record in the page.
User records stores the actual data row records. The specific row record structure will be detailed in the second section of this article. Free space stores free space. Deleted Row records will be recorded as free space. Page directory records information related to binary lookup. File tracker stores data such as checksums used to check data integrity.
Other knowledge points of MySQL based on B + tree
See here, we have learned that MySQL starts from a single piece of data to reduce the number of disk IO through the page, and implements the page directory in the page to optimize the query efficiency in the page, then uses the multi page mode to store a large number of data, and finally uses the directory page to achieve the query efficiency of the multi page mode and form the index structure we call the B + tree. Now that we have said that, let’s talk about other knowledge points of MySQL.
Clustered index and non clustered index
The so-called clustered index is to put the index and data together and find the data when finding the index. The B + tree index we just saw is a kind of clustered index, while the non clustered index is to separate the data and index. When searching, you need to find the index first and then find the corresponding data through the index return table. InnoDB has and only has one clustered index, while MyISAM has non clustered indexes.
The left most prefix matching principle of union index
In MySQL database, not only a certain column can be indexed, but also a joint index can be established for multiple columns. There is a concept of the leftmost prefix matching principle in the joint index. If we understand the leftmost prefix matching principle based on B + tree, it will be relatively easy.
First, we build a joint index based on the table at the beginning of the article:
create index idx_obj on user(age asc,height asc,weight asc)
We have learned that the data structure of the index is a B + tree, and that one of the factors of the query efficiency of B + tree optimization is to sort the data. When we create the index of IDX obj, it is equivalent to creating a B + tree index, which isSort by members of the union indexThis is age, height, weight.
The students who read my previous blog know that as long as a primary key is defined in InnoDB, the primary key column will be used as a clustered index, and other indexes will be used as non clustered indexes, so naturally, this index will be a non clustered index.
So based on these, we can draw a conclusion:
IDX obj: the index will be sorted according to age, height and weight
IDX obj is a non clustered index, which needs to be returned to the table when querying
According to these two conclusions, the first thing we need to know is how to sort?
Single column sorting is very simple, compared with size, everyone can, butWhat is the principle of multi column sorting?
In fact, in mysql, there is a principle for the sorting of joint indexes. Compare the sizes from left to right. Take the index just established as an example. He will first compare the size of age. If the size of age is the same, then compare the size of height. If the size of height is not the same, then compare the size of weight, and finally sort the index.
Then we can draw a B + tree according to this order. It’s not as detailed as the above picture. Let’s simplify it:
Note: at this time, because it is a non clustered index, the leaf node no longer has data, but stores a primary key index, and finally queries the data back and forth through the primary key index.
With the structure of B + tree, we can understand the principle of leftmost prefix matching.
Let’s write a query first
SELECT * FROM user WHERE age=1 and height = 2 and weight = 7
There is no doubt that this statement will follow the index idx obj.
Let’s look at another statement:
SELECT * FROM user WHERE height=2 and weight = 7
Think about it. Will this SQL go through index?
The answer is no, so the direction of our analysis is why this statement does not follow the index.
In the above, we mentioned a multi column sorting principle, which is to compare from left to right and then sort. Our IDX [obj] index is age, height, weight from left to right, so when we use height and weight as query criteria, because of the lack of age, we cannot compare from age.
You may have questions when you see some friends here, thenCan’t we just compare height with weight?Obviously not. Let’s give you an example,If we write a question mark for the missing column, then the query condition of this statement will become 27. Then we start from the root node of the B + tree in this lesson. There are 127 and 365 on the root node. If we compare height and weight, we must go on the 127 side. But what if the missing column number is greater than 3?For example, 427527627, if the index is used to query the data, it will lose the data and make a wrong query.So in this case, it is absolutely impossible to query by index.This is the cause of the leftmost prefix matching principle.
- For the leftmost prefix matching principle, MySQL will match all the way to the right until it encounters a range query (>, <, between, like). For example, a = 3 and B = 4 and C > 5 and d = 6. If the index of (a, B, C, d) order is established, D cannot use the index. If the index of (a, B, D, c) is established, the order of a, B, D can be adjusted arbitrarily.
- =And in can be out of order. For example, a = 1 and B = 2 and C = 3 can be used to establish (a, B, c) indexes in any order. MySQL query optimizer will help you optimize the indexes into recognizable forms.
According to what we know, we can draw a conclusion:
As long as you can’t sort and compare the sizes, you can’t go through the union index.
Here are a few more statements:
SELECT * FROM user WHERE age=1 and height = 2
This statement can go through IDX obj index because it can pass comparison (12? < 365).
SELECT * FROM user WHERE age=1 and weight=7
This statement can also use the ind_obj index, because it can also use the left subtree through comparison (1? 7 < 365), but in fact, weight does not use the index, because according to the leftmost matching principle, if two pages of age are equal to 1, then the height will be compared, but the height is not used as the query criteria here, so MySQL will load these two pages into memory for the most After the weight field comparison, scan the query.
SELECT * FROM user where age>1
This statement does not walk index, but it can walk index. What does that mean? This SQL is very special. Because it has indexes that can be compared, it can also query results by index. However, because this situation is range query and full field query. If it goes by index, it also needs to return to the table. The MySQL query optimizer will think that the efficiency of index walking is lower than that of full table scanning. Therefore, MySQL will optimize it and let it scan the whole table directly 。
SELECT * FROM user WEHRE age=1 and height>2 and weight=7
This statement can be indexed because it can be compared by age, but weight does not use index because height is range finding. Similar to the second statement, if the height of two pages is greater than 2, MySQL will load both pages of data into memory, and then match the correct data by weight.
Why does InnoDB have only one clustered index instead of using clustered indexes for all indexes?
Because clustered index is to store the index and data in the leaf node. If all indexes use clustered index, each index will save a copy of data, which will cause data redundancy. In the case of large amount of data, this kind of data redundancy is very resource consuming.
Two additional points about index
These two points are also missing from the last blog about index. Here we add them.
1. What happens when an index is explicitly created, but it fails to pass the index during execution?
Popular science time:The query optimizer can have different execution schemes for a query of SQL statement. As for the final scheme, it is necessary to select the scheme with the lowest execution cost through the optimizer.
Before a single table query statement is actually executed, the MySQL query optimizer will find out all possible schemes for executing the statement, and then find out the scheme with the lowest cost after comparison. The lowest cost solution is the so-called execution plan.
The optimization process is as follows:
1. Find all possible indexes based on search criteria
2. Calculating the cost of full table scanning
3. Calculate the cost of executing queries with different indexes
4. Compare the cost of various implementation schemes and find the one with the lowest cost.
According to the non clustered index of the table we just mentioned, this statement is due to the function of the query optimizer, resulting in no index walking:
SELECT * FROM user where age>1
2. In the case of sparse index, it is usually necessary to query data through the pointer back to the table of the leaf node. In what case, it is not necessary to return to the table?
Popular science time: covering index refers to that the execution of a query statement can only be obtained from the index without reading from the data table. It can also be called index coverage.
When a query statement meets the condition of covering index, MySQL only needs to return the data needed by the query through the index, which avoids the operation of returning to the table after the index is found and reduces I / O to improve efficiency.
For example, table
covering_index_sampleThere is a common index in
idx_key1_key2(key1,key2)。 When we use SQL statements:
select key2 from covering_index_sample where key1 = 'keytest';At that time, you can query by overwriting the index without returning the table.
SELECT age FROM user where age = 1
There is no need to query the return table.
This article focuses on the index structure of MySQL, builds a B + tree index slowly from scratch, and according to this process discusses how the B + tree optimizes the query efficiency step by step.
To sum it up briefly:
Sorting: the essence of optimizing query. Sorting during insertion is actually to optimize query efficiency.
Page: used to reduce IO times, but also can use the principle of program locality to slightly improve query efficiency.
Page directory: used to avoid the weakness of linked list and scan the linked list during query.
Multi page: when the amount of data increases, a new page will be opened to save the data.
Directory page: “special page directory”, where the data stored is the address of the page. When querying, you can quickly locate the page through the directory page to avoid multi page scanning.