An index is similar to the table of contents of a book: it is a data structure that helps us quickly locate a single record or a range of records in a large amount of data. Ordered arrays and search trees can both serve as indexes. MySQL has three major kinds of index: the B+ tree index, the hash index, and the full-text index. The B+ tree index is by far the most important; hash and full-text indexes are used much less often. InnoDB does not support user-created hash indexes, although the storage engine uses hashing internally when locating data pages; that is not the focus of this article. This article gives a brief introduction to the B+ tree index.
Basic structure of the B+ tree
This article does not give a precise definition of the B+ tree. It goes straight to a schematic diagram and explains it.
Figure 1: Basic structure of a B+ tree (image from the web)
A B+ tree is a multiway balanced search tree. Every node is called a page. A page is a block of data with a fixed size, 16 KB by default in InnoDB. The data in a page is a set of keys: n keys divide the key space into n + 1 intervals, and each interval has a pointer to a lower-level node. Pages on the same level are connected in a doubly linked list, and the keys within a level are ordered. Take disk block 1 as an example. It holds two keys, 17 and 35, which define three intervals: (-∞, 17) with pointer P1, [17, 35) with pointer P2, and [35, +∞) with pointer P3. The number of child pointers is called the fan-out. The keys in the node that P1 points to are all less than 17; the keys in the node that P2 points to are greater than or equal to 17 and less than 35; the keys in the node that P3 points to are greater than or equal to 35.
The process of finding data in a B+ tree:
For example, to find 28 in the tree above, first go to disk block 1 and use binary search to determine which interval 28 belongs to. This yields P2, which leads to disk block 3. A binary search among the keys of disk block 3 again yields P2, which leads to disk block 8, where the key 28 is found. In a database, a key lookup must always end at a leaf node, because only leaf nodes contain row records or primary key values.
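The lookup described above can be sketched in a few lines of Python. This is only an illustration, not InnoDB's implementation: the `Page` class, the `leaf` helper, and every key other than 17, 35, and 28 are assumptions made up for the example, because the original figure is not reproduced here.

```python
from bisect import bisect_left, bisect_right


class Page:
    """One B+ tree page: internal pages have children, leaf pages have rows."""

    def __init__(self, keys, children=None, rows=None):
        self.keys = keys          # sorted keys stored in this page
        self.children = children  # child pages (None for a leaf)
        self.rows = rows          # row payloads (leaf pages only)


def leaf(*keys):
    """Hypothetical helper: build a leaf page with a dummy row per key."""
    return Page(list(keys), rows=[f"row-{k}" for k in keys])


def search(root, key):
    """Return (row or None, number of pages visited)."""
    page, visited = root, 1
    while page.children is not None:
        # n keys define n + 1 intervals; bisect_right picks the interval
        # so that a key equal to a separator goes to the right-hand child.
        slot = bisect_right(page.keys, key)
        page = page.children[slot]
        visited += 1
    i = bisect_left(page.keys, key)  # binary search inside the leaf
    if i < len(page.keys) and page.keys[i] == key:
        return page.rows[i], visited
    return None, visited


# Made-up tree whose root holds the keys 17 and 35 from the example.
root = Page([17, 35], children=[
    Page([8, 12], children=[leaf(3, 5), leaf(8, 9), leaf(12, 15)]),
    Page([26, 30], children=[leaf(17, 20), leaf(26, 28), leaf(30, 33)]),
    Page([65, 87], children=[leaf(35, 60), leaf(65, 70), leaf(87, 99)]),
])

print(search(root, 28))  # ('row-28', 3)
```

Every lookup visits one page per level, which is why the height of the tree determines the number of page reads.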
Inserting and deleting nodes:
We will not go through the detailed process here. Instead, here is a recommended tool: https://www.cs.usfca.edu/~gal… It animates B+ tree insertion and deletion and is very intuitive, so give it a try. A screenshot is shown below:
Figure 2: Screenshot of the B+ tree demo tool
Note that inserting a node may cause a page split, and deleting a node may cause a page merge. A page split divides one page into several pages when it cannot hold a new key. A page merge combines a page with an adjacent page into a new page when deletions reduce the number of keys in it below a certain threshold. Inserting into a full page does not always cause a split: an operation similar to a rotation is tried first, to avoid wasting space.
The following figure shows the simplest page split. Suppose a page can hold only three keys. When the keys E, F, G are inserted, the leaf page can no longer hold the new data, so it is split into two pages and a new level is added.
Figure 3: Page splitting demo
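A leaf split can be sketched as follows. This is a simplified textbook split, assuming a page capacity of three keys as in the example. The function name `insert_into_leaf` is hypothetical, and InnoDB's real split logic is more involved.

```python
from bisect import insort

PAGE_CAPACITY = 3  # assumption taken from the example: three keys per page


def insert_into_leaf(leaf_keys, key):
    """Insert key into a sorted leaf page.

    Returns (pages, separator). If the key fits, pages holds one page and
    separator is None. Otherwise the page is split in two and separator is
    the first key of the right page, which the parent must record.
    """
    keys = list(leaf_keys)
    insort(keys, key)  # keep the page sorted
    if len(keys) <= PAGE_CAPACITY:
        return [keys], None
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    return [left, right], right[0]


print(insert_into_leaf(["E", "F"], "G"))       # still fits in one page
print(insert_into_leaf(["E", "F", "G"], "H"))  # overflows and splits
```

If the split page was the root, the separator needs a new parent page, which is how the tree gains a level.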
The most notable characteristic of the B+ tree used for database indexes is its high fan-out: a single page can hold many keys. The benefit is a small tree height, typically two to four levels. The smaller the height, the fewer I/O operations a lookup needs.
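A rough back-of-the-envelope calculation shows why two to four levels are enough. The byte sizes below are assumptions chosen for illustration (about 14 bytes per internal entry, about 1 KB per row); page headers and fill factor are ignored, so real capacities are somewhat lower.

```python
PAGE_SIZE = 16 * 1024      # InnoDB default page size in bytes
INTERNAL_ENTRY_BYTES = 14  # assumption: one key plus one child pointer
ROW_BYTES = 1024           # assumption: average size of a full row

FANOUT = PAGE_SIZE // INTERNAL_ENTRY_BYTES  # child pointers per internal page
ROWS_PER_LEAF = PAGE_SIZE // ROW_BYTES      # rows per leaf page


def max_rows(height):
    """Upper bound on the rows a tree with `height` levels can hold."""
    return FANOUT ** (height - 1) * ROWS_PER_LEAF


for h in (2, 3, 4):
    print(h, max_rows(h))
```

Under these assumptions a three-level tree already holds roughly twenty million rows.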
Why B+ trees
- Why not use an ordered array?
An ordered array can be searched with binary search in O(log n) time, but insertions and deletions are too expensive. For example, deleting the element at position 0 requires moving elements 1 through n-1 forward, which costs O(n).
- Why not use a hash table?
Hash tables are used inside the storage engine, but when we build our own indexes we usually do not create hash indexes (InnoDB does not support them).
A hash table is a very efficient structure. For example, insertion, lookup, and deletion in Java's HashMap can be treated as O(1).
Underneath a hash table is an array. When data is inserted, its hash code is taken modulo the array length to determine its position in the array, and it is placed there. Of course, that position may already be occupied. This is called a collision, or hash conflict. It can be resolved with separate chaining (the "zipper" method): a linked list is created at the conflicting position. In the figure below, the three items B, C, and D collide at position 1, so a linked list is formed there. Lookup is just as easy: compute the position the same way as for insertion, then check whether the data is at that position.
Figure 4: Hash table with separate chaining
One disadvantage of a hash table is its poor support for range queries. For example, to find the data in [F, K], you have to enumerate every value between F and K, compute each hash code, and probe the hash table one value at a time. A hash table is also unordered, so it is unfriendly to ORDER BY. A hash table is therefore unsuitable as an index unless all your queries are equality lookups.
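The following toy hash table illustrates both separate chaining and the range-query problem. It is a sketch under stated assumptions: the class name `ChainedHashTable` is hypothetical, keys are small integers (so that Python's `hash` is deterministic), and the table has four slots so that 1, 5, and 9 collide at position 1.

```python
class ChainedHashTable:
    """Toy hash table that resolves collisions with separate chaining."""

    def __init__(self, size=4):
        self.buckets = [[] for _ in range(size)]

    def _slot(self, key):
        # Hash code modulo the array length gives the position.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        chain = self.buckets[self._slot(key)]
        for i, (existing, _) in enumerate(chain):
            if existing == key:
                chain[i] = (key, value)  # overwrite an existing key
                return
        chain.append((key, value))       # collision: extend the chain

    def get(self, key):
        for existing, value in self.buckets[self._slot(key)]:
            if existing == key:
                return value
        return None

    def range_query(self, low, high):
        """Probe every candidate key in [low, high]; there is no shortcut."""
        found, probes = [], 0
        for key in range(low, high + 1):
            probes += 1
            value = self.get(key)
            if value is not None:
                found.append((key, value))
        return found, probes


table = ChainedHashTable(size=4)
for k in (1, 5, 9, 2):
    table.put(k, f"row-{k}")

print(table.buckets[1])         # the chain formed by 1, 5 and 9
print(table.range_query(1, 9))  # nine probes to find four rows
```

The range query costs one probe per possible key, not per stored key, which is why hash tables suit equality lookups only.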
- Why not use a binary search tree?
Whether it is a plain unbalanced binary search tree, an AVL tree, or a red-black tree, it is still a binary search tree: for any node, its left child (if any) is smaller than the node itself, and its right child (if any) is larger.
The drawback of a binary search tree is that its height grows quickly with the number of nodes. A database index is very large and cannot be loaded entirely into memory. The root node may be kept in memory, but the other nodes are stored on disk, so each step down the tree during a search requires a disk read. Disk reads are relatively slow, so we need fewer of them, which means a shorter tree. With a lot of data a binary search tree becomes very tall, so a lookup needs many disk I/Os and is inefficient.
In addition, the database stores data in pages. By default an InnoDB page is 16 KB. A page can be regarded as a node, and a node in a binary tree stores only one item. If the indexed field is an int, a 4-byte number would occupy a whole 16 KB page, which is extremely wasteful.
- What are the characteristics of a B+ tree?
High fan-out. High fan-out lets a node hold more data, so the whole tree is shorter and wider. A tree in InnoDB is 2 to 4 levels high, which means one lookup needs only 1 to 3 disk I/Os when the root page is already in memory.
- Non-leaf nodes store only key values (that is, column values), so a page can hold more entries. This is what guarantees the high fan-out.
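To make the contrast between a binary tree and a high fan-out tree concrete, the sketch below counts how many levels of branching are needed to address a given number of keys. The function name `levels_needed`, the key count of ten million, and the fan-out of 1000 are assumptions for illustration only.

```python
def levels_needed(key_count, fanout):
    """Smallest number of branching levels with fanout ** levels >= key_count."""
    levels, reachable = 0, 1
    while reachable < key_count:
        reachable *= fanout
        levels += 1
    return levels


KEYS = 10_000_000  # hypothetical number of keys

print(levels_needed(KEYS, 2))     # perfectly balanced binary tree
print(levels_needed(KEYS, 1000))  # tree with an assumed fan-out of 1000
```

Even a perfectly balanced binary tree needs two dozen levels here, and each level below the cached ones can mean a disk read, while the high fan-out tree needs only three.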
A clustered index is also called a cluster index or primary key index. Its distinctive feature is that its leaf nodes contain the row data (a full row of the table). In other words, in the InnoDB storage engine the table data lives inside the index: the table is an index-organized table. Obviously the table data cannot exist in multiple copies, but there must be one copy, so a table has exactly one clustered index.
On which column is the clustered index created?
On the primary key column. If you specify a primary key for a table, a clustered index is created on it. Every InnoDB table must have a primary key. If none is specified, InnoDB picks a unique column that is NOT NULL as the primary key; if there is no such column, it implicitly creates a hidden column to serve as the primary key.
Suppose we have the following table, where a is the primary key and a page can hold only three entries.
Table 1: Sample data table
Let's see what its clustered index looks like.
Figure 5: Schematic diagram of the clustered index tree
Here r1 to r4 represent the rows numbered 1 to 4, respectively.
Benefits of using a clustered index:
- Fast queries: both equality and range queries are fast. Using any index makes a query more efficient, and a clustered index is faster than a nonclustered index because the data is found directly in the leaf node, with no need to go back to the table (explained later).
- Fast sorting by primary key (the clustered index): the data itself is already stored in primary key order.
Let's create a table and take a look.
The table creation statement and the initial data are as follows:
-- a is the primary key
create table t (
    a int not null,
    b varchar(600),
    c int not null,
    primary key(a)
) engine=INNODB;
insert into t values (1,'a',11), (2, 'b', 12), (3, 'c', 13), (4, 'd', 14);
Figure 6: Clustered index query plans
If you are not familiar with EXPLAIN, see the last reference at the end of the article.
The first query is an equality lookup on column a, and the second is an equality lookup on column c. The key column shows that the first query uses the clustered index, while the second does a full table scan because c has no index.
The third query sorts by a and the fourth sorts by c. Sorting by the primary key does not use filesort.
A nonclustered index is also called an auxiliary index, secondary index, or non-primary-key index. Any index created on non-primary-key columns is of this kind. Its distinctive feature is that its leaf nodes do not contain the full row data (imagine the cost if they did). Instead, they contain the primary key of the corresponding row.
As an example, for the table above we create an index on column b.
Note that we used only the first 10 characters of b to create the index, so the Sub_part column shows 10.
The index idx_b then corresponds to a B+ tree similar to the one below.
Figure 7: Nonclustered index tree
You can see that the 1, 2, 3, 4 in the leaf nodes are actually primary key values.
A lookup in a nonclustered index works as follows:
First, find the specified key in the nonclustered index tree and obtain the primary key stored with it. Then use that primary key to find the corresponding row in the clustered index.
Using the primary key to find the row in the clustered index is called going back to the table. Going back to the table can sometimes be avoided; see the covering index discussion later.
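The two-step lookup can be sketched with plain Python containers. This is a simplification, not how InnoDB stores data: a dict stands in for the clustered index, a sorted list of (b, primary key) pairs stands in for the leaf level of a secondary index on b, and the function names are hypothetical. The rows are the sample rows of table t.

```python
from bisect import bisect_left

# Clustered index: primary key a -> full row (a, b, c).
clustered = {1: (1, "a", 11), 2: (2, "b", 12), 3: (3, "c", 13), 4: (4, "d", 14)}

# Secondary index on b: sorted (b value, primary key) pairs.
idx_b = sorted((row[1], pk) for pk, row in clustered.items())


def primary_keys_for_b(value):
    """Step 1: search the secondary index and collect matching primary keys."""
    i = bisect_left(idx_b, (value,))  # first entry whose b is >= value
    keys = []
    while i < len(idx_b) and idx_b[i][0] == value:
        keys.append(idx_b[i][1])
        i += 1
    return keys


def rows_for_b(value):
    """Step 2: go back to the table, one clustered lookup per primary key."""
    return [clustered[pk] for pk in primary_keys_for_b(value)]


print(primary_keys_for_b("b"))  # [2]
print(rows_for_b("b"))          # [(2, 'b', 12)]
```

A query that needs only b and the primary key could stop after step 1, which is the idea behind the covering index.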
Benefits of using a nonclustered index:
- It takes less space than a clustered index, because its leaf nodes contain only the primary key, not the complete row data.
- Fast queries: this is similar to a clustered index, though it may be less efficient because of the back-to-table step.
The cost of going back to the table is that two index trees must be searched to find the data. As discussed later, not every query on a nonclustered index needs to go back to the table.
Let's look at several query plans.
Figure 8: Secondary index query plans
In the first plan, key is idx_b, which shows that the nonclustered index is used. The Extra column reflects index condition pushdown, an optimization available since MySQL 5.6. In short, while the index is being read, entries that do not satisfy the WHERE condition are filtered out directly.
The second plan sorts by the nonclustered index column. It uses filesort because the rows cannot be sorted directly from the index and the query has to go back to the table.
The third plan is similar to the second, but it selects only column b and does not use filesort. Because there is no need to go back to the table, this query actually uses a covering index.
A composite index (also called a joint or multi-column index) is an index that contains multiple columns. In this case, each key of the B+ tree consists of several parts instead of a single value.
Continuing with the example above, we build a composite index on columns b and c.
Figure 9: Creating a composite index
The index tree might look like this:
Figure 10: Schematic diagram of the composite index tree
The figure is not perfect. In reality, the values of the second column on their own may be out of order within a page, because entries are sorted by the first column and only then by the second.
Each key consists of two column values, and the leaf nodes also contain the primary key, which shows that this composite index is a nonclustered index. Of course, a primary key index can also span multiple columns, in which case it too is a composite index.
What a composite index is good for:
- The index can serve queries and sorts on its leftmost columns (the leftmost-prefix principle).
-- assume the index idx_b does not exist
select * from t where b='a';
select * from t where b='a' and c=11;
- When the left column is matched by equality, sorting by the next column is cheap, because that column is already sorted.
-- assume the index idx_b does not exist
select * from t where b='a' order by c;
- The index contains more data, which helps with covering indexes. Once a column is part of an index, the index tree must contain that column's values.
For a string column, the leftmost-prefix principle also applies: like '%a' cannot use the index, but like 'a%' can.
Note that the statement below cannot use the index:
select * from t where c=11;
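The leftmost-prefix principle follows from the sort order of the index entries, as the sketch below tries to show. A sorted list of (b, c, primary key) tuples stands in for the leaf level of a composite index on (b, c). The function names are hypothetical, and the extra row ('a', 15, 5) is an assumption added so that c on its own is visibly unsorted.

```python
from bisect import bisect_left

# Leaf-level entries of a composite index on (b, c): (b, c, primary key).
idx_b_c = sorted([
    ("a", 11, 1), ("b", 12, 2), ("c", 13, 3), ("d", 14, 4),
    ("a", 15, 5),  # hypothetical extra row
])


def seek_by_b(b):
    """Leftmost column given: binary search jumps to the first match."""
    i = bisect_left(idx_b_c, (b,))
    hits, examined = [], 0
    while i < len(idx_b_c) and idx_b_c[i][0] == b:
        examined += 1
        hits.append((idx_b_c[i][1], idx_b_c[i][2]))  # (c, primary key)
        i += 1
    return hits, examined


def scan_by_c(c):
    """Leftmost column missing: entries are not ordered by c, so scan all."""
    hits = [pk for (_, entry_c, pk) in idx_b_c if entry_c == c]
    return hits, len(idx_b_c)


print([entry[1] for entry in idx_b_c])  # c alone is not sorted
print(seek_by_b("a"))                   # matches come back ordered by c
print(scan_by_c(12))                    # every entry had to be examined
```

Within one value of b the entries are already ordered by c, which is why `where b='a' order by c` needs no extra sort, while a condition on c alone gets no help from the ordering.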
Here are some query plans.
First, let's take a look at the indexes on the table.
You can see that we have built the composite index idx_b_c on columns b and c.
Figure 11: Composite index query plans
Query 1: the condition contains the leftmost column, b, so the index is hit.
Query 2: the condition does not contain the leftmost column. The key column shows NULL, so the index is not used, and type is ALL, which means a full table scan.
Query 3: an equality match on the leftmost column followed by a sort on the column to its right. The index is hit.
Query 4: the index is not hit and filesort is used.
These four queries show what the leftmost-prefix principle of a composite index means. Given the tree structure of the composite index described above, the principle follows naturally.
"Covering" means "containing": a covering index is an index that contains all the data you need.
A clustered index contains the row data directly, so it is a covering index by definition, although it is not usually described that way. A nonclustered index contains the values of the indexed columns (this is not exactly true, as explained later). A covering index is not a new index structure. It simply means that the data you are looking for is already in the index tree, so the query does not need to go back to the table. (The leaf nodes of a nonclustered index hold only the primary key and the indexed column values. If other column values are needed, a second lookup through the clustered index is required, which is going back to the table.)
When a covering index is used, the Extra column of the query plan shows Using index.
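Whether an index covers a query can be reduced to a set check, sketched below. The function `is_covering` is hypothetical, and the sketch assumes that the index stores full column values (no prefix indexes) and that every secondary index carries the primary key columns.

```python
def is_covering(index_columns, primary_key_columns, needed_columns):
    """True if every needed column is stored in the secondary index.

    A secondary index holds its own columns plus the primary key columns,
    so anything outside that set requires going back to the table.
    """
    available = set(index_columns) | set(primary_key_columns)
    return set(needed_columns) <= available


# Index on c, primary key a, as in table t.
print(is_covering(["c"], ["a"], ["a", "c"]))  # no back-to-table lookup needed
print(is_covering(["c"], ["a"], ["b"]))       # b is not in the index
```

The check treats the primary key as part of every secondary index, which matches the leaf-node layout described above.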
Here are some specific examples.
The current indexes are as follows:
Some execution plans:
Figure 12: Covering index execution plans
The index on c contains the values of column c and of the primary key column, so the first and second queries do not need to go back to the table and use a covering index.
The index on c does not contain column b, so looking up column b through the index on c requires going back to the table.
In the fourth query there is an index on column b, and the value of b is in that index. The index covers the queried column, so a covering index is also used.
One thing to remember: the primary key columns are covered by every index.
While testing I ran into an odd behavior worth sharing: when a varchar column is longer than 767 bytes and you create an index on it, the index key is automatically truncated, as shown in the figure:
Figure 13: Automatic truncation of a long varchar index
Here is something to think about. If the index key is only part of the column, for example the column is varchar(100) and the index contains only the first 50 characters, can the index still act as a covering index?
Running show index from a table displays a Cardinality column, which is a way to measure how effective an index is. It is the number of distinct values in the indexed column. Cardinality divided by the number of rows in the table is called selectivity. The higher the selectivity, the better. When selectivity is below 30%, the index is generally considered poorly chosen.
Cardinality is a sampled estimate: several pages are chosen at random, the average number of distinct records per page is computed, and the result is multiplied by the number of pages. So you may see a different value each time, even if the table has not been updated.
The value is not recalculated on every table update. It has its own recalculation strategy.
Executing any of the following statements causes the value to be recalculated. This behavior can be configured off:
- analyze table
- show table status
- show index
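The two calculations described in this section can be sketched as follows. This mirrors the simplified description in the text, not InnoDB's actual statistics code: the function names, the sample size, and the page contents are assumptions, and each page is modeled as a plain list of key values.

```python
import random


def selectivity(cardinality, row_count):
    """Cardinality divided by the number of rows; higher is better."""
    return cardinality / row_count if row_count else 0.0


def estimate_cardinality(leaf_pages, sample_size, rng=None):
    """Sample a few pages, average their distinct counts, scale by page count."""
    if not leaf_pages:
        return 0
    rng = rng or random.Random()
    sample = rng.sample(leaf_pages, min(sample_size, len(leaf_pages)))
    avg_distinct = sum(len(set(page)) for page in sample) / len(sample)
    return round(avg_distinct * len(leaf_pages))


# Hypothetical leaf pages of an index, three keys per page.
pages = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
estimate = estimate_cardinality(pages, sample_size=2)
print(estimate, selectivity(estimate, 12))
```

Because the pages are picked at random, the estimate can change between runs whenever pages differ in how many distinct values they hold, which matches the behavior described above.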
B+ tree index management
- Create the index when creating the table
create table t4 (
    id int primary key,
    a int not null,
    key(a)
);
- Create by altering the table
alter table t4 add index idx_a (a);
- Create with create index
create index idx_a on t4(a);
- Drop by altering the table
alter table t4 drop index idx_a;
- Drop with drop index
drop index idx_a on t4;
- View the indexes of a table
show index from t4;
Thoughts on indexes
When learning B+ tree indexes, the most fundamental thing is to understand the structure of the various index trees, so that you "have the tree in your mind". Then, when we see an optimization strategy, we can tell why it works. Based on our understanding of the index structure, we can even come up with optimization strategies of our own (new to us, though others may already have written about or used them). For example, we know that every nonclustered index leaf node contains the primary key, so the primary key should be as small as the business requirements allow, which reduces the space used by every index. Likewise, the data type of each column should be as small as possible.
The road to mastering indexes is long and hard. Keep going!
- MySQL Technology Insider: InnoDB Storage Engine