MySQL indexing principle

Time:2021-12-30

Before we start, let’s introduce oneData structure visualization website, which can visually operate various data structures. Easy to understand the underlying principles.

Meaning of index

When we were in primary school, we all had the experience of using a dictionary. To find a word, we usually look it up on the index page in front of the dictionary through pinyin or radical. Find the page number of the associated word, and then jump to the corresponding page number to find it. It greatly increases the efficiency of query.

The scenario of using a dictionary reflects the idea of indexing. In short, index is the “intermediate” data organized according to a certain structure in order to improve the retrieval efficiency when retrieving specific data in a large amount of data. When we query, we first check the “intermediate” data, then get the “address” of the final target data according to the “intermediate” data, and go to this address to get the data we are looking for.

The same is true when we use MySQL database. If there are millions or even tens of millions of data in the table, if there is no index, we want to retrieve a specific row of data, which will be extremely inefficient.

The essence of inefficiency: a lot of IO operations! Here we must note that if the data is in memory, the simple comparison operation is very fast. The problem is that each query needs to load the disk data into memory for calculation, which is the root of using the index!

In MySQL database,IndexesIt helps MySQL get data efficientlyPut it in orderofdata structure

  • The index satisfies a data structure
  • The index itself is sorted
  • Indexes are essentially data

Data structure selection of index

Binary tree

For example, look at the following table, with two columns of fields:

MySQL indexing principle

Binary tree index png

Suppose we need to use the second field as the query condition to query the first data satisfying ‘col2 = 89’, as shown in the following statement

select * from t where col2=89 limit 1;
  • Assuming that there is no index, MySQL will search one by one and compare whether the value of col2 is 89 until it matches the row of ‘col2 = 89’. This example will compare 6 times to complete the query.
  • Suppose an index is established for the col2 column, and the index structure is an ordinary binary tree, that is, put the data of col2 into the binary tree structure, Each node of the binary tree stores a key value pair, the key stores the value of the field, and the value stores the corresponding disk storage address of the row record (for example, the address corresponding to the first row is 0x07). When searching again, the database will start from the root node of the index data structure according to the value brought in by the query criteria. The value corresponding to the root node is 34 < 89, so continue to query the right child node. The first right child node is 89, which meets the query criteria. Take out the value corresponding to the node (i.e. storage address), go to the disk to get the data of this row. This time, only two comparisons are made, that is, the query is completed.

The above example shows that using binary tree as the index can often speed up the retrieval efficiency. But we stillNoTake binary tree as the index data structure type of MySQL, because it has fatal defects. For example, we use a binary tree to index the first column (i.e. the self growing ID primary key field) to see what problems will occur. This can be found inData structure visualization websiteAfter the simulation operation, the following index data will be obtained:

MySQL indexing principle

Defects of binary tree

You can see the problem at a glance. Due to the characteristics of binary tree structure, if the child node is larger than the parent node, it is on the right of the parent node; Otherwise, it is on the left of the parent node. We found that the index data structure of binary tree eventually evolved into a linked list for unilateral data such as ID. when searching according to this index, the essence is to search one by one.

Red black tree

From the above counterexample, we can see that the ordinary binary search tree can degenerate into a linked list in extreme cases, and the efficiency of addition, deletion and query will be relatively low. In order to solve this situation, there are some self balanced search trees, such as AVL, red black tree and so on. By defining some properties, these self balanced search trees control the height difference between the left and right subtrees of any node within the specified range to achieve the equilibrium state.
Taking the red black tree as an example, the red black tree realizes self balance through the following property definitions:

Nodes are red or black.
The root is black.
All leaves are black (leaves are nil nodes).
Each red node must have two black child nodes. (there cannot be two consecutive red nodes on all paths from each leaf to the root.)
All simple paths from any node to each leaf contain the same number of black nodes (black height for short).

The specific principle of red black tree is not discussed here. Just remember that it will automatically balance the height difference between the left and right subtrees of nodes, so the above extreme situation will not occur. Friends interested in specific principles can view this post:This article takes you to thoroughly understand the red and black tree

Here, we can see how the red black tree solves the index problem of self growing fields through visual display:

MySQL indexing principle

Red black tree

Through the above animation demonstration, you can find that when we try to continuously add right sub nodes to the node, the red black tree will automatically balance the height of the left and right sub trees of the node through rotation. In this way, it solves the above problem that in extreme cases, the binary tree degenerates into a linked list.

So the red and black tree is perfect? In fact, it also has defects. The essence of red black tree is still the category of binary tree, so when there are more data in the table, the height of the tree is relatively large. For example, there are 1 million pieces of data in the table. If we index the ID, the height of the tree is about 20. (calculation method of tree height: 2 ^ n = 100W, where n is the tree height). At this time, the efficiency of query is relatively poor, because most of the data is at the bottom of the tree. A query from the root node to the target node often goes through more than ten comparisons, that is, more than ten disk IO operations. Note that disk IO operations are very time-consuming, which greatly affects the query Performance.

B tree / B + tree

MySQL indexes support two data structures. B + tree is one of them and is the most commonly used.

From the defects of the above binary tree, it is easy to think that if we want to control the query efficiency of the tree structure (for example, the disk IO times of a query are controlled within 5 times), we must control the height of the tree. When the total amount of table records remains unchanged, to control the height of the tree, we need to store more index elements in one node, which is the B tree.

MySQL indexing principle

B tree
  • The leaf node has the same depth, and the pointer of the leaf node is null
  • All index elements are not duplicate
  • The data indexes in the node are arranged incrementally from left to right

B-tree has been able to better solve the above problems, but we don’t intend to discuss B-tree too much here. Because the MySQL R & D engineers who keep improving further optimize the B tree and get the B + tree, which is the index structure of MySQL.

MySQL indexing principle

B + tree
  • The leaf node contains all index fields and is arranged according to the values of the index fields
  • Non leaf nodes do not store data, only store indexes (redundant), and can put more indexes
  • Leaf nodes are linked with pointers to improve the performance of interval access

Let’s explain the above three properties of B + tree in detail:
Since the data capacity that each node can store is limited, the default setting of this capacity limit is 16K, which can be viewed through the following statement:

SHOW GLOBAL STATUS like 'Innodb_page_size';

Since non leaf nodes do not store data, they can store more index elements. Between each two index elements, there is also a pointer to the child node (the occupied size is 6 bytes). All index elements of the child node pointed to by the pointer are ≥ the index element on the left of the pointer and < the index element on the right of the pointer.
Now review the above example. Suppose we use bigint to store ID fields. Bigint accounts for 8 bytes, plus 6 bytes for each pointer, the storage occupied by an ID index element is 14 bytes, and a non leaf node can store up to 16kb / 14b = 1170 entries.
For leaf nodes, there is no pointer to the next level, and 6 bytes are omitted, but data is stored in it. According to different MySQL storage engines, this data may store the disk storage address of the row where the index data is located, or it may be a collection of other segments of the row where the index is located. (as for the storage engine, it will be introduced below.) assuming that the data of an index accounts for 1K, a leaf node can store about 16 index fields. When the tree height is three layers, the total number of indexes that the B + tree structure can store is 1170117016=21902400。 Now, the structure of the B + tree is really perfect. The three-tier height can accommodate tens of millions of data indexes. You can also see that the default size of each node is 16K, which is also very appropriate.

The advantage of pointer linking between leaf nodes is that when we perform interval search, the efficiency will be greatly improved. For example, we need to execute the following statements

select * from t where id>15 and id<30;

MySQL will start the query from the root node, find the first node and find the record with ‘id = 18’ that meets the conditions, then jump to the second node through the pointer and continue to find the record with ‘id = 20’ that meets the conditions. Imagine that if the leaf nodes are not connected by pointers, you have to go back to the parent node or even the root node of the previous level to continue searching.

Let’s take a visual look at the process of B + tree index creation:

MySQL indexing principle

B + tree

Let’s find the sixth row of records according to the index established by the B + tree:

select * from t where col1=6;

At this time, 6 will be used to compare with the root node. Note that the elements of the root node are often pre loaded into memory. After comparison, it is found that 6 is larger than all the elements of the root node, so it will directly jump to its last child node, that is, load the data of the third child node into memory for comparison, and you can quickly locate the corresponding data. (note here that it is impossible to load all index data into memory in advance in any index structure form).

The above example is relatively simple and the table data is relatively small, but the reason is the same. Even if the table data reaches tens of millions, the tree height can still be controlled at 3-5 layers by using the index structure of B + tree, soat mostOnly 4-5 IO operations are needed to retrieve the required records.

Hash table

MySQL index supports two data structures, hash is one of them, and it has its unique advantages in some special scenarios.

Hash (hash, hash) function:Is to map arbitrary length data to finite length fields. Intuitively, it is to hybridize a string of data m and output another piece of fixed length data h as the feature (fingerprint) of this piece of data.
That is, no matter how large the data block m is, its output value H is a fixed length. For example, MD5 and Sha, which are commonly used, are hash functions.

Hashtable Is based on the set hash function H (key) and conflict handling methods map a group of keywords to a limited address interval, and take the image of the keywords in the address interval as the storage location recorded in the table. This table is called hash table or hash, and the resulting storage location is called hash address or hash address. As a linear data structure, hash table is undoubtedly a comparison of search speed compared with tables and queues Fast one.

When we build a hash structure index on a table field, MySQL will perform hash operation on the value of the field and save the obtained hash value together with the disk address of the row record to the hash mapping table:

MySQL indexing principle

Hash

Due to the particularity of hash index structure, its retrieval efficiency is very high. The retrieval of index can be located at one time. Unlike b-tree index, which needs multiple IO accesses from root node to branch node and finally to page node, the query efficiency of hash index is much higher than that of b-tree index.

However, hash index also has its defects:

  • The hash index can only satisfy “=”, “in” and “< = >” queries, and cannot use range queries.
  • Hash indexes cannot be used to avoid sorting data.
    Because the hash value after hash calculation is stored in the hash index, and the size relationship of the hash value is not necessarily the same as the key value before hash operation, the database cannot use the index data to avoid any sort operation;
  • Hash indexes cannot be queried with partial index keys
    For the composite index, when calculating the hash value, the hash index calculates the hash value together after combining the composite index keys, rather than calculating the hash value separately. Therefore, when querying through one or more index keys in front of the composite index, the hash index cannot be used.
  • Hash indexes cannot avoid table scanning at any time
    Hash index is to store the hash value of the hash operation result and the corresponding row pointer information in a hash table after the index key passes the hash operation. Because different index keys have the same hash value, even if the number of records of data satisfying a hash key value is taken, the query cannot be completed directly from the hash index, You should also access the actual data in the table for corresponding comparison and get the corresponding results.
  • When a large number of hash values are equal, the performance of hash index is not necessarily higher than that of BTREE index
    For index keys with low selectivity, if a hash index is created, there will be a large number of record pointer information associated with the same hash value. In this way, it will be very troublesome to locate a record, which will waste multiple access to table data, resulting in low overall performance.

Storage engine

The data in MySQL database is finally stored on disk in the form of files. Like other files on our disk, whether word files or text files need a storage method. This requires a storage engine whose function is to organize files on disk according to certain rules.

MySQL supports a variety of storage engines, of which MyISAM and InnoDB are the most common.

Here, for example, two tables are created. One table is called teacher and uses MyISAM storage engine; A table named student uses InnoDB storage engine; We can go to the MySQL installation directory to see the files corresponding to the two tables:

# ls -l
total 144
-rw-rw---- 1 mysql mysql     65 Jan 21 02:27 db.opt
-rw-rw---- 1 mysql mysql   8586 Jan 21 05:54 student.frm
-rw-rw---- 1 mysql mysql 114688 Jan 21 05:54 student.ibd
-rw-rw---- 1 mysql mysql      0 Jan 22 01:38 teacher.MYD
-rw-rw---- 1 mysql mysql   1024 Jan 22 01:38 teacher.MYI
-rw-rw---- 1 mysql mysql   8586 Jan 22 01:38 teacher.frm

You can see:

  • Using the table of MyISAM storage engine, three files are generated on disk: teacher frm,teacher. MYI,teacher. MYD;
  • Using the table of InnoDB storage engine, two files are generated on disk: student frm,student. ibd;

MyISAM

  • .frm: is the abbreviation of frame, which means “frame and structure”. This file stores the structure definition of the table;
  • .MYI: is the abbreviation of MyISAM index. This file stores the index data of the table;
  • .MYD: is the abbreviation of MyISAM data. This file stores the content data in the table;

Using MyISAM as the storage engine, the index file and data file are separated, so this index is also calledNonclustered index

MySQL indexing principle

MyISAM storage png

The data in the leaf node of the non clustered index stores the disk storage address of the row record, so the index data is relatively light, but the results cannot be obtained directly when querying according to the index. It is necessary to perform an additional IO operation according to the record address obtained by the index.

InnoDB

  • .frm: is the abbreviation of frame, which means “frame and structure”. This file stores the structure definition of the table;
  • .ibd: is the abbreviation of InnoDB data. This file stores the index and data of the table;

InnoDB table has the following properties:

  • The InnoDB table must have a primary key, and it is recommended to use an integer self incrementing primary key
  • The table data file itself is an index structure file organized by B + tree
  • Clustered index – leaf nodes contain complete data records
  • The leaf node of the auxiliary index structure stores the primary key value

Why must InnoDB tables have primary keys?Someone might say, “when I was creating a table, I didn’t specify a primary key, so I created it successfully.”. You may not know that if you do not manually create a primary key, MySQL will find a column with unique data as the primary key, or add a column primary key (you can’t see the primary key column automatically added by MySQL). And use this primary key field to organize the table data according to the B + tree:

MySQL indexing principle

primary key

Why integer autoincrement?Suppose you use a 32-bit UUID as the primary key. Let’s see the effect:

  • When querying in the B + tree, the size of the elements in the node is compared, and the string comparison is relatively slow. The string comparison is first converted to acsii code, and then the size is compared according to the sorting of acsii code. Using integer is different. The comparison of numbers is very fast, and integer takes up less bytes.
    Why self increase
  • When inserting in the B + tree, because the UUID is randomly generated, the newly generated UUID is often not the largest, so it is not appended to the tail, but is likely to be inserted into a leaf node in the middle. Then the leaf node may be full, and the subsequent leaf nodes need to be changed together, which has a large performance overhead. We use the visualization tool to demonstrate what happens when the inserted primary key is not self incrementing. Suppose the primary key value already has: 1, 2, 3, 4, 5, 6, 7, 9, 10, then insert a new primary key value: 8 to see what happens:
MySQL indexing principle

Primary key non self increasing defect

As can be seen from the above animation, the newly added primary key value 8 is inserted into the second position of the last leaf node according to the sorting rules of size. Since the storage upper limit of the leaf node is reached, the B + tree reconstructs the tree according to the internal optimization rules. Not only the layout of leaf sub nodes has changed, but also the height of the tree has changed, which is very performance consuming.

Indexes and data are organized in one file, so this kind of index is also calledClustered index。 The primary key index discussed above is a clustered index.
Since the clustered index operates in only one IBD file, it is more efficient.

Secondary index(secondary index) also known asSecondary indexorsecondary index , it is very similar to the primary key index. The only difference is that the data part of the leaf node stores the corresponding primary key value instead of the record value of the row.

MySQL indexing principle

Secondary index png

The leaf node of the secondary index has two advantages:

  • Save storage space. Only one copy of the data in the table needs to be saved in the primary key index;
  • It is convenient to maintain data consistency. When table data changes, just change the data in the primary key index;

The disadvantages of secondary indexes are also obvious:

  • You need to find the corresponding primary key value according to the secondary index, and then search in the primary key index according to the primary key value.

InnoDB support for hash

From the above hash index structure, we know that the values in the hash table store the disk storage address corresponding to the row record, so its index file and data file are separated. The InnoDB storage engine is designed to store indexes and data in the same file. Therefore, when InnoDB is used as the storage engine, we cannot create the index of hash structure. At this time, we can only use B + tree;

But as we said above, for some special scenarios, the performance of hash is better than that of B + tree. Can’t you have both fish and bear’s paw?
In fact, the InnoDB storage engine of MySQL supports hash index, but its hash index is adaptive, that is, InnoDB will automatically generate hash index for the table according to the usage of the table. This process cannot be considered as intervention.

Joint index

Unfinished to be continued…

Recommended Today

Redis featured Q & A

Redis data type type brief introduction characteristic scene String (string) Binary security It can contain any data, such as JPG pictures or serialized objects. One key can store up to 512M It can be used to do the simplest data. It can cache a simple string or a JSON format string. The implementation of redis […]