Why do databases need indexes?
We all know that the data of the database is stored on the disk. When our program starts, it is equivalent to a process running in the memory of the machine. Therefore, when our program wants to query data, we must come out of memory to the disk to find the data, and then write the data back to memory. However, the IO efficiency of disk is far lower than that of memory. The speed of searching data directly affects the efficiency of program operation.
The main purpose of database indexing is to use an appropriate data structure, which can improve the efficiency of querying data, reduce the number of disk IO, and improve the data search rate, rather than the global traversal.
Why does the index use the B + tree data structure?
If we simply want to find the data quickly, we feel that the hash table is the fastest. According to the key, hash to a slot and directly find the location of the data at one time. How fast. However, when doing business, we often only need one piece of data, and most of the requirements are to query part of the data according to certain conditions. At this time, hash display is not very appropriate.
Let’s consider trees, such as binary tree, balanced binary tree, red black tree, B tree, etc. they are binary search and find numbers quickly. However, whether it is balanced binary tree or optimized red black tree, in the final analysis, they are binary trees. When there are many nodes, their height will be high. I’ll find a data. If the root node is not, then go to the next layer. Before the next layer is available, I will go to the next layer. As a result, I may have to find a data several times, and each time I execute the disk IO, and the purpose of our index is to reduce the disk IO. This design is not good. Can we just reduce the height?
So let’s consider the B tree again. First, briefly introduce the data structure of B tree:
First, let’s look at the definition of B-tree.
- Each node can have up to M-1 keywords (key value pairs that can be stored).
- The root node can have at least 1 keyword.
- Non root nodes must have at least m / 2 keywords.
- The keywords in each node are arranged from small to large. All keywords in the left subtree of each keyword are smaller than it, while all keywords in the right subtree are larger than it.
- All leaf nodes are located in the same layer, or the length from the root node to each leaf node is the same.
- Each node stores indexes and data, that is, the corresponding key and value.
Therefore, the keyword quantity range of the root node is 1 < = k < = M-1, and the keyword quantity range of the non root node is m / 2 < = k < = M-1.
M here represents the order. The order represents the maximum number of child nodes of a node. Therefore, when describing a B tree, you need to specify its order.
Let’s give another example to illustrate the above concepts. For example, there is a 5th order B tree. The number range of root nodes is 1 < = k < = 4, and the number range of non root nodes is 2 < = k < = 4.
Next, we will explain the insertion process of B-tree through an insertion example, and then explain the process of deleting keywords.
When inserting, we need to remember a rule: judge whether the number of keys of the current node is less than or equal to M-1. If yes, insert it directly. If not, divide the key in the middle of the node into left and right parts, and put the middle node in the parent node.
Example: in a 5th order B-tree, a node has at most 4 keys and at least 2 keys (Note: the following nodes uniformly use one node to represent key and value).
Insert 18, 70, 50, 40
When inserting 22, it is found that the keyword of this node is greater than 4, so it needs to be split. The splitting rules have been described above. After splitting, it is as follows.
Then insert 23, 25, 39
Split and get the following.
Therefore, the number of nodes in each layer of the B tree will become more. With the same amount of data, the B tree will be lower than the binary tree, and the IO times will become less, so it meets our index requirements. Why did MySQL finally choose the B + tree? Where is the better place than the B tree?
Let’s first look at the differences between B + tree and B tree:
- The leaf node of the B + tree contains all the key values of the tree. Non leaf nodes do not store data, only indexes, and the data is stored in the leaf node. The B-tree is that each node stores indexes and data.
- Each leaf node of the B + tree contains pointers to adjacent leaf nodes, and the leaf nodes themselves are linked from small to large according to the size of keywords.
As shown in the figure:
First point: when non leaf nodes only store index keys but not data, the occupied space of non leaf nodes can be reduced, and nodes with the same capacity can store more indexes. If it is also a three-tier B + tree, the order will become more and more data will be stored than B tree.
Second point: the leaf node of the B + tree contains pointers to adjacent leaf nodes. To understand the benefits of this pointer, we first know that when reading data from the disk, it is often not read strictly on demand, but will be read in advance every time. Even if only one byte is required, the disk will start from this position, read a certain length of data backward in sequence and put it into memory. The theoretical basis for this is the famous locality principle in Computer Science:
- When a data is used, the nearby data is usually used immediately.
- The data required during program operation is usually concentrated.
The length of the preview is generally an integral multiple of the page. Page is the logical block of computer management memory. Hardware and operating systems often divide the main memory and disk storage area into consecutive blocks of equal size. Each storage block is called a page (in many operating systems, the size of the page is usually 4K). The main memory and disk exchange data in page units. When the data to be read by the program is not in the main memory, a page missing exception will be triggered. At this time, the system will send a disk reading signal to the disk. The disk will find the starting position of the data, read one or more pages backward continuously, load them into the memory, and then return to the exception, and the program continues to run.
Now let’s look at the pointer of the leaf node of the B + tree. We can see its usefulness. During pre reading, we can ensure that the data read continuously is in order.
Some students may have mentioned the B * tree, which is based on the B + tree and adds linked list pointers to non leaf nodes. Personally, I don’t think it’s necessary to use the B-STAR tree. We don’t store data in non leaf nodes. Data is in leaf nodes, and the linked list pointer can’t be used in non leaf nodes.
Some fancy concepts
Clustered index and non clustered index: we mentioned above that the leaf node of the B + tree stores the data of the index key, but different MySQL engines have different choices for storing data. MyISAM stores the index file and the real data file in two files. The data stored in the index file is the address value of the data corresponding to the index key in the data file, InnoDB stores the formal data in the leaf node. Therefore, clustering and non clustering distinguish whether the data stored in leaf nodes is real (can be understood as whether leaf nodes are crowded?)
Back to table: back to table is also simple, but you must first understand the primary key index and ordinary index. The leaf node we mentioned above stores real data, which is only stored in the primary key index. The data stored in the ordinary index is the key of the primary key index. Then we can understand. For example, I now build a common index for the name field of a table. I want to select * from table where name = ‘test’. When we find the test node, the key we get is only the primary key corresponding to this row of data. To get the data of the whole row, we can only take this key and go to the primary key index tree again. This operation is called back to the table.
Leftmost matching principle: when we create a new composite index, such as (name + age), when we use where name = XX and age = XX, the composite index will be taken, while where age = XX and name = XX will not be taken. This is because MySQL’s rule for creating a federated index is to first sort the leftmost first field of the federated index, based on the sorting of the first field, and then sort the second field.
The above is the details of the advantages of MySQL using B + tree as an index. For more information about the advantages of MySQL using B + tree as an index, please pay attention to other developeppaer articles!