MySQL engine, index and algorithm
After mastering the basic CRUD operations of MySQL, the next thing we must learn is MySQL indexes.
In interviews, I like to ask about MySQL from the superficial to the deep, to find out how well the candidate has learned MySQL. The material in the previous article is too basic for me to ask about, so my first question is always about MySQL indexes.
On MySQL indexes, I ask the following questions:
- Do you know what algorithm InnoDB indexes use?
- Why does InnoDB use B+ trees instead of other data structures?
- In InnoDB, is a primary key required? What happens if you don’t specify a primary key when creating a table?
- What is the difference between a primary key and an index in InnoDB?
To answer these questions, we need some background knowledge: engines, indexes, and trees.
Background knowledge of MySQL index
The engine of MySQL
MySQL was designed from the start to allow different storage engines to be plugged in, and the core database algorithms are actually implemented by the engine. Early MySQL had three main engines:
MyISAM: the default engine before MySQL 5.5. Because it does not support transactions, it is almost never used in new systems.
InnoDB: the default engine since MySQL 5.5. If you do not know which engine to choose, choosing it is basically always right.
Memory: a special engine. All data it manages is kept in memory and never written to disk, so the data is lost when the database crashes or restarts. Since the rise of Redis, the Memory engine has declined.
Since almost all new systems use InnoDB, everything below describes the principles and behavior of the InnoDB engine unless otherwise specified.
“Page” in storage system
According to reference 1, the key features of the InnoDB engine include:
- Insert buffer
- Double write
- Adaptive hash index
- Asynchronous IO (async IO)
- Flush neighbor page
You can see that four of the five features are directly related to storage. If you have studied computer organization, you will know that, ordered by distance from the CPU, a computer’s storage consists of:
- Registers
- Cache
- Memory
- Hard disk
Of these, programmers generally do not need to care about registers, and a program cannot directly control the cache. For most programs, what matters is memory and the hard disk. When the operating system reads or writes memory and the hard disk, it basically works in units of “pages”.
Why operate in “pages”? Let’s look at memory first. Memory supports random access: whatever address the CPU wants to read, a single instruction fetches it. But what an application accesses is not physical memory but virtual memory, and the operating system can only map virtual memory in whole pages. So even when you are working only in memory, consciously try to align to pages.
The key point is the hard disk. There are two kinds: magnetic disks, which store data on magnetic media, and SSDs, that is, solid-state drives. I have previously written two articles on the principles and process of disk access, under the title “Notes on high performance disk I/O development”. In short, because of how the hardware works, reading and writing a hard disk has two characteristics:
- Slow: a magnetic disk has to rotate to the right position. An SSD is much better, but still not as fast as memory, since the data must travel over a long bus before it is loaded into memory
- Block-based: a page in software usually corresponds to a block in hardware. Both magnetic disks and SSDs modify data in units of blocks
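To make the page and block idea concrete, here is a minimal Python sketch. It assumes a 16 KiB page (InnoDB’s default page size; other sizes can be configured), and the helper names pages_touched and aligned_span are made up for this illustration. It shows that even a 10-byte access near a page boundary costs two whole pages.

```python
PAGE_SIZE = 16 * 1024  # assumption: 16 KiB pages (InnoDB's default size)


def pages_touched(offset, length, page_size=PAGE_SIZE):
    """Return the page numbers overlapped by bytes [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return range(first, last + 1)


def aligned_span(offset, length, page_size=PAGE_SIZE):
    """Widen a byte range to whole pages; returns (start_byte, size_in_bytes)."""
    pages = pages_touched(offset, length, page_size)
    if not pages:
        return offset, 0
    return pages.start * page_size, len(pages) * page_size


# A 10-byte read that straddles the boundary between page 0 and page 1.
print(list(pages_touched(16380, 10)))  # [0, 1]
print(aligned_span(16380, 10))         # (0, 32768)
```

The storage layer has to move 32 KiB to serve those 10 bytes, which is why data structures for disk try to keep related data inside one page.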
The principle of index
MySQL stores a lot of data. A single table is typically designed to hold millions of rows, which is no problem when the data structure is relatively simple and the index is efficient; in real use there are also tables with hundreds of millions of rows. It is like a library: we need to catalog and index every book so that when we search the catalog (the data), we can efficiently find what we need.
In essence, an index first solves the problem of fast lookup and fast modification. Second, it has to deal with the awkward process of writing to the hard disk, and throughout that process it must guard against crashes and downtime. After all, MySQL’s data consistency requirements are very high.
For MySQL, the data structures that deserve attention are the hash table, the B-tree, and the B+ tree.
I believe everyone understands hashing, so this article will not go over it again. A hash lookup has an average time complexity of O(1). Of the three main engines mentioned above, Memory is the only one whose indexes use hashing. So why don’t the other engines use it? Because the other engines have to consider persisting data to the hard disk.
Although hashing is simple, a practical hash table has to grow and shrink. When the table is expanded or shrunk, every element in the table may have to be rearranged. The Memory engine never writes to disk, so it does not care. But even memory is not suitable for storing large amounts of data. In practice, the use cases for the Memory engine have kept shrinking, and most of them have been taken over by Redis.
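The cost of resizing can be seen in a small sketch. This is a toy chained hash table written for illustration, not the Memory engine’s implementation; the class name, the initial capacity of 8, and the 0.75 load factor are all assumptions. The moved counter records how many entries had to be relocated by resizes.

```python
class ChainedHashTable:
    """Toy hash table with separate chaining that doubles when it gets full."""

    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]
        self.size = 0
        self.moved = 0  # entries relocated by all resizes so far

    def _index(self, key, capacity):
        return hash(key) % capacity

    def put(self, key, value):
        bucket = self.buckets[self._index(key, len(self.buckets))]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))
        self.size += 1
        if self.size > 0.75 * len(self.buckets):  # assumed load factor
            self._resize(2 * len(self.buckets))

    def get(self, key):
        for k, v in self.buckets[self._index(key, len(self.buckets))]:
            if k == key:
                return v
        raise KeyError(key)

    def _resize(self, new_capacity):
        old_buckets = self.buckets
        self.buckets = [[] for _ in range(new_capacity)]
        # Every existing entry is re-hashed into the new bucket array.
        for bucket in old_buckets:
            for k, v in bucket:
                self.buckets[self._index(k, new_capacity)].append((k, v))
                self.moved += 1


table = ChainedHashTable()
for i in range(100):
    table.put(i, i * i)
print(table.moved, len(table.buckets))  # 191 256
```

Inserting 100 rows relocated 191 entries in total. In memory that is cheap; if every relocation were a write to a different disk page, it would not be.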
The principle of the B-tree is relatively simple. It is a tree; what makes it special compared with the most basic tree structures is how its nodes split and merge, which mainly happens as the content of the database grows and shrinks. For the detailed process, readers can consult the many resources available online.
The characteristics of a B-tree are as follows:
- Each node can have multiple children; it is a multi-way tree, not a binary tree. Judged only by the number of comparisons, its queries are no faster than those of a binary tree.
- Data is stored on every node
MySQL does not use the B-tree, but MongoDB does. So the question here is: why use a B-tree instead of a red-black tree? Pay attention to this in interviews:
- First, every B-tree node has a fixed size. Engine designs take full advantage of this: combined with the “page” described above, each node is stored within a single page, which greatly improves hard disk access efficiency
- Second, inserting into a red-black tree frequently rotates nodes. After many rotations, neighboring nodes may end up scattered across multiple pages on the hard disk. When the data is written to disk, this greatly reduces efficiency and increases the risk of failure
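A rough calculation shows why a wide, page-sized node matters. The sketch below assumes a 16 KiB page, an 8-byte key, and a 6-byte child pointer, and ignores page headers and fill factor, so the numbers are only order-of-magnitude estimates. It compares how many levels (and therefore, in the worst case, how many page reads) are needed to reach one of 10 million keys.

```python
PAGE_SIZE = 16 * 1024  # assumption: 16 KiB page
KEY_BYTES = 8          # assumption: BIGINT-sized key
POINTER_BYTES = 6      # assumption: size of one child pointer


def levels_needed(num_keys, fanout):
    """Smallest number of levels such that fanout ** levels >= num_keys."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    levels, reachable = 1, fanout
    while reachable < num_keys:
        reachable *= fanout
        levels += 1
    return levels


# How many (key, pointer) pairs fit in one page under the assumptions above.
fanout = PAGE_SIZE // (KEY_BYTES + POINTER_BYTES)

print(fanout)                             # 1170
print(levels_needed(10_000_000, 2))       # 24 levels for a binary tree
print(levels_needed(10_000_000, fanout))  # 3 levels for a page-wide tree
```

If each level costs one page read, the binary tree needs up to 24 reads where the page-wide tree needs 3.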
Note that the B-tree is sometimes written as “B tree”, but in some articles “B tree” means a binary tree rather than a B-tree. Readers should tell the two apart from the context.
B+ tree
The B+ tree is the focus of this article, because it is the tree structure InnoDB uses. The schematic structure of a B+ tree is as follows:
[Figure: schematic structure of a B+ tree]
It looks like a B-tree, but there are two key differences:
- A B-tree stores data in branch nodes as well as leaf nodes. A B+ tree stores data only in leaf nodes; branch nodes keep only index keys. To find the data, you must reach a leaf node.
- Nodes of a B-tree have no relationship other than parent and child. The leaf nodes of a B+ tree, however, are connected in a doubly linked list. The advantage is that operations involving traversal, or OFFSET/LIMIT queries, become much more efficient
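The benefit of the linked leaves can be sketched in a few lines of Python. This is a simplified illustration rather than InnoDB’s data structure: the branch levels are replaced by a single sorted list of separator keys, and the names Leaf, build_leaves, and range_scan, as well as the leaf capacity of 4, are assumptions. Once the first matching leaf is found, the range scan only follows next pointers and never goes back up the tree.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass
class Leaf:
    keys: list
    values: list
    next: Optional["Leaf"] = None  # right sibling
    prev: Optional["Leaf"] = None  # left sibling


def build_leaves(items, leaf_capacity=4):
    """Pack sorted (key, value) pairs into leaves linked in both directions."""
    items = sorted(items)
    leaves = []
    for i in range(0, len(items), leaf_capacity):
        chunk = items[i:i + leaf_capacity]
        leaf = Leaf([k for k, _ in chunk], [v for _, v in chunk])
        if leaves:
            leaves[-1].next = leaf
            leaf.prev = leaves[-1]
        leaves.append(leaf)
    return leaves


def range_scan(leaves, low, high):
    """Return the values whose keys satisfy low <= key <= high."""
    if not leaves:
        return []
    # Stand-in for the branch levels: the first key of each leaf.
    separators = [leaf.keys[0] for leaf in leaves]
    start = max(bisect_right(separators, low) - 1, 0)
    leaf = leaves[start]
    result = []
    while leaf is not None:
        for key, value in zip(leaf.keys, leaf.values):
            if key > high:
                return result
            if key >= low:
                result.append(value)
        leaf = leaf.next  # move sideways along the leaf chain
    return result


leaves = build_leaves([(i, f"row{i}") for i in range(1, 21)])
print(range_scan(leaves, 7, 10))  # ['row7', 'row8', 'row9', 'row10']
```

In a plain B-tree the same scan would have to climb back to a parent node each time it finished a node, because siblings are not connected.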
Classification of InnoDB index
As mentioned earlier, InnoDB uses the B+ tree. The non-leaf nodes of the B+ tree store only the data structure’s own index entries, which are used to locate child nodes; these are not “database indexes”.
So the question is: what do the leaf nodes of InnoDB’s B+ tree store? This leads to the first classification:
Classification by storage content
Clustered index: this term has several different Chinese translations. A clustered index is an index whose leaf nodes store the complete row of MySQL data.
So what is used as the key to locate leaf nodes in this B+ tree? The answer is the primary key. In practice, a great many tables define the first column as a BIGINT, mark it AUTO_INCREMENT, and set it as the primary key. This is a very common and very safe approach. Querying directly by the auto-increment ID, or running greater-than or less-than range operations on it, is very efficient.
What happens if you don’t specify an auto-increment ID when creating a table? Does the B+ tree stop working?
For the MyISAM engine, a primary key is not required: if you do not specify one, there is none. In InnoDB, however, the clustered index needs a key. If no primary key is specified, InnoDB first looks for a unique index whose columns are all NOT NULL and uses it; if there is none either, it implicitly adds a hidden 6-byte row ID as the primary key. That ID is invisible, which makes related operations such as LAST_INSERT_ID() meaningless. So in practice we should still specify the primary key explicitly.
For InnoDB, the clustered index can be regarded as the primary key index.
Secondary index: the Chinese translations also vary, including “non-clustered index”, “auxiliary index”, and “secondary index”. The leaf nodes of a non-clustered index store the primary key of the corresponding row.
When an entry is found through a non-clustered index, that is, through a field other than the primary key, InnoDB obtains only two pieces of data: the value of the indexed column and the primary key. If the client asks for other columns, InnoDB must then look up the table’s clustered index. This step is called a “back to table” query.
Classification by composition logic
By composition logic, InnoDB indexes can be divided into:
- Primary key index: the clustered index described above
- Single-column index: the simplest form of non-clustered index
- Composite index (also called a joint or union index): as the name suggests, an index over multiple columns
- Unique index: a special case of the single-column and composite indexes. The difference is that, across the whole table, each value (or combination of values, for multiple columns) of the indexed columns may appear in at most one row
The composite index needs special explanation. I used to think a composite index meant indexing one field and then indexing the next field within the result. After consulting the material, I found that this is not the case.
When a composite index is created, the value of every field in the index appears in the index’s data structure. I think this article explains it accurately and concisely, and readers can refer to it directly.
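One way to picture this is to treat each index entry as a tuple holding every indexed field, with all entries kept in sorted order. The sketch below uses a sorted Python list in place of the B+ tree leaves; the index on (last_name, first_name), the sample rows, and the function names are all made up for illustration. It also shows a consequence of the sort order: a lookup on the leftmost column can binary-search, while a lookup on the second column alone has to check every entry.

```python
from bisect import bisect_left

# Entries of a hypothetical composite index on (last_name, first_name).
# Each entry also carries the primary key, as in an InnoDB secondary index.
index = sorted([
    ("Li", "Lei", 3),
    ("Li", "Mei", 1),
    ("Wang", "Fang", 4),
    ("Zhang", "Wei", 2),
])


def lookup_by_last_name(entries, last_name):
    """Leftmost column: binary-search to the first match, then read forward."""
    pos = bisect_left(entries, (last_name,))
    primary_keys = []
    while pos < len(entries) and entries[pos][0] == last_name:
        primary_keys.append(entries[pos][2])
        pos += 1
    return primary_keys


def lookup_by_first_name(entries, first_name):
    """Second column only: the entries are not sorted by it, so scan them all."""
    return [pk for _, first, pk in entries if first == first_name]


print(lookup_by_last_name(index, "Li"))    # [3, 1]
print(lookup_by_first_name(index, "Wei"))  # [2]
```

Both functions return the right primary keys; the difference is how much of the index each one has to read.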
A covering index is not a category of index but a query situation. As mentioned earlier, most index-based queries need a back-to-table query to fetch the other fields the client asked for. However, if the fields being queried are already fully contained in the index being used, InnoDB skips the redundant back-to-table query and returns the fields directly from the non-clustered index. This situation is called a “covering index”.
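The difference between a back-to-table query and a covered query can be sketched with two dictionaries. This is only a model of the idea: the table, its columns (id, name, age), and the function select_by_name are hypothetical, and plain dicts stand in for the two B+ trees.

```python
# Clustered index: primary key -> complete row.
clustered = {
    1: {"id": 1, "name": "alice", "age": 30},
    2: {"id": 2, "name": "bob", "age": 25},
}

# Secondary index on name: indexed value -> primary key.
name_index = {"alice": 1, "bob": 2}


def select_by_name(name, columns):
    """Return (row_fragment, went_back_to_table) for a lookup by name."""
    pk = name_index.get(name)
    if pk is None:
        return None, False
    # A secondary index entry holds only the indexed value and the primary key.
    in_index = {"name": name, "id": pk}
    if set(columns) <= set(in_index):
        # Covered: every requested column is already in the index entry.
        return {c: in_index[c] for c in columns}, False
    # Not covered: fetch the full row from the clustered index.
    row = clustered[pk]
    return {c: row[c] for c in columns}, True


print(select_by_name("alice", ["id", "name"]))  # ({'id': 1, 'name': 'alice'}, False)
print(select_by_name("alice", ["age"]))         # ({'age': 30}, True)
```

The first query is answered from the secondary index alone; the second needs the extra lookup in the clustered index.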
InnoDB started to support spatial indexes in 5.7.4 labs. In short, an ordinary index is one-dimensional, such as a number x, whereas a spatial index indexes a spatial coordinate system, such as (x, y) or (x, y, z).
InnoDB’s spatial index uses an R-tree. Interested readers can consult the relevant materials for further study. In most applications, unless geographic data is involved, spatial indexes are rarely used.
Answer the interview questions
Now we can roughly answer the interview questions:
Q: What algorithm do InnoDB indexes use?
A: The B+ tree.
Q: Why does InnoDB use B+ trees instead of other data structures?
A:
- Compared with a red-black tree, a B-tree node is a page, and that page is bound to a page on the hard disk, so hard disk access can be optimized
- Compared with a red-black tree, the depth of a B-tree is more stable and lookup time is more predictable. This is determined by the B-tree’s splitting and merging strategy, and readers can consult the materials for further understanding
- Compared with a B-tree, a B+ tree has a doubly linked list between leaf nodes, which greatly reduces the cost of traversal and OFFSET/LIMIT queries
- InnoDB’s non-clustered indexes store only the indexed values and the primary key in the B+ tree. This greatly reduces the memory and hard disk space the index occupies and makes rebuilding the index more efficient
In fact, this is not the only answer. If you are interested, you can read the reference materials further.
Q: In InnoDB, is a primary key required? What happens if you don’t specify a primary key when creating a table?
A: As answered above, a primary key is required. If none is specified and no suitable unique NOT NULL index exists, InnoDB automatically creates a hidden 6-byte auto-incrementing ID.
Q: What is the difference between a primary key and an index in InnoDB?
A: The primary key of InnoDB is a special kind of index, the clustered index, while all other indexes are non-clustered indexes. The difference is that the clustered index stores the complete row of data, while a non-clustered index stores the indexed values and the primary key.
References
- MySQL Technology Insider: InnoDB Storage Engine (2nd Edition)
- The difference between B-tree and B+ tree
- Balanced binary tree, B-tree, B+ tree, B* tree: understand one of them and you will understand them all
- The structure of a composite index on a B+ tree
- Storage structure and data lookup method of a composite index on a B+ tree
- Data structures and algorithm principles behind MySQL indexes
- I thought I knew a lot about MySQL indexes until I met Ali’s interviewer
- R-tree (Wikipedia, the free encyclopedia)
- MySQL spatial index
- Spatial index: spatial index usage report of each database
- Using geospatial data (geometry) in MySQL (1)
This article is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Original author: amc. Reprints are welcome, but please credit the source.
Original title: MySQL: engine, index and algorithm
Release date: November 9, 2020
Link to the original text: https://cloud.tencent.com/developer/article/1336510, which is also my blog