MO_ On SQL optimization

Time:2020-12-1

1、 Introduction

This paper is a review summary of SQL optimization. It mainly records how to use index to optimize SQL. The database is mysql. It is mainly discussed from three parts.
Part one: understand the underlying data structure of MySQL index.
The second part: explain in detail.
The third part: MySQL index best practices.

It is strongly recommended that:Due to the long length of this paper, there are many contents. It is recommended that readers read only part of the book at a time. Please do not read it all at once (not conducive to digestion and absorption, except for the big guy).

2、 MySQL index

2.1 brief introduction to index

Index is a sort of ordered data structure to help MySQL get data efficiently.

2.2 data structure of index

First of all, the data structure of MySQL index is B + tree.

From binary tree to B + tree, we can gradually understand why MySQL index uses B + tree.

Before we begin to discuss the specific data structure, we should have a preliminary understanding of what data structure is and what is the role of data structure?
If you know design patterns, or read my "Mo"_ Or's single case mode review summary.
Design patterns provide quality solutions for different types of problems.
Correspondingly, the data structure provides a high-quality storage scheme for different data,
Of course, the same as the design pattern, these solutions are also obtained through the countless practice and error improvement of the predecessors.
Then, the following will formally enter the explanation of several data structures.

2.2.1 binary tree

Before we understand binary trees, we need to understand why indexes use data structures?
Combined with the figure below, if we need to read col2 = 89 without using the data structure (left side of the picture), then the disk needs to do 6 I / O.
If the binary tree is used to store data (right side of the picture), the disk only needs 2 I / O, which significantly reduces the number of I / O and improves its efficiency.
From this we can understand why the index needs to use data structure. So why not use a binary tree for an index and use B + tree instead?

MO_ On SQL optimization

On the right side of the above figure is a common binary tree model.
The rule of binary tree is that the element on the left of the next node is smaller than that of the previous node (22 < 34),
The element to the right of the next node is greater than or equal to that of the previous node (89 > = 34).
Combined with the figure below, we can see why the index does not use a binary tree.

MO_ On SQL optimization

As can be seen from the above figure, when the elements are gradually increasing, the element nodes of the binary tree are arranged in a single column.
At this time, if we want to take element 6, we can only make the disk do 6 I / O.
So in order to solve this problem, we can use the red black tree (balanced binary tree).

2.2.2 red black tree (balanced binary tree)

This is also the case when the elements are increasing in turn.

MO_ On SQL optimization

It can be seen that red and black trees are balanced on the basis of binary trees, instead of being arranged in a single column.
But why does the index still not use red and black trees? Because at present, the amount of data is small and the level is not high, but there are usually hundreds of thousands, millions or even tens of millions of data in the database.
We can estimate that if there are 1 million data stored in the table, that is, 2 ^ n = 1 million, n = log (2) (1 million), n ≈ 20.
This means that if we need to get the number on the deepest node, we need to read and write 20 or more times of I / O.
From this, we can see that the performance of the mangrove is still poor in large amounts of data. It does not conform to the feature that index can obtain data efficiently.

2.2.3 B-tree

In order to solve the problem that the red black tree still has a deep hierarchy when it stores a large amount of data. So there's a better solution, B-tree. Let's take a look at the model of B-tree.

MO_ On SQL optimization

As can be seen from the above figure, on the basis of red and black trees. The leaf node of B-tree (similar rows of 15, 56, 77) can store multiple elements instead of only one element.
This greatly increases the utilization space of each leaf node and reduces the number of layers. But we also see that there is data under each digital node.
Then a new problem arises. This data is the stored data. When a row of data is too large, the number of elements that each leaf node can hold will be reduced correspondingly.
We can query the size of the leaf node through the following SQL, which is usually 16kb,
SHOW GLOBAL STATUS LIKE 'INNODB_page_size';
If one data is 1KB, it means that each leaf node can hold up to 16 elements. So when the amount of data is too large, there are still problems like red and black trees.

2.2.4 B+TREE

So it's finally B + tree's turn to play. Let's see how B + tree skillfully solves the problems faced by B-tree.

MO_ On SQL optimization

It can be seen that the B + tree non leaf node (the leaf node is the bottom row) does not store data, but stores indexes (redundancy).
Only leaf nodes store data and contain all indexes. So what's the point?
As mentioned above, the leaf node size is usually 16 KB. If the index (redundancy) is bigint and blank (the starting position of the connecting arrow is actually the pointer), then 8b + 6B = 14b (estimated).
Then the number of elements that each leaf node can hold: 16kb = 16 * 1024b, n = 16 * 1024 / 14 ≈ 1170. That means that the leaf node can store about 1170 elements.
If the data is 1KB, the same as B-tree, that is, it can hold 16 elements. Then the total number of non leaf nodes is 1170 * 1170 * 16 ≈ 22 million elements.
Generally speaking, the level of B + tree is 2-4 layers, and tens of millions of data can be accurately located only by 2-4 I / O.
In the case of large amount of data, it can still obtain data efficiently, which is the reason why the underlying data structure of the index is B + tree.

2.3 leftmost prefix principle

Here is only a brief summary of its principle. How to embody it in SQL will be interpreted in combination with Part 3, explain.
When using a federated index (an index composed of multiple columns), the query should follow the order from left to right, and the middle column cannot be skipped. Otherwise, the index will be invalid.

3、 Explain

3.1 explain

In the previous part, we have a more in-depth understanding of the index, but don't worry. This part will not discuss in detail how to use the index to optimize SQL.
Before that, we need to understand a tool for analyzing SQL performance, which is explain.
Using the explain keyword, you can simulate the optimizer to execute SQL statements and analyze the performance bottlenecks of query statements or structures. We can carry out corresponding optimization according to the analysis results.

Now let's take a look at explain in combination with SQL.
--Cast
DROP TABLE IF EXISTS `actor`;
CREATE TABLE `actor`  (
  `id` int(11) NOT NULL,
  `name` varchar(45) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  `update_time` datetime NULL DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Compact;

INSERT INTO `actor` VALUES 
(1, 'a', '2018-10-23 16:04:40'),
(2, 'b', '2018-10-23 16:04:40'),
(3, 'c', '2018-10-23 16:04:40');

--Movie list
DROP TABLE IF EXISTS `film`;
CREATE TABLE `film`  (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(10) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE,
  INDEX `idx_name`(`name`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 4 CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Compact;

INSERT INTO `film` VALUES 
(1, 'film1'),
(2, 'film2'),
(3, 'film0');

--Film actor relationship table
DROP TABLE IF EXISTS `film_actor`;
CREATE TABLE `film_actor`  (
  `id` int(11) NOT NULL,
  `film_id` int(11) NOT NULL,
  `actor_id` int(11) NOT NULL,
  `remark` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE,
  INDEX `idx_film_actor_id`(`film_id`, `actor_id`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Compact;

INSERT INTO `film_actor` VALUES 
(1, 1, 1, NULL),
(2, 1, 2, NULL),
(3, 2, 1, NULL);
mysql> EXPLAIN SELECT * FROM actor;

MO_ On SQL optimization

The figure above shows the result of performing explain display. If there is a join connecting multiple tables, one more row will be output for each join table.
The type column is a column that needs a long time to understand. Of course, with the increase in the number of uses, practice makes perfect, so there is no need to memorize.

3.1.1 ID column

1) The ID number is the sequence number of the select. If there are several select, there are several IDS, and the order of the IDS is increased according to the order of the select.
2) The larger the ID, the higher the execution priority. If the ID is the same, it will be executed from top to bottom. If the ID is null, it will be executed last.

3.1.2 select type column

The select type column represents simple or complex queries. There are the following types:
1) Simple: simple query without subquery, derived and union.
2) Primary: select the outermost layer of a complex query.
3) Subquery: subquery after select (excluding after from)
4) For the subquery after derived: from, MySQL will store the query results into a temporary table, also known as a derived table. Let's take a look at SQL:
EXPLAIN SELECT
    ( SELECT 1 FROM actor WHERE id = 1 ) 
FROM
    ( SELECT * FROM film WHERE id = 1 ) der;

MO_ On SQL optimization

5) Union: the second and subsequent select in the union.

3.1.3 table column

This column represents which table is being accessed by a row of explain.
When there are subqueries in from, the column is displayed as < derived n >, where n represents the ID number, which means that the query with id = n is executed first.
When there is a union, the value of the table column of union result is < union1,2 >, and 1 and 2 represent the selected row ID of the union.

3.1.4 type column

This column represents the association type or access type, that is, MySQL determines how to find the rows in the table and the approximate range of data row records.
The order from the best to the worst is system > const > Eq_ ref > ref > range > index > ALL。
In general, you have to ensure that the query reaches the range level, preferably Ref.
Null: MySQL can decompose the query statement in the optimization phase, and there is no need to access the table or index at the execution stage.
For example, if you select the minimum value in the index column, you can find the index separately and do not need to access the table at execution time 

const, system:
MySQL can optimize a part of the query and convert it into a constant (see the results of showwarnings).
When all columns of primary key or unique key are compared with constants, the table has at most one matching row, which is read once, which is faster.
System is a special case of const. If only one tuple in the table matches, it is system.

eq_ref:
All parts of the primary key or unique key index are used by the join, and only one qualified record will be returned at most.
This is probably the best join type outside const, which does not appear in a simple select query.

Ref: compared with EQ_ Ref, instead of using a unique index, use a partial prefix of a normal index or a unique index,
To compare an index to a value, multiple rows may be found that match the criteria.

Range: range scanning usually appears in operations such as in (), between, >, <, > =. Uses an index to retrieve rows for a given range.

Index: scans the full table index, which is usually faster than all.

All: full table scan, which means that MySQL needs to find the required rows from the beginning to the end. In general, this needs to be optimized by increasing the index.

3.1.5 possible_ Keys column

This column represents the index column that may be used.
In explain, the possible keys column has a value and the key column is null. When possible, because of the small amount of data, MySQL thinks that the full table scanning efficiency is higher.
If the column is null, there is no associated index. At this time, we can consider adding appropriate index to improve the query efficiency.

3.1.6 key column

This column represents the index actually used by mysql.
If no index is used, the column is null.
If you want to force Mysql to use or ignore possible_ In the query, use force index and ignore index.

3.1.7 key_ Len column

This column represents the number of bytes that MySQL uses the index. When using the federated index, it uses the key_ Len will know which columns are being used.
For example, the following SQL, key_ If len = 4, you can infer that only the ID column in the federated index is used, because int is 4 bytes.
mysql> EXPLAIN SELECT * FROM film_actor WHERE film_id = 2;

MO_ On SQL optimization

key_ The calculation rules of len are as follows:
1) String
    Char (n): n byte length
    Varchar (n): 2 bytes for string length, 3N + 2 for UTF-8
2) Numerical type
    Tinyint: 1 byte
    Smallint: 2 bytes
    Int: 4 bytes
    Bigint: 8 bytes
3) Time type
    Date: 3 bytes
    Timestamp: 4 bytes
    Datetime: 8 bytes
If the field is allowed to be null, whether the 1-byte record is null is required

Column 3.1.8 ref

This column shows the columns or constants used by the table to find values in the index of key column records. The common ones are const (constant) and field name (for example: film.id )。

3.1.9 rows column

This column is the number of rows that MySQL estimates to read and detect. Note that this is not the number of rows in the result set.

3.1.10 extra column

This column shows additional information. Common important values are as follows:
1) Using index: using the overlay index.
2) Using where: the where statement is used to process the results, and the columns of the query are not covered by the index.
3) Using index condition: the query column is not completely covered by the index. In the where condition, there is a range of leading columns.
4) Using temporary: MySQL needs to create a temporary table to process the query. In this case, it is generally necessary to optimize. First of all, we think of using index to optimize.
5) Using filesort: the external sort will be used instead of the index sort. When the data is small, it will be sorted from memory, otherwise it needs to be sorted on disk. In this case, it is generally necessary to consider using index to optimize.
6) Select tables optimized away: use some aggregate functions (such as Max, min) to access a field with an index.

4、 Indexing best practices

--Employee list
CREATE TABLE `employees` (
`id` INT ( 11 ) NOT NULL AUTO_INCREMENT,
`Name ` varchar (24) not null default 'comment' name ',
`Age ` int (11) not null default '0' comment 'age',
`Position ` varchar (20) not null default 'comment' position ',
`hire_ time` TIMESTAMP NOT NULL DEFAULT CURRENT_ Timestamp comment 'entry time',
PRIMARY KEY ( `id` ),
KEY `idx_name_age_position` ( `name`, `age`, `position` ) USING BTREE 
) ENGINE = INNODB AUTO_ Increment = 4 default charset = utf8 comment: 'employee record form';

INSERT INTO employees ( NAME, age, position, hire_time )
VALUES
    ( 'LiLei', 22, 'manager', NOW( ) ),
    ( 'HanMeimei', 23, 'dev', NOW( ) ),
    ( 'Lucy', 23, 'dev', NOW( ) );

4.1 full value matching

EXPLAIN SELECT * FROM employees WHERE name= 'LiLei';

MO_ On SQL optimization

EXPLAIN SELECT * FROM employees WHERE name= 'LiLei' AND age = 22;

MO_ On SQL optimization

EXPLAIN SELECT * FROM employees WHERE name= 'LiLei' AND age = 22 AND position ='manager';

MO_ On SQL optimization

4.2. Leftmost prefix rule

If multiple columns are indexed, follow the leftmost prefix rule. This means that the query starts at the top left of the index and does not skip columns in the index.

4.3. If you do not do any operation (calculation, function, (automatic or manual) type conversion) on the index column, it will lead to index invalidation and turn to full table scanning

4.4. The storage engine cannot use the column to the right of the range condition in the index

EXPLAIN SELECT \* FROM employees WHERE name= 'LiLei' AND age > 22 AND position ='manager';

MO_ On SQL optimization

4.5. Try to use overlay index (only query accessing index (index column contains query column)) to reduce select * statement

4.6. MySQL is not equal to (! =Or < >) when the index cannot be used, it will cause a full table scan

4.7. Is null, is not null also cannot use index

4.8. Like starts with a wildcard character (‘$ABC…’). If the MySQL index fails, it will become a full table scan operation

EXPLAIN SELECT * FROM employees WHERE name like '%Lei'

MO_ On SQL optimization

4.9. The string index is invalid without single quotation marks

4.10. Use or or or in less. When using it to query, MySQL does not necessarily use index. MySQL internal optimizer will evaluate whether to use index according to the retrieval proportion, table size and other factors. See range query optimization for details

EXPLAIN SELECT * FROM employees WHERE name = 'LiLei' or name = 'HanMeimei';

MO_ On SQL optimization

4.11. Range query optimization

Add a single value index to age.
ALTER TABLE `employees` ADD INDEX `idx_age` ( `age` ) USING BTREE;
EXPLAIN SELECT * FROM employees WHERE age >= 1 AND age <= 100;

MO_ On SQL optimization

Reasons for not following the index: MySQL internal optimizer will evaluate whether to use the index according to the retrieval proportion, table size and other factors.
For example, in this example, it may be that the optimizer finally chooses not to go to the index because of the large amount of data in a single query.
Optimization method: a large range can be split into multiple small ranges.
EXPLAIN SELECT * FROM employees WHERE age >= 1 AND age <= 50;
EXPLAIN SELECT * FROM employees WHERE age >= 51 AND age <= 100;

MO_ On SQL optimization


All of the above codes have been executed in the machine without error.

5、 Reference

Dynamic demonstration model of data structure

MO_ Review summary of single case model of or

30 suggestions for writing high quality SQL

6、 Finally

If there is any deficiency, please correct it.
If you are thirsty for knowledge, you are foolish if you are modest.