- Index related
- Transaction related
- Table structure design
- Storage engine related
- Miscellaneous questions
- Reference articles
The main audience of this article is developers, so it does not cover MySQL service deployment or other operations work. There is a lot of content, so come prepared with patience, melon seeds, and mineral water.
A while ago I studied MySQL systematically, and I also have some practical experience with it. Then I came across an article of MySQL interview questions and found that I could not answer some of them well. I knew most of the individual knowledge points, but I could not connect them.
So I decided to put together these 100 soul-searching MySQL questions, using the attempt to answer them as a way to deepen my own understanding.
This article does not explain MySQL in detail starting from how to use select. It targets the MySQL knowledge that developers need, including indexes, transactions, and optimization, and presents it as answers to high-frequency interview questions. If you have other MySQL interview questions that you find interesting or difficult, leave them in the comments or send an email to [email protected], and I will include them in this article and credit you by name.
Most of this content is explained in detail in my other articles, especially the index and transaction parts, which are covered in the two articles below. Readers who want a deeper understanding can continue there.
MySQL index principle and optimization
Transaction isolation level of MySQL
Index related
I once wrote a summary of MySQL indexes; see the article MySQL index principle and optimization.
1. What is an index?
An index is a data structure that helps us find data quickly.
2. What kind of data structure is an index?
The data structure of an index depends on how the specific storage engine implements it. MySQL uses several kinds of index, such as hash indexes and B+ tree indexes. The default index of the InnoDB storage engine, which we use most often, is the B+ tree index.
3. What is the difference between a hash index and a B+ tree index, and what are the pros and cons of each?
First, we need to know how the two are implemented underneath:
A hash index is backed by a hash table. A lookup calls the hash function once to locate the key, and then goes back to the table to fetch the actual row. A B+ tree is a multi-way balanced search tree. Every lookup starts from the root node and walks down to a leaf node to find the key, and then decides whether it needs to go back to the table depending on the query.
From this we can see the following differences:
- A hash index is (generally) faster for equality lookups, but it cannot serve range queries
Because the hash function scrambles the keys, the order of entries in a hash index no longer matches the order of the original values, so range queries cannot be supported. The nodes of a B+ tree are kept in order (the left child is smaller than its parent and the right child is larger; a multi-way tree works in the same manner), so ranges are supported naturally.
- A hash index does not support sorting by the index, for the same reason
- A hash index does not support fuzzy queries or leftmost prefix matching on a multi-column index. This is again because the hash function is unpredictable: the hashes of AAAA and AAAAB have no relation to each other
- A hash index always has to go back to the table to fetch the data, while a B+ tree can answer a query from the index alone when certain conditions are met (clustered index, covering index, etc.)
- Although a hash index is fast for equality lookups, it is not stable and its performance is hard to predict. When a key value has a large number of duplicates, hash collisions occur and efficiency may become very poor. The query efficiency of a B+ tree is relatively stable: every query goes from the root node to a leaf node, and the tree is usually only a few levels high
Therefore, in most cases choosing a B+ tree index directly gives stable and good query speed, and a hash index is not needed.
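As a rough illustration of the difference, MySQL's MEMORY engine lets you choose the index type explicitly. This is only a sketch: the table and index names (session_cache, idx_token, idx_created) and the sample values are hypothetical.

```sql
-- Hypothetical table; the MEMORY engine supports both HASH and BTREE indexes.
CREATE TABLE session_cache (
  token      CHAR(32) NOT NULL,
  user_id    INT      NOT NULL,
  created_at INT      NOT NULL,
  INDEX idx_token   USING HASH  (token),      -- suited to equality lookups only
  INDEX idx_created USING BTREE (created_at)  -- ordered, so ranges and sorting work
) ENGINE = MEMORY;

-- Equality lookup: the hash index can be used.
SELECT user_id FROM session_cache
WHERE token = 'a3f1c2d4e5f60718293a4b5c6d7e8f90';

-- Range query with ORDER BY: only the ordered BTREE index can help here.
SELECT token FROM session_cache
WHERE created_at BETWEEN 1558224000 AND 1558310400
ORDER BY created_at;
```

Running EXPLAIN on each query is a quick way to check which index the optimizer actually picks.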
4. It was mentioned above that a B+ tree does not need to go back to the table when a clustered index or a covering index applies. What is a clustered index?
In a B+ tree index, a leaf node may store either just the key value plus a reference to the row (in InnoDB, the primary key value), or the key value together with the whole row. The former is a non-clustered index and the latter is a clustered index. In InnoDB, only the primary key index is a clustered index. If the table has no primary key, InnoDB picks a unique key to build the clustered index. If there is no unique key either, it implicitly generates a key to build the clustered index.
When a query uses the clustered index, the whole row is available at the matching leaf node, so no further lookup back to the table is needed.
5. Does a non-clustered index always require a lookup back to the table?
Not necessarily. It depends on whether all the fields required by the query are contained in the index. If they are, there is no need to go back to the table.
For a simple example, suppose we build an index on the age column of the employee table. Then for the query
select age from employee where age < 20
the leaf nodes of the index already contain the age values, so no lookup back to the table is performed.
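The example above can be sketched as follows. The schema is an assumption (only the table name employee and the age column come from the text; name and idx_age are hypothetical), and the comments describe what EXPLAIN normally reports rather than guaranteed output.

```sql
-- Hypothetical schema matching the example above.
CREATE TABLE employee (
  id   INT AUTO_INCREMENT PRIMARY KEY,  -- clustered index: leaf nodes hold whole rows
  name VARCHAR(50) NOT NULL,
  age  INT NOT NULL,
  INDEX idx_age (age)                   -- secondary index: leaf nodes hold age + id
) ENGINE = InnoDB;

-- Covered by idx_age: EXPLAIN normally shows "Using index" in the Extra column.
EXPLAIN SELECT age FROM employee WHERE age < 20;

-- Also covered, because InnoDB secondary indexes store the primary key value.
EXPLAIN SELECT id, age FROM employee WHERE age < 20;

-- Not covered: name lives only in the clustered index,
-- so each match needs a lookup back to the table.
EXPLAIN SELECT name FROM employee WHERE age < 20;
```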
6. What factors should be considered when building an index?
When building an index, we usually consider how often a field is used: fields that frequently appear in query conditions are better candidates. If we need a joint index, we also have to think about the column order in it. Beyond that, there are other aspects to consider, such as not creating so many indexes that they put too much pressure on the table. All of this depends on the actual table structure and query patterns.
7. What is a joint index? Why should we pay attention to the column order in a joint index?
MySQL can build one index on several fields at the same time, which is called a joint (composite) index. To hit a joint index, the query has to use the fields in the order in which they were listed when the index was built; otherwise the index cannot be hit.
The specific reasons are as follows:
MySQL can only use an index that is ordered. Suppose a joint index on "name, age, school" has been created. The index is sorted by name first, then by age when the names are equal, and then by school when the ages are equal.
At query time, the index as a whole is strictly ordered only by name, so the query must first use an equality condition on name. Among the entries that match, the index is strictly ordered by age, so the age field can then be used for the index search, and so on. This is why the column order matters when building a joint index. In general, columns that are queried frequently or have high selectivity go first. The order can also be adjusted for particular queries or table structures.
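A small sketch of the leftmost prefix rule, using the (name, age, school) index from the text. The table name student, the extra address column, and the sample values are hypothetical.

```sql
-- Hypothetical table; address is included so the index does not cover SELECT *.
CREATE TABLE student (
  id      INT AUTO_INCREMENT PRIMARY KEY,
  name    VARCHAR(50)  NOT NULL,
  age     INT          NOT NULL,
  school  VARCHAR(100) NOT NULL,
  address VARCHAR(200) NOT NULL,
  INDEX idx_name_age_school (name, age, school)
) ENGINE = InnoDB;

-- Can seek into the index: conditions start from the leftmost column.
SELECT * FROM student WHERE name = 'Li Lei';
SELECT * FROM student WHERE name = 'Li Lei' AND age = 18;
SELECT * FROM student WHERE name = 'Li Lei' AND age = 18 AND school = 'No. 1 High School';

-- Cannot seek into the index: the leftmost column name is missing.
SELECT * FROM student WHERE age = 18;
SELECT * FROM student WHERE school = 'No. 1 High School';

-- Partial use: name and the range on age narrow the search;
-- school comes after the range column and is not used for seeking.
SELECT * FROM student WHERE name = 'Li Lei' AND age > 18 AND school = 'No. 1 High School';
```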
8. Is the index I created actually being used? Or, how can we find out why a statement runs slowly?
MySQL provides the explain command to view the execution plan of a statement. Before executing a statement, MySQL passes it through the query optimizer, which analyzes it and produces the execution plan. The plan contains a lot of information.
We can check the index-related fields to see whether an index is hit, such as possible_keys, key, and key_len. They show, respectively, the indexes the statement might use, the index actually used, and the length of the index used.
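As a minimal sketch of reading these fields, consider the query below. The orders table and its index are hypothetical, and the comments describe what one would expect to see, not captured output.

```sql
-- Hypothetical table used only to show the EXPLAIN columns.
CREATE TABLE orders (
  id      INT AUTO_INCREMENT PRIMARY KEY,
  user_id INT     NOT NULL,
  status  TINYINT NOT NULL,
  INDEX idx_user_status (user_id, status)
) ENGINE = InnoDB;

EXPLAIN SELECT id FROM orders WHERE user_id = 42 AND status = 1;
-- Columns worth checking in the result:
--   possible_keys : indexes the optimizer considered (idx_user_status here)
--   key           : the index actually chosen; NULL means no index is used
--   key_len       : bytes of the index key used; 5 would mean both columns
--                   (4 bytes for INT + 1 byte for TINYINT, both NOT NULL),
--                   while 4 would mean only user_id is used
--   rows          : the estimated number of rows to examine
```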
9. Under what circumstances is an index created on a column but not used by the query?
- The query uses a not-equal condition (!= or <>)
- The column takes part in arithmetic or is wrapped in a function
- A string like pattern starts with a wildcard, for example '%AAA'
- MySQL estimates that a full table scan is faster than using the index
- With a joint index, an earlier column is used in a range condition; the columns after it cannot use the index even if the query follows the leftmost prefix principle
In the cases above, MySQL usually cannot use the index, or can use only part of it.
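Some of these cases can be rewritten so that the column stands alone on one side of the comparison. The sketch below uses a hypothetical staff table; whether an index is finally chosen is still up to the optimizer, so it is worth confirming with EXPLAIN.

```sql
-- Hypothetical table with single-column indexes.
CREATE TABLE staff (
  id       INT AUTO_INCREMENT PRIMARY KEY,
  name     VARCHAR(50) NOT NULL,
  salary   INT         NOT NULL,
  birthday DATE        NOT NULL,
  INDEX idx_name (name),
  INDEX idx_salary (salary),
  INDEX idx_birthday (birthday)
) ENGINE = InnoDB;

-- Function on the indexed column: idx_birthday cannot be used for the search.
SELECT * FROM staff WHERE YEAR(birthday) = 1990;
-- Rewritten as a range on the bare column: idx_birthday can be used.
SELECT * FROM staff WHERE birthday >= '1990-01-01' AND birthday < '1991-01-01';

-- Arithmetic on the column versus arithmetic on the constant.
SELECT * FROM staff WHERE salary * 2 > 10000;  -- idx_salary cannot be used
SELECT * FROM staff WHERE salary > 10000 / 2;  -- idx_salary can be used

-- Leading wildcard versus prefix match.
SELECT * FROM staff WHERE name LIKE '%AAA';    -- idx_name cannot be used
SELECT * FROM staff WHERE name LIKE 'AAA%';    -- idx_name can be used
```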
Transaction related
1. What is a transaction?
The classic way to explain a transaction is the bank transfer example. I believe everyone knows it, so I will not repeat it here.
A transaction is a series of operations that satisfies the ACID properties. The most common understanding is that the operations in a transaction either all succeed or all fail. However, that alone is not enough.
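For readers who want the transfer example spelled out anyway, here is a minimal sketch. The account table and the amounts are hypothetical.

```sql
-- Hypothetical account table.
CREATE TABLE account (
  id      INT PRIMARY KEY,
  balance DECIMAL(10, 2) NOT NULL
) ENGINE = InnoDB;
INSERT INTO account VALUES (1, 100.00), (2, 50.00);

-- Move 30.00 from account 1 to account 2 as one unit of work.
START TRANSACTION;
UPDATE account SET balance = balance - 30.00 WHERE id = 1;
UPDATE account SET balance = balance + 30.00 WHERE id = 2;
COMMIT;  -- both updates become permanent together

-- If anything goes wrong before COMMIT, issue ROLLBACK instead
-- and neither update takes effect.
```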
2. What is ACID? Can you elaborate?
Atomicity: as mentioned above, a transaction either succeeds completely or fails completely. It is impossible for only part of the operations to take effect.
Consistency: the system (database) always moves from one consistent state to another, with no intermediate state.
Isolation: generally speaking, the changes of a transaction are invisible to other transactions until it is fully committed. Note the words "generally speaking", which mean there are exceptions.
Durability: once a transaction is committed, its result is permanent. Even if the system crashes, the result of the transaction is not affected.
3. What if multiple transactions are running at the same time?
Concurrent transactions generally cause the following problems:
- Dirty read: transaction A reads uncommitted changes made by transaction B, and transaction B is then rolled back
- Non-repeatable read: when transaction A is set to read only the committed changes of transaction B, two identical queries inside transaction A can return different results, because transaction B committed in between
- Phantom read: transaction A reads a range of rows, and transaction B inserts a row into that range in the meantime, producing a "phantom"
4. How can these problems be solved? Do you know the transaction isolation levels of MySQL?
The four isolation levels of MySQL are as follows:
- Read uncommitted
This is the exception mentioned above. Under this isolation level, other transactions can see the uncommitted changes of this transaction. As a result, dirty reads may occur (a transaction reads the uncommitted changes of another transaction, which is then rolled back).
This level does not offer enough of a performance advantage and has many problems, so it is rarely used.
- Read committed
Other transactions can only read what this transaction has committed. This isolation level has the non-repeatable read problem: two reads in the same transaction can return different results because another transaction modified the data in between.
- Repeatable read
The repeatable read isolation level solves the non-repeatable read problem above (as the name suggests), but it still leaves a new problem: phantom reads. Suppose you read the rows with id > 10, and a read lock is placed on all the rows involved. Another transaction then inserts a new row with id = 11. Because the row is newly inserted, it does not conflict with the locks above. The next query in the first transaction then finds a row with id = 11 that the previous query did not return, and if the first transaction tries to insert id = 11 itself, it hits a primary key conflict. (InnoDB reduces this problem at this level through MVCC snapshot reads and next-key locks on locking reads.)
- Serializable
This is the highest isolation level. It solves all the problems mentioned above because it forces all operations to execute serially. This causes concurrency to drop sharply, so it is not commonly used.
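To see the difference between read committed and repeatable read in practice, the following timeline can be run from two separate connections. It is a sketch only: the stock table and its values are hypothetical, and "Session A" and "Session B" stand for the two connections.

```sql
-- Hypothetical table for the demo.
CREATE TABLE stock (id INT PRIMARY KEY, qty INT NOT NULL) ENGINE = InnoDB;
INSERT INTO stock VALUES (1, 10);

-- Session A
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
START TRANSACTION;
SELECT qty FROM stock WHERE id = 1;   -- reads 10

-- Session B (a second connection with autocommit on)
UPDATE stock SET qty = 5 WHERE id = 1;

-- Session A again, still inside the same transaction
SELECT qty FROM stock WHERE id = 1;   -- READ COMMITTED: reads 5 (non-repeatable read)
                                      -- REPEATABLE READ: would still read 10
COMMIT;
```

Replacing READ COMMITTED with REPEATABLE READ in the first statement of Session A shows the other behavior.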
5. What isolation level does InnoDB use?
InnoDB uses the repeatable read isolation level by default
6. Do you know about MySQL locks?
When the database has concurrent transactions, data inconsistency may occur, so some mechanism is needed to keep access in order. Locking is such a mechanism.
It is like a hotel room. If people could enter and leave at will, many people would end up fighting over the same room. If a lock is installed on the room, only the person who has obtained the key can check in and lock the room, and others can use it only after that person is done.
7. What kinds of locks does MySQL have? Doesn't locking like the above hurt concurrency?
By lock type, there are shared locks and exclusive locks.
Shared lock: also called a read lock. When a user wants to read data, a shared lock is added to the data. Multiple shared locks can be held at the same time.
Exclusive lock: also called a write lock. When a user wants to write data, an exclusive lock is added to the data. Only one exclusive lock can be held, and it is mutually exclusive with other exclusive locks and with shared locks.
In the hotel example, there are two kinds of visitor. One kind comes to view the room, and it is fine for several people to view it together. The other kind stays for the night, and during that time neither people who want to stay nor people who want to view the room can get in.
The granularity of a lock depends on the specific storage engine. Row-level, page-level, and table-level locks exist; InnoDB implements row-level locks and also supports table-level locks, while page-level locks were used by the older BDB engine.
In that order (row, page, table), locking overhead goes from high to low, and concurrency also goes from high to low.
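The shared and exclusive locks described above can be tried out with locking reads. This is a sketch with a hypothetical room table; each "Session" is a separate connection.

```sql
-- Hypothetical table echoing the hotel example.
CREATE TABLE room (id INT PRIMARY KEY, guest VARCHAR(50)) ENGINE = InnoDB;
INSERT INTO room VALUES (101, NULL);

-- Session A: take a shared (read) lock on the row.
START TRANSACTION;
SELECT * FROM room WHERE id = 101 LOCK IN SHARE MODE;  -- FOR SHARE also works in MySQL 8.0

-- Session B: a second shared lock on the same row is granted immediately.
START TRANSACTION;
SELECT * FROM room WHERE id = 101 LOCK IN SHARE MODE;
COMMIT;  -- Session B releases its shared lock

-- Session C: an exclusive (write) lock has to wait for Session A.
START TRANSACTION;
SELECT * FROM room WHERE id = 101 FOR UPDATE;          -- blocks here

-- Session A
COMMIT;  -- releases the shared lock; Session C now obtains the exclusive lock
```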
Table structure design
1. Why should a table have a primary key whenever possible?
The primary key is what guarantees that each row is unique within the table. Even if the table has no natural primary key from a business point of view, it is recommended to add an auto-increment ID column as the primary key. With a primary key in place, later deletes, updates, and queries may be faster, and it is easier to keep an operation confined to the intended range of rows.
2. Should the primary key be an auto-increment ID or a UUID?
An auto-increment ID is recommended over a UUID.
In the InnoDB storage engine, the primary key index is the clustered index. That is, the primary key and all the row data are stored (in order) on the leaf nodes of the primary key B+ tree. If the primary key is an auto-increment ID, new rows only need to be appended at the end. If it is a UUID, a new ID can be larger or smaller than the existing ones, which causes many inserts into the middle of the tree and a lot of data movement, and in turn a lot of fragmentation and a drop in insert performance.
In short, with a large amount of data, an auto-increment primary key performs better.
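The two choices can be sketched as a pair of tables that differ only in the primary key. The table names and columns below are hypothetical and only illustrate the definitions.

```sql
-- Hypothetical tables with the same payload and different primary keys.
CREATE TABLE userinfo (
  id   BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,  -- new rows append at the end of the clustered index
  name VARCHAR(50) NOT NULL
) ENGINE = InnoDB;

CREATE TABLE userinfo_uuid (
  id   CHAR(36) PRIMARY KEY,                        -- non-sequential values land in arbitrary leaf pages
  name VARCHAR(50) NOT NULL
) ENGINE = InnoDB;

INSERT INTO userinfo (name) VALUES ('alice');
INSERT INTO userinfo_uuid (id, name) VALUES (UUID(), 'alice');
```

The UUID key is also wider than the integer key, and since every InnoDB secondary index stores the primary key value, that width is repeated in each secondary index.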
The figure (from High Performance MySQL) shows the measured performance of inserting 1 million and 3 million rows: tables with no suffix use an auto-increment ID, and tables with the _uuid suffix use a UUID as the primary key.
The primary key is the clustered index. If there is no primary key, InnoDB selects a unique key as the clustered index. If there is no unique key either, an implicit primary key is generated.
If you define a PRIMARY KEY on your table, InnoDB uses it as the clustered index.
If you do not define a PRIMARY KEY for your table, MySQL picks the first UNIQUE index that has only NOT NULL columns as the primary key and InnoDB uses it as the clustered index.
3. Why should fields be defined as not null?
This is how the MySQL website puts it:
NULL columns require additional space in the row to record whether their values are NULL. For MyISAM tables, each NULL column takes one bit extra, rounded up to the nearest byte.
A nullable column takes up extra space, and null values cause many unexpected situations in application code.
4. If you want to store users' password hashes, what field type should be used?
Fixed-length strings such as a password hash, a salt, or a national ID number should be stored as char rather than varchar, which saves space and improves retrieval efficiency.
Storage engine related
1. What storage engines does MySQL support?
MySQL supports a variety of storage engines, such as InnoDB, MyISAM, Memory, and Archive. In most cases, choosing InnoDB directly is the most appropriate. InnoDB is also the default storage engine of MySQL.
2. What is the difference between InnoDB and MyISAM?
- InnoDB supports transactions, while MyISAM does not
- InnoDB supports row-level locks, while MyISAM supports only table-level locks
- InnoDB supports MVCC, while MyISAM does not
- InnoDB supports foreign keys, while MyISAM does not
- Before MySQL 5.6, InnoDB did not support full-text indexes while MyISAM did; since 5.6, InnoDB supports them as well
Miscellaneous questions
1. What is the difference between varchar and char in MySQL?
char is a fixed-length type. If you declare
char(10)
then no matter how much content is actually stored, the field takes up 10 characters. varchar is variable-length: the declared length is only the maximum, and the space occupied is the actual length + 1 (or + 2 for longer columns), where the extra byte or bytes record how long the stored value is.
In terms of retrieval efficiency, char > varchar. So in practice, if the length of a field's values is fixed and known, use char; otherwise prefer varchar. For example, the MD5 hash of a password should be stored in a char column.
2. What do varchar(10) and int(10) mean?
The 10 in varchar(10) is the declared length, that is, the maximum length of data that can be stored. The 10 in int(10) is only the display width: values with fewer than 10 digits are padded with zeros on display (this takes effect only when the column is declared with zerofill). In other words, int(1) and int(10) can store the same range of numbers and occupy the same space; they differ only in how values are displayed.
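A quick way to see that the number in int(N) only affects display is the sketch below. The table name width_demo is hypothetical. Note that integer display width and ZEROFILL are deprecated as of MySQL 8.0.17, so newer servers emit a warning for this definition.

```sql
-- Hypothetical table: same type, different display widths.
CREATE TABLE width_demo (
  a INT(3)  ZEROFILL,
  b INT(10) ZEROFILL
);

INSERT INTO width_demo VALUES (7, 7), (123456, 123456);

SELECT a, b FROM width_demo;
-- First row:  a is shown as 007,    b as 0000000007
-- Second row: a is shown as 123456, b as 0000123456
-- Both columns are 4-byte unsigned integers and store the same values;
-- only the zero padding differs.
```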
3. How many logging formats does the MySQL binlog have? What are the differences?
There are three formats: statement, row, and mixed.
- In statement mode, the unit of logging is the statement. Every SQL statement that changes data is recorded. Because executing SQL depends on context, the relevant context has to be saved along with it. Also, some statements that use functions and the like cannot be replicated correctly from the recorded statement
- In row mode, the unit of logging is the change to each row, so basically everything can be recorded. However, many operations change a large number of rows (such as alter table), so this mode stores too much information and the logs become very large
- Mixed is a compromise: ordinary operations are logged as statements, and row logging is used when a statement cannot be logged safely
In addition, newer versions of MySQL have optimized the row format: when the table structure changes, the statement is recorded instead of each row.
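The current format can be inspected and, with sufficient privileges, changed per session. The sketch below also shows one kind of statement that is unsafe to replay as text; the token_log table is hypothetical.

```sql
-- Check the current format (STATEMENT, ROW, or MIXED).
SHOW VARIABLES LIKE 'binlog_format';

-- Change it for the current session; this requires sufficient privileges.
SET SESSION binlog_format = 'ROW';

-- Hypothetical table for the example below.
CREATE TABLE token_log (
  id    INT AUTO_INCREMENT PRIMARY KEY,
  token CHAR(36) NOT NULL
);

-- UUID() returns a different value every time it runs, so replaying the
-- statement text on a replica could store a different value than the source.
-- MIXED switches to row logging for statements like this one.
INSERT INTO token_log (token) VALUES (UUID());
```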
4. How do you deal with deep pagination?
Deep pagination is usually tackled from two directions.
- At the database level, which is also where we mainly focus (although the effect is not that great). A query such as
select * from employee where age > 20 limit 1000000,10
can be optimized. This statement loads 1,000,000 rows and then discards basically all of them, keeping only 10, so of course it is slow. We can change it to
select e.* from employee e join (select id from employee where age > 20 limit 1000000,10) t on e.id = t.id
(MySQL does not allow limit inside an in subquery, so a join on a derived table is used.) This still walks through a million entries, but thanks to index coverage all the fields the subquery needs are in the index on age, so it is much faster. If the IDs are continuous, we can also use
select * from employee where id > 1000000 limit 10
which is efficient as well. There are many ways to optimize, but the core idea is the same: reduce the amount of data loaded
- At the requirements level, reduce this kind of request. Mainly, do not build features such as jumping straight to a page millions of pages in; allow only page-by-page browsing or navigation along a fixed route, which is predictable and cacheable. Also guard against leaking continuous IDs and against sustained malicious requests
In practice, deep pagination is mostly solved with caching: the content is queried in advance, cached in a K-V store such as Redis, and returned directly.
The Alibaba Java development manual recommends a solution for deep pagination similar to the first one above.
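When users only move from one page to the next, the "id greater than the last one seen" idea can be combined with the original filter. This is a sketch under assumptions: the employee schema, the idx_age index, and the @last_id variable are hypothetical.

```sql
-- Hypothetical table; idx_age is assumed so the age filter can use an index.
CREATE TABLE employee (
  id   INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(50) NOT NULL,
  age  INT NOT NULL,
  INDEX idx_age (age)
) ENGINE = InnoDB;

-- @last_id is the id of the last row on the page the user has just seen.
SET @last_id = 1000000;

-- Fetch the next page by seeking past @last_id instead of counting an offset.
SELECT * FROM employee
WHERE age > 20 AND id > @last_id
ORDER BY id
LIMIT 10;
```

Unlike a large offset, this does not have to read and discard the skipped rows, but it only supports moving to the next page, not jumping to an arbitrary page number.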
5. Have you paid attention to how long SQL takes in your business systems? Do you collect statistics on slow queries? How have you optimized slow queries?
In our business systems, except for queries by primary key, I test the time cost of every query in the test database. Statistics on slow queries are mainly collected by the operations team, who regularly report the slow queries in the business back to us.
To optimize a slow query, we first need to understand why it is slow. Does the query condition miss the index? Are unnecessary columns being loaded? Or is the data volume simply too large?
So optimization also follows these three directions:
- First, analyze the statement to see whether it loads extra data. It may be fetching rows that are then discarded, or loading many columns that the result does not need. Analyze and rewrite the statement accordingly
- Analyze the execution plan of the statement to see how indexes are used, then modify the statement or the index so that the statement hits an index as far as possible
- If the statement cannot be optimized any further, consider whether the table simply holds too much data. If so, split the table horizontally or vertically
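If you have the privileges to do it yourself, the slow query log can be inspected and switched on with a few statements. This is a minimal sketch; the 1-second threshold is only an example value.

```sql
-- See whether the slow query log is on and where it is written.
SHOW VARIABLES LIKE 'slow_query_log%';

-- Threshold in seconds; the default is 10.
SHOW VARIABLES LIKE 'long_query_time';

-- Turn the log on and lower the threshold (requires sufficient privileges).
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;  -- the new global value applies to new connections
```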
6. Horizontal and vertical table splitting were mentioned above. Can you give an example suited to each?
Horizontal splitting divides a table by rows. Suppose we have a user table whose primary key is an auto-increment ID that also serves as the user ID. It holds more than 100 million rows, so queries against the single table do not perform well. We can split it by primary key ID, either by the trailing digits or by ID range. If we split it into 100 tables by the trailing digits 0-99, each table holds only about 1 million rows, and query efficiency then meets the requirements.
Vertical splitting divides a table by columns. Suppose we have an article table with the fields
id - title - summary - content
The system displays a refreshing list that contains only the title and the summary. The body content is needed only when a user clicks an article to open its details. With a large amount of data, keeping the large and rarely used content column in the same table slows down queries on the original table. We can split the table into two
id - title - summary
id - content
When the user opens the details, the content is fetched by primary key. The only extra storage is a very small primary key field, so the cost is very low.
Of course, table splitting is closely tied to the business. Before splitting, do the research and benchmarks properly, and do not act blindly on guesswork.
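Both kinds of split can be sketched in a few statements. The table names, column sizes, and the user_NN naming scheme below are hypothetical.

```sql
-- Vertical split: list fields in one table, the large body in another.
CREATE TABLE article (
  id      INT AUTO_INCREMENT PRIMARY KEY,
  title   VARCHAR(200) NOT NULL,
  summary VARCHAR(500) NOT NULL
) ENGINE = InnoDB;

CREATE TABLE article_content (
  id      INT PRIMARY KEY,        -- same value as article.id
  content MEDIUMTEXT NOT NULL
) ENGINE = InnoDB;

-- List page: touches only the small table.
SELECT id, title, summary FROM article ORDER BY id DESC LIMIT 20;

-- Detail page: one primary key lookup for the body.
SELECT content FROM article_content WHERE id = 42;

-- Horizontal split: route a user ID to one of 100 tables (user_00 .. user_99)
-- by its last two digits.
SELECT CONCAT('user_', LPAD(MOD(123456, 100), 2, '0')) AS target_table;  -- user_56
```

In a real system the routing is normally done in the application or a middleware layer; the last statement only shows the arithmetic.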
7. What is a stored procedure? What are its advantages and disadvantages?
Stored procedures are precompiled SQL statements.
1. Put more plainly: a stored procedure is a named code block made up of SQL statements. Like a method, these statements implement some function (inserting, deleting, updating, or querying one or more tables). You give the code block a name and call it whenever that function is needed.
2. A stored procedure is a precompiled code block and executes efficiently. One stored procedure can replace a large number of SQL statements, which reduces network traffic, speeds up communication, and protects data to some extent.
However, stored procedures are not recommended in Internet projects. A well-known example is the Alibaba Java development manual, which forbids them. My personal understanding is that Internet projects iterate very fast, have relatively short life cycles, and see more staff turnover than traditional projects. Under those conditions, managing stored procedures is not so convenient, and their reusability is not as good as code written in the service layer.
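For reference, a minimal stored procedure in MySQL looks like the sketch below. The member table and the count_adults procedure are hypothetical, and DELIMITER is a command of the mysql command-line client rather than part of SQL itself.

```sql
-- Hypothetical table and data.
CREATE TABLE member (id INT AUTO_INCREMENT PRIMARY KEY, age INT NOT NULL);
INSERT INTO member (age) VALUES (17), (18), (30);

-- Change the client delimiter so the semicolons inside the body
-- do not end the CREATE PROCEDURE statement early.
DELIMITER //
CREATE PROCEDURE count_adults(IN min_age INT, OUT total INT)
BEGIN
  SELECT COUNT(*) INTO total FROM member WHERE age >= min_age;
END //
DELIMITER ;

-- Call it by name and read the OUT parameter.
CALL count_adults(18, @total);
SELECT @total;  -- 2 for the three rows inserted above
```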
8. Talk about the three normal forms
First normal form: every column is atomic and cannot be split any further
Second normal form: non-primary-key columns depend on the whole primary key, not on just a part of it
Third normal form: non-primary-key columns depend only on the primary key, not on other non-primary-key columns
When designing a database structure, we should follow the three normal forms as far as possible. If we do not, there must be a good enough reason, such as performance. In fact, we often compromise on database design for the sake of performance.
9. What is the difference between # and $ in MyBatis?
I wanted to record this question on its own because it comes up so often.
# treats the incoming content as a string and passes it as a bound parameter, while $ splices the incoming value directly into the SQL statement.
Therefore, # can prevent SQL injection attacks to some extent.
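The difference can be reproduced in plain MySQL with server-side prepared statements: a #{} parameter ends up as a ? placeholder, while ${} behaves like string concatenation. The sketch below uses a hypothetical app_user table and a deliberately malicious input.

```sql
-- Hypothetical table and data.
CREATE TABLE app_user (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(50) NOT NULL);
INSERT INTO app_user (name) VALUES ('alice'), ('bob');

-- Malicious input: alice' OR '1'='1
SET @name = 'alice'' OR ''1''=''1';

-- Like #{name}: the value is bound to a placeholder and treated purely as data.
PREPARE find_user FROM 'SELECT id, name FROM app_user WHERE name = ?';
EXECUTE find_user USING @name;   -- no rows: nobody has that odd name
DEALLOCATE PREPARE find_user;

-- Like ${name}: the value becomes part of the SQL text.
SET @sql = CONCAT('SELECT id, name FROM app_user WHERE name = ''', @name, '''');
-- @sql is now: SELECT id, name FROM app_user WHERE name = 'alice' OR '1'='1'
PREPARE injected FROM @sql;
EXECUTE injected;                -- every row is returned
DEALLOCATE PREPARE injected;
```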
Finally, you are welcome to follow my personal public account [Yan Yan ten], where I post study notes for back-end engineers from time to time.
You are also welcome to contact me through the public account, by private message, or by email.
Completed on May 19, 2019
All of the above are my personal thoughts. If there are any mistakes, please point them out in the comment area.
When reposting, please credit the author and keep the link to the original.
Contact email: [email protected]
More learning notes can be found on personal blog