Young hero! How do you write good SQL?



The blogger is mainly responsible for an Alibaba Cloud MySQL database. Recently, slow-SQL alarms have been firing frequently, with the slowest statement taking up to 5 minutes. After exporting the logs, the main causes turned out to be queries that hit no index and queries with no pagination.

In fact, these are very basic mistakes; they sent a chill down my spine, and the team's technical level clearly needs improving. While reworking these SQL statements I summarized some experience to share with you. If anything is wrong, criticism and corrections are welcome.

MySQL performance

Maximum data volume

Talking about performance without specifying data volume and concurrency is meaningless. MySQL itself does not limit the maximum number of records in a single table; the limit comes from the operating system's maximum file size.

| File system | Single file size limit |
| --- | --- |
| FAT32 | 4 GB max |
| NTFS 5.0 | 2 TB max |
| EXT2 | 16 GB with 1 KB blocks; 2 TB with 4 KB blocks |
| EXT3 | 4 TB with 4 KB blocks |
| EXT4 | theoretically more than 16 TB |

Alibaba's Java development manual suggests considering splitting databases and tables only when a single table exceeds 5 million rows or 2 GB in size. Performance is determined by many factors: leaving business complexity aside, the main ones are hardware configuration, MySQL configuration, table design, and index optimization. The 5-million figure is a reference value, not an iron law.

The blogger has operated a single table with more than 400 million rows. Paging through the latest 20 records took 0.6 seconds with a statement roughly like `select field_1, field_2 from table where id < #{prePageMinId} order by id desc limit 20`, where prePageMinId is the smallest id on the previous page. The query speed was acceptable at the time, but as the data keeps growing it will eventually be overwhelmed. Splitting databases and tables is a long-term, high-risk job, so optimize the current structure as much as possible first (upgrade hardware, migrate historical data, and so on), and split only when there is no other way. Readers interested in the topic can read up on the basic ideas behind splitting databases and tables.
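The keyset (seek-based) pagination pattern above can be sketched as follows; the table and column names are illustrative, and the literal id stands in for `#{prePageMinId}`:

```sql
-- Seek-based paging: start from the smallest id of the previous page
-- instead of counting past an ever-growing offset.
select field_1, field_2
from big_table
where id < 9000000        -- #{prePageMinId}, carried over by the application
order by id desc
limit 20;
```

The cost stays roughly flat regardless of page depth, because InnoDB seeks directly to the primary-key position instead of scanning and discarding offset rows.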

Maximum concurrent number

The number of concurrent requests is how many requests the database can process at the same time, determined by max_connections and max_user_connections. max_connections is the maximum number of connections for the MySQL instance, with an upper limit of 16384; max_user_connections is the maximum number of connections per database user. MySQL allocates a buffer for every connection, so more connections consume more memory. Set the limit too high and the hardware cannot bear it; set it too low and the hardware is underused. Generally, a usage ratio above 10% is appropriate; it is calculated as follows:

max_used_connections / max_connections * 100% = 3/100 * 100% ≈ 3%

View the maximum number of connections and the maximum number of user connections:

show variables like '%max_connections%';
show variables like '%max_user_connections%';

Modify the maximum number of connections in the configuration file my.cnf:

[mysqld]
max_connections = 100
max_user_connections = 20

Keep queries within 0.5 seconds

It is recommended to keep a single query under 0.5 seconds. 0.5 seconds is an empirical value derived from the 3-second rule of user experience: if an operation gets no response within 3 seconds, users become impatient or even leave. Response time = client UI rendering time + network request time + application processing time + database query time, and 0.5 seconds is roughly 1/6 of that budget left for the database.

Implementation principle

Compared with NoSQL databases, MySQL is a delicate, fragile thing, like the frail classmate in PE class: the slightest dispute and she falls out with everyone (hard to scale out), she is out of breath after two steps (small capacity, low concurrency), and she often calls in sick (too many SQL constraints). Rather than making the database distributed, make the application distributed: have the database do less and the application do more.

  • Make full use of indexes, but do not abuse them; note that indexes also consume disk and CPU.
  • Do not use database functions to format data; hand that over to the application.
  • Do not use foreign key constraints; let the application guarantee data accuracy.
  • In write-heavy, read-light scenarios, avoid unique indexes; let the application guarantee uniqueness.
  • Add redundant fields where appropriate and try creating intermediate tables, computing intermediate results in the application: trade space for time.
  • Never run extremely time-consuming transactions; split them into smaller transactions in the application.
  • Estimate the load and data growth trend of important tables (such as the order table) and optimize them in advance.

Table design

Data types

The principle for choosing data types: the simpler the better, and the less space the better.

  • If the range suffices, prefer tinyint, smallint, and mediumint over int.
  • If the string length is fixed, use char.
  • If varchar suffices, do not use text.
  • For high-precision values, use decimal, or store them in a bigint (for example, multiply an amount with exactly two decimal places by 100 before saving).
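As a minimal sketch of these rules (the table and its columns are hypothetical):

```sql
create table product (
    id          bigint unsigned  not null auto_increment,
    category    tinyint unsigned not null,  -- small range: tinyint instead of int
    sku_code    char(12)         not null,  -- fixed length: char
    title       varchar(128)     not null,  -- bounded length: varchar instead of text
    price_cents bigint           not null,  -- 19.99 yuan stored as 1999
    primary key (id)
) engine = InnoDB;
```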

Try to use timestamp instead of datetime.

| Type | Bytes | Range |
| --- | --- | --- |
| datetime | 8 | '1000-01-01 00:00:00' to '9999-12-31 23:59:59' |
| timestamp | 4 | '1970-01-01 00:00:01' UTC to '2038-01-19 03:14:07' UTC |

Compared with datetime, timestamp uses less space, is stored in UTC, and converts time zones automatically.

Avoid null values

In MySQL, a null field still takes up space and makes indexes and index statistics more complicated. An update from null to a non-null value cannot be done in place and can cause index page splits, hurting performance. Replace null with a meaningful default wherever possible; this also avoids `is not null` checks in SQL statements.
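For example (hypothetical table), declaring columns not null with a sensible default sidesteps both problems:

```sql
create table member (
    id        bigint unsigned not null auto_increment,
    nick_name varchar(64) not null default '',  -- '' instead of null
    login_cnt int         not null default 0,   -- 0 instead of null
    primary key (id)
);

-- No `is not null` check needed in queries:
select id from member where nick_name <> '';
```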

Text type optimization

Because text fields store large amounts of data, the table grows large very early, which hurts query performance on the other fields. It is recommended to extract the text column into a child table associated with the parent by the business primary key.
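One possible layout, with hypothetical table names, moves the bulky text column into a child table keyed by the business primary key:

```sql
-- Hot table: only the frequently queried columns
create table article (
    id    bigint unsigned not null auto_increment,
    title varchar(128) not null,
    primary key (id)
);

-- Cold child table: the large text column, joined only when needed
create table article_content (
    article_id bigint unsigned not null,  -- business primary key of article
    content    text not null,
    primary key (article_id)
);
```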

Index optimization

Index classification

  1. Normal index: the most basic index.
  2. Composite index: an index on multiple fields that speeds up queries with compound conditions.
  3. Unique index: like a normal index, but the indexed values must be unique; null values are allowed.
  4. Composite unique index: the combination of column values must be unique.
  5. Primary key index: a special unique index that uniquely identifies a record in the table; null is not allowed. It is usually created via the primary key constraint.
  6. Full-text index: for searching large volumes of text. Since MySQL 5.6, both InnoDB and MyISAM support full-text indexes, but due to limited accuracy and scalability, most companies choose Elasticsearch instead.

Index optimization

  1. Paging queries are very important. If a query touches more than roughly 30% of the table's data, MySQL will not use the index.
  2. Keep the number of indexes on a single table within 5, and the number of fields in a single index within 5.
  3. Use prefix indexes for strings, keeping the prefix length within 5 to 8 characters.
  4. A field with very low cardinality, such as a soft-delete flag or gender, is not worth indexing.
  5. Make good use of covering indexes, for example:

    select login_name, nick_name from member where login_name = ?

Building a composite index on the login_name and nick_name fields is faster than a single-column index on login_name alone, because the query can be answered from the index itself.
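A sketch of the covering index (the index name is arbitrary):

```sql
alter table member add index idx_login_nick (login_name, nick_name);

-- Both selected columns live inside the index, so InnoDB can answer
-- without visiting the clustered index; EXPLAIN shows "Using index".
explain select login_name, nick_name from member where login_name = 'foo';
```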


Batch processing

When the blogger was a child, he watched a small outlet being dug in a fish pond to drain water, with all kinds of debris floating on the surface. Duckweed and leaves always slipped through the outlet smoothly, while branches blocked everything else and sometimes got stuck, needing to be cleared by hand. MySQL is the fish pond; the maximum concurrency and network bandwidth are the outlet; user SQL statements are the floating objects. Queries without paging parameters, and updates or deletes that touch large amounts of data, are the branches: we need to break them up and process them in batches.

For example, the business requirement is to set all of a user's expired coupons to unavailable. The SQL statement is `update coupon set status = 0 where expire_date <= #{currentDate} and status = 1;`. If a large number of coupons need updating, executing this single statement may block other SQL. Batch-processing pseudocode:

int PAGE_SIZE = 100;
while (true) {
    // Rows just updated no longer match status = 1, so always fetch the first page
    List<Integer> batchIdList = queryList('select id FROM `coupon` WHERE expire_date <= #{currentDate} and status = 1 limit #{PAGE_SIZE}');
    if (CollectionUtils.isEmpty(batchIdList)) {
        break;
    }
    update('update `coupon` set status = 0 where id in #{batchIdList}');
}

Operator <> optimization

Generally, the <> (!=) operator cannot use an index. For example, querying orders whose amount is not 100 yuan: `select id from orders where amount != 100;`. If very few orders have an amount of exactly 100 and the data distribution is heavily skewed, the index might still be used. Given this uncertainty, use union to combine the two ranges. Rewritten as follows:

(select id from orders where amount > 100)
union all
(select id from orders where amount < 100 and amount > 0)

Or optimization

Under the InnoDB engine, or cannot use a composite index. For example:

select id,product_name from orders where mobile_no = '13421800407' or user_id = 100;

The or prevents the query from hitting the composite index on mobile_no + user_id; rewrite it with union, as follows:

(select id, product_name from orders where mobile_no = '13421800407')
union
(select id, product_name from orders where user_id = 100);

When both the id and product_name fields are indexed, the query is at its most efficient.

In optimization

  1. in suits a large outer table with a small subquery table; exists suits a small outer table with a large subquery table. As the query optimizer keeps improving, the two perform almost identically in many scenarios.
  2. Try to change to join query, for example:

select id from orders where user_id in (select id from user where level = 'VIP');

Join is adopted as follows:

select o.id from orders o left join user u on o.user_id = u.id where u.level = 'VIP';

No column operation

Generally, performing an operation on a column in the query condition invalidates the index. For example, to query today's orders:

select id from orders where date_format(create_time, '%Y-%m-%d') = '2019-07-01';

The date_format function prevents the query from using the index. Rewritten:

select id from orders where create_time between '2019-07-01 00:00:00' and '2019-07-01 23:59:59';

Avoid select all

Unless you need every column in the table, avoid SELECT *: it tends to trigger full table scans and cannot take advantage of covering indexes.

Like optimization

Like is used for fuzzy queries, for example (the field has been indexed):

SELECT column FROM table WHERE field like '%keyword%';

This query misses the index. Change to the following expression:

SELECT column FROM table WHERE field like 'keyword%';

Removing the leading % lets the query hit the index. But does the product manager really insist on a double-ended fuzzy match? Full-text indexes are worth a try, but Elasticsearch is the ultimate weapon.
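If double-ended matching really is a requirement, a full-text index (InnoDB since MySQL 5.6) is worth trying before reaching for Elasticsearch; the table and column names here are illustrative:

```sql
alter table article add fulltext index ft_title (title);

-- Word-based search instead of a leading-wildcard table scan
select id from article
where match(title) against ('keyword' in natural language mode);
```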

Join optimization

MySQL implements join with the nested-loop join algorithm: the result set of the driving table serves as the base data and is used, row by row, as filter conditions to query the next (driven) table, after which the results are merged. With multiple joins, the previous result set becomes the loop data for querying the next table.

  1. Add query conditions to both the driving and driven tables wherever possible; prefer filtering in the ON clause over WHERE, and use a small result set to drive a large one.
  2. Index the driven table's join field; if an index is impossible, set a sufficiently large join buffer size.
  3. Never join more than three tables; try adding redundant fields instead.
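A sketch of these rules with hypothetical tables: index the driven table's join field and filter both sides as early as possible, so a small result set drives the large one:

```sql
-- Index the join field of the driven table
alter table orders add index idx_user_id (user_id);

-- The small, pre-filtered set of VIP users drives the large orders table;
-- the extra condition on orders is pushed into the ON clause.
select o.id, o.amount
from user u
join orders o on o.user_id = u.id and o.status = 1
where u.level = 'VIP';
```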

Limit optimization

limit is used for paging queries; the further back you page, the worse the performance. The guiding principle for the fix is: reduce the scan range. For example:

select * from orders order by id desc limit 100000, 10;   -- takes 0.4 seconds
select * from orders order by id desc limit 1000000, 10;  -- takes 5.2 seconds

First locate the boundary id to narrow the scan range, written as follows:

select * from orders where id <= (select id from orders order by id desc limit 1000000, 1) order by id desc limit 0, 10;  -- takes 0.5 seconds

If the filter involves only the primary key id, it can be written as follows:

select id from orders where id between 1000000 and 1000010 order by id desc;  -- takes 0.3 seconds

What if the above approaches are still too slow? Then you have to fall back on cursors; readers who are interested can look into JDBC's support for cursor-based (streaming) queries.

Other databases

As a back-end developer, you must master MySQL or SQL Server as your core storage, while also keeping an eye on NoSQL databases: they are mature enough and widely used to solve performance bottlenecks in specific scenarios.

| Category | Database | Characteristics |
| --- | --- | --- |
| Key-value | Memcache | Content caching; heavy read load over large amounts of data |
| Key-value | Redis | Content caching; supports more data types than Memcache and can persist data |
| Column store | HBase | Core database of the Hadoop ecosystem; massive structured data storage; a must-have for big data |
| Document | MongoDB | Well-known document database; can also be used for caching |
| Document | CouchDB | Apache open-source project focused on ease of use; supports a REST API |
| Document | SequoiaDB | Well-known Chinese document database |
| Graph | Neo4j | Used for social networks, relationship graphs, recommendation systems, etc. |