There are three solutions to the optimization process of MySQL 20 million data large tables at a time!

Time:2021-11-29

Problem overview

Using Alibaba cloud RDS for MySQL database (i.e. MySQL version 5.6), a user’s online record table has nearly 20 million data in six months, and 40 million data in the last year. The query speed is extremely slow and daily stuck. Serious impact on business.

Problem premise: in the old system, the people who designed the system at that time probably didn’t graduate from college. The table design and SQL statements are not only rubbish, but can’t look directly at it. The original developers have left their jobs and come to me for maintenance. This is the legendary “run away if you can’t maintain it”, and then I’m the one who fell into the pit!!!

I try to solve this problem. So, I have this log.

Programme overview

Scheme 1: optimize the existing MySQL database. Advantages: it does not affect the existing business, the source program does not need to modify the code, and the cost is the lowest. Disadvantages: there is an optimization bottleneck, and the amount of data is over 100 million.

Scheme 2: upgrade the database type and change to a database 100% compatible with MySQL. Advantages: it does not affect the existing business, the source program does not need to modify the code, and you can improve the database performance without doing anything. Disadvantages: spend more money

Scheme 3: one step, big data solution, replacement of newsql / NoSQL database. Advantages: strong scalability, low cost, no data capacity bottleneck, disadvantages: need to modify the source code

The above three schemes can be used in sequence. If the data volume is below 100 million, it is not necessary to change NoSQL, and the development cost is too high. I have tried all three solutions, and have formed a landing solution. During this process, I have expressed my condolences to the developers running away for ten thousand times:)

Detailed description of scheme 1: optimize the existing MySQL database

With ALI cloud database big brother telephone communication and Google solution and asked the group of big guys, summarized as follows (are the cream):

  • 1. Performance should be considered in database design and table creation
  • 2. Pay attention to optimization when writing SQL
  • 3. Zoning
  • 4. Sub table
  • 5. Sub warehouse

1. Performance should be considered in database design and table creation

Mysql database itself is highly flexible, resulting in insufficient performance, which seriously depends on the ability of developers. That is to say, if the developer’s ability is high, the MySQL performance is high. This is also a common problem in many relational databases, so the salary of DBAs in companies is usually very high.

Pay attention to the following when designing the table:

  • 1. Avoid null values in table fields. Null values are difficult to query and optimize and occupy additional index space. It is recommended that the default number 0 replace null.
  • 2. Try to use int instead of bigint. If it is not negative, add unsigned (the numerical capacity will be doubled). Of course, tinyint, smallint and medium can be used_ Int is better.
  • 3. Use enumeration or integer instead of string type
  • 4. Try to use timestamp instead of datetime
  • 5. Do not have too many fields in a single table. It is recommended to be within 20
  • 6. Use integer to store IP

Indexes

  • 1. The more indexes are not the better. You should create indexes according to the query. Consider establishing indexes on the columns involved in the where and order by commands. You can check whether the index or full table scanning is used according to explain
  • 2. Try to avoid judging the null value of the field in the where clause, otherwise the engine will give up using the index and scan the whole table
  • 3. Fields with sparse value distribution are not suitable for indexing, such as “gender”, which has only two or three values
  • 4. Only prefix index is built for character field
  • 5. It is better not to use character fields as primary keys
  • 6. Without foreign keys, the program guarantees the constraints
  • 7. Try not to use unique, which is guaranteed and constrained by the program
  • 8. When using multi column indexes, the order of query should be consistent with the query criteria, and unnecessary single column indexes should be deleted

In short, use the appropriate data type and select the appropriate index

Select the appropriate data type
(1) Use the smallest data type that can store data, integer < date, time < char, varchar < blob
(2) With simple data types, integer processing is less expensive than character processing because string comparison is more complex. For example, int type, storage time type, bigint type to IP function
(3) With reasonable field attribute length, fixed length tables will be faster. Use enum, char instead of varchar
(4) Use not null to define fields whenever possible
(5) Text should be used as little as possible. It is better to divide the table # and select the appropriate index column
(1) Columns frequently queried, columns appearing in where, group by, order by, on clauses
(2) Columns with <, < =, =, >, > =, between, in, and like string + wildcard (%) in the where condition
(3) For columns with small length, the smaller the index field, the better, because the storage unit of the database is page, and the more data can be stored in a page, the better
(4) Columns with large dispersion (many different values) are placed in front of the joint index. View the dispersion by counting different column values. The greater the count, the higher the dispersion:

The original developer has run away, and the table has already been established, so I can’t modify it, so: the wording can’t be implemented, give up!

2. Pay attention to optimization when writing SQL

1. Use limit to limit the records of query results

2. Avoid selecting * and list the fields to be found

3. Use join instead of subquery

4. Split large delete or insert statements

5. You can find the slow SQL by opening the slow query log

6. No column operation: select id where age + 1 = 10. Any column operation will lead to table scanning, including database tutorial functions, calculation expressions, etc. during query, move the operation to the right of the equal sign as much as possible

7. The SQL statement should be as simple as possible: one SQL can only be calculated in one CPU; Large statements are disassembled from small statements to reduce locking time; A large SQL can block the entire database

8. Or is rewritten as in: the efficiency of or is n level, the efficiency of in is log (n) level, and the number of in is recommended to be controlled within 200

9. It is implemented in the application without functions and triggers

10. Avoid% XXX type query

11. Use less join

12. Use the same type for comparison, such as’ 123 ‘and’ 123 ‘ratio, 123 and 123 ratio

13. Try to avoid using in where clause= Or < > operator, otherwise the engine will abandon the index and perform a full table scan

14. For continuous values, use between 1 and 5 instead of in: select id from t where num between

15. Do not take the whole table for the list data, use limit to page, and the number of each page should not be too large

The original developer has run away and the program has been launched. I can’t modify the SQL, so: the wording can’t be executed. Give up!

engine

At present, MyISAM and InnoDB engines are widely used:

MyISAM

MyISAM engine is the default engine for MySQL version 5.1 and earlier. It is characterized by:

1. Row locks are not supported. When reading, all tables to be read are locked, and when writing, exclusive locks are added to the tables

2. Transaction not supported

3. Foreign keys are not supported

4. Security recovery after crash is not supported

5. When the table has read query, it supports inserting new records into the table

6. Support the first 500 character index of blob and text and full-text index

7. Support delayed index update, greatly improving write performance

8. For tables that will not be modified, compressed tables are supported to greatly reduce the occupation of disk space

InnoDB

InnoDB becomes the default index after MySQL 5.5. Its features are as follows:

1. Support row lock and adopt mvcc to support high concurrency

2. Support services

3. Support foreign keys

4. Support safe recovery after crash

5. Full text indexing is not supported

In general, MyISAM is suitable for select intensive tables, while InnoDB is suitable for insert and update intensive tables

MyISAM may be very fast and occupy small storage space, but the program requires transaction support, so InnoDB is necessary, so the scheme cannot be implemented. Give up!

3. Zoning

The partition introduced by MySQL in version 5.1 is a simple horizontal split. Users need to add partition parameters when creating tables. It is transparent to applications and does not need to modify codes

For users, the partition table is an independent logical table, but the bottom layer is composed of multiple physical sub tables. The code to realize the partition is actually encapsulated by a group of objects of the bottom table, but for the SQL layer, it is a black box that completely encapsulates the bottom layer. The way MySQL implements partitioning also means that the index is also defined according to the sub table of the partition, and there is no global index

The user’s SQL statements need to be optimized for partition tables. Columns with partition conditions should be included in the SQL conditions, so that the query can be located on a small number of partitions. Otherwise, all partitions will be scanned. You can view through explain partitions that a SQL statement will fall on those partitions, so as to optimize SQL. I test that columns without partition conditions during query will also improve the speed, Therefore, this measure is worth trying.

The benefits of zoning are:

1. A single table can store more data

2. The data of the partition table is easier to maintain. You can delete a large amount of data in batch by clarifying the whole partition, or add a new partition to support the newly inserted data. In addition, you can also optimize, check and repair an independent partition

3. Some queries can only fall on a few partitions according to the query criteria, and the speed will be very fast

4. The data of the partition table can also be distributed on different physical devices, so it is funny to use multiple hardware devices

5. Partition table dependency can be used to avoid some special bottlenecks, such as mutually exclusive access of a single index of InnoDB and inode lock competition of ext3 file system

6. A single partition can be backed up and restored

Limitations and disadvantages of Zoning:

1. A table can only have 1024 partitions at most

2. If there are primary key or unique index columns in the partition field, all primary key columns and unique index columns must be included

3. Partition tables cannot use foreign key constraints

4. Null values invalidate partition filtering

5. All partitions must use the same storage engine

Type of partition:

1. Range partition: allocate multiple rows to the partition based on the column values belonging to a given continuous interval

2. List partition: similar to partition by range, the difference is that list partition is selected based on column values matching a value in a discrete value set

3. Hash partition: a partition selected based on the return value of a user-defined expression, which is calculated using the column values of these rows to be inserted into the table. This function can contain any valid expression in MySQL that produces a non negative integer value

4. Key partition: similar to hash partition, the difference is that key partition only supports the calculation of one or more columns, and MySQL server provides its own hash function. One or more columns must contain integer values

5. For the specific concept of MySQL partition, please Google or query the official documents by yourself. I’m just throwing a brick here to attract jade.

First, I divided the online record table range into 12 sections according to the month, and the query efficiency was improved by about 6 times. The effect was not obvious. Therefore, I changed the ID to hash and divided it into 64 sections, which significantly improved the query speed. Problem solved!

The results are as follows: partition by hash (ID) partitions 64

select count() from readroom_ website; — 11901336 line records

/   Rows affected: 0 records found: 1 warning: 0 duration 1 query: 5.734 sec  /

select * from readroom_website where month(accesstime) =11 limit 10;

/   Rows affected: 0 records found: 10 warning: 0 duration 1 query: 0.719 sec*/

4. Sub table

Table splitting is to divide a large table into multiple tables, divide a query into multiple queries, and then combine the results and return them to the user.

Split tables are divided into vertical split and horizontal split. Usually, a field is used as a split item. For example, split the ID field into 100 tables: the table name is tablename_ id%100

However, the sub table needs to modify the source code, which will bring a lot of work to the development and greatly increase the development cost. Therefore, it is only suitable for considering the existence of a large amount of data at the early stage of development and doing a good job in the sub table processing. It is not suitable for the modification after the application goes online. The cost is too high!!! Moreover, choosing this scheme is not as low as choosing the second and third scheme I provide! Therefore, it is not recommended.

5. Sub warehouse

Divide a database into multiple databases. It is recommended to separate reading and writing. The real sub database will also bring a lot of development costs, which is not worth the loss! Not recommended.

Detailed description of scheme 2: upgrade the database and change to a database 100% compatible with MySQL

MySQL performance is not good, so change it. In order to ensure that the source code is not modified and ensure the smooth migration of existing businesses, it is necessary to change a database 100% compatible with MySQL.

Open source choice

1.tiDB https://github.com/pingcap/tidb

2.Cubrid https://www.cubrid.org/

3. Open source database will bring a lot of operation and maintenance costs, and there is still a gap between its industrial quality and mysql. There are many pits to step on. If your company requires to build its own database, choose this type of product.

Cloud data selection

1. Alibaba cloud polardb

2.https://www.aliyun.com/produc…

Official introduction: polardb is the next-generation relational distributed cloud native database developed by Alibaba cloud. It is 100% compatible with MySQL, with a storage capacity of up to 100t and a performance of up to 6 times that of MySQL. Polardb not only integrates the characteristics of stability, reliability and high performance of commercial databases, but also has the advantages of simple, scalable and continuous iteration of open source databases, and the cost is only 1 / 10 of that of commercial databases.

I opened and tested it. It supports free MySQL data migration without operation cost. The performance is improved by about 10 times. The price is not much different from RDS. It is a good alternative solution!

1. Alibaba cloud ocenanbase

2. Taobao can carry the double 11 with outstanding performance, but in the public test, I can’t try, but it’s worth looking forward to

3. Alibaba cloud hybriddb for MySQL (formerly petadata)

4.https://www.aliyun.com/produc…

Official introduction: hybriddb for MySQL (formerly petadata) is an HTAP (hybrid transaction / analytical processing) relational database that supports both massive data online transaction (OLTP) and online analysis (OLAP).

I also tested it. It is an OLAP and OLTP compatible solution, but the price is too high, up to 10 yuan per hour. It is too wasteful for storage. It is suitable for the business of storage and analysis.

1. Tencent cloud DCDB

2.https://cloud.tencent.com/pro…

Official introduction: DCDB, also known as tdsql, is a high-performance distributed database that is compatible with MySQL protocol and syntax and supports automatic horizontal splitting – that is, the business is displayed as a complete logical table, but the data is evenly split into multiple slices; Each partition adopts the active and standby architecture by default, providing a full set of solutions such as disaster recovery, recovery, monitoring and non-stop capacity expansion, which is suitable for TB or Pb level massive data scenarios.

I don’t like Tencent. I won’t say much. The reason is that there is a problem, no one can be found, and the online problem can not be solved. Headache! But he is cheap and suitable for super small companies.

Detailed description of scheme 3: remove MySQL and change the big data engine to process data

The amount of data is over 100 million. There is no choice but to use big data.

Open source solutions

Hadoop family. Just connect HBase / hive. However, with high operation and maintenance costs, general companies can’t afford to play. Without 100000 investment, there will be no good output!

Cloud solutions

This is more. It is also a future trend. Big data is provided by professional companies and purchased by small companies or individuals. Like public facilities such as water / electricity, big data exists in all aspects of society.

Alibaba cloud is the best in China.

I chose Alibaba cloud maxcompute to cooperate with dataworks. It is super comfortable to use, pay as you go, and the cost is very low.

Maxcompute can be understood as an open source hive. It provides SQL / MapReduce / AI algorithm / Python script / shell script to operate data. The data is displayed in the form of tables, stored in a distributed manner, and processed in the form of scheduled tasks and batch processing. Dataworks provides a workflow way to manage your data processing tasks and schedule monitoring.

Of course, you can also choose Alibaba cloud HBase and other products. I mainly process offline here, so maxcompute is basically a graphical interface operation. About 300 lines of SQL are written, and the cost is no more than 100 yuan to solve the data processing problem.

Author: Wang Shuai
Source address:https://yq.aliyun.com/article…

There are three solutions to the optimization process of MySQL 20 million data large tables at a time!