Abstract: The key to processing data at the million and ten-million row scale lies in the design of the data storage scheme. Whether that design is sound directly affects CRUD performance. The overall design can consider the following aspects: data storage structure design; index design; primary key design; query scheme design.
In my opinion, the key to processing millions and tens of millions of rows lies in the design of the data storage scheme. Whether that design is sound directly affects CRUD performance. The overall design can consider the following aspects:
- Data storage structure design
- Index design
- Data primary key design
- Query scheme design
Million-level data processing scheme:
Data storage structure design
Table field design
Make table fields NOT NULL, because NULL values are hard to query and optimize and take up extra index space; a default of 0 is recommended for numeric fields.
For status-type fields such as status and type, avoid negative values such as -1; this way the column can be declared UNSIGNED, doubling its positive range.
Use TINYINT or SMALLINT instead of INT where possible, and avoid BIGINT unless necessary, because smaller types take less space.
String fields take more space than numeric fields, so prefer integers over strings where possible; in many scenarios a string can be replaced by an integer through encoding logic.
Do not set the length of string fields arbitrarily; keep them as short as the business requirements allow.
Store IP addresses as integers.
A single table should not have too many fields; fewer than 20 is recommended.
Reserve fields in advance for predictable future needs, because the larger the data volume, the more time-consuming schema changes become.
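As a minimal sketch, the points above might look like this in DDL (the table name `user_action` and its columns are hypothetical examples, not from the original text):

```sql
-- Hypothetical table illustrating the field-design points above
CREATE TABLE user_action (
    id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
    status     TINYINT UNSIGNED NOT NULL DEFAULT 0,  -- small unsigned type, no negative values
    type       TINYINT UNSIGNED NOT NULL DEFAULT 0,
    ip         INT UNSIGNED NOT NULL DEFAULT 0,      -- IP stored as an integer
    remark     VARCHAR(64) NOT NULL DEFAULT '',      -- only as long as the business needs
    ext1       VARCHAR(32) NOT NULL DEFAULT '',      -- reserved field for future use
    created_at INT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (id)
) ENGINE=InnoDB;
```

MySQL's built-in INET_ATON('127.0.0.1') and INET_NTOA(ip) convert between the dotted form and the integer form when writing and reading the ip column.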
Index design

Indexing is a space-for-time optimization strategy. Indexes designed around business needs are generally enough to cope with millions of rows; form the habit of checking your queries with EXPLAIN.
A common misconception is that more indexes are always better; in fact, every index reduces write performance.
Keep index fields as short as possible; this can save a lot of index space.
Drop foreign keys and enforce the constraints in application code instead; performance is better.
Follow the leftmost-prefix rule of composite indexes: keep the index column order consistent with the query conditions, and remove unnecessary single-column indexes.
Low-cardinality fields (those with many duplicate values) are poor index candidates. For example, indexing a gender field with only two or three distinct values is of little use.
Add indexes to fields used for sorting; since an index is already sorted, this improves query performance.
Use prefix indexes on string fields instead of indexing the full column; this can greatly reduce index space.
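A sketch of the indexing advice, reusing hypothetical table and column names (not from the original text):

```sql
-- Composite index: queries should match the leftmost prefix (status, then type)
CREATE INDEX idx_status_type ON user_action (status, type);

-- Prefix index: index only the first 16 characters of a string column
CREATE INDEX idx_remark ON user_action (remark(16));

-- Form the habit of checking the plan with EXPLAIN
EXPLAIN SELECT id FROM user_action WHERE status = 1 AND type = 2;
```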
Query scheme design

Prefer short, simple queries over complex nested queries.
Do not use SELECT *; try to query indexed fields so the query is covered by the index and avoids going back to the table.
Use LIMIT to cap the number of rows returned.
Query conditions should fall on indexes as much as possible, especially composite indexes, where the leftmost-prefix rule deserves extra attention.
Split large DELETE / INSERT operations into batches: a single huge statement locks the table and blocks other business operations, and MySQL also limits the length of a SQL statement.
Avoid MySQL functions and calculations that the application can perform instead. As several points above suggest, whatever the program can handle should not push load onto the database, because most server performance bottlenecks are in the database.
Counting rows, by performance: count(1) = count(*) > count(primary key) > count(other column).
Where the operators are interchangeable, prefer BETWEEN over IN, and IN over OR.
Avoid operators such as !=, <>, IS NULL, IS NOT NULL, IN and NOT IN, because queries using them may not be able to use indexes.
Keep SQL as simple as possible and use few joins; more than two joins is not recommended.
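A few of these query rules, sketched against the same hypothetical table:

```sql
-- Covering query: only indexed columns selected, rows capped with LIMIT,
-- BETWEEN preferred over IN where the two are interchangeable
SELECT id FROM user_action
WHERE status = 1 AND type BETWEEN 1 AND 3
LIMIT 100;

-- A large DELETE split into bounded batches to avoid a long table lock;
-- the application repeats this until 0 rows are affected
DELETE FROM user_action WHERE status = 9 LIMIT 1000;
```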
Ten-million-level data processing scheme:
Data storage structure design
At this stage the data itself has great value: besides the regular business needs there are also data-analysis needs. The data structure is now stable and modifying it is no longer an option, so optimization is generally considered from three directions: partitioning, table splitting, and database splitting.
Partitioning is a horizontal division in which the database decomposes a table into several smaller, easier-to-manage pieces according to certain rules. It is completely transparent to the application and does not affect its business logic, i.e. no code changes are needed. It allows more data to be stored, and queries and deletes can operate per partition, achieving the optimization goal. If you are considering partitioning, prepare in advance to avoid the following restrictions:
- A table can have at most 1024 partitions (8192 are supported from MySQL 5.6). In practice, avoid touching more than about 100 partitions in a single operation, because opening partitions also costs time.
- If the table has a primary key or unique index, the partitioning column must be included in every primary key and unique index.
- Foreign key constraints cannot be used in partitioned tables.
- NULL values defeat partition pruning and are placed into the default partition, so do not allow the partition field to be NULL.
- All partitions must use the same storage engine.
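A minimal partitioning sketch that respects the restrictions above (table and column names are hypothetical); note how the partition column is included in the primary key:

```sql
CREATE TABLE orders (
    id           BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    created_year SMALLINT UNSIGNED NOT NULL,          -- partition column, never NULL
    amount       INT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (id, created_year)                    -- must contain the partition column
) ENGINE=InnoDB
PARTITION BY RANGE (created_year) (
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
```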
Table splitting comes in two forms: horizontal splitting and vertical splitting.
Horizontal splitting divides a table into smaller tables with the same structure, such as table1, table2, ..., to relieve the database's read and write pressure.
Vertical splitting separates some columns into a new table, so the resulting tables have different structures. It can improve query performance on frequently used columns and reduce table-lock contention.
Obviously, table splitting requires changes to application logic, so it is generally considered only when a large data volume is foreseen early in a project. Retrofitting it later is not recommended; the cost is very high.
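The two forms of splitting can be sketched as follows (all names are hypothetical; the routing rule lives in application code, which is exactly the logic change mentioned above):

```sql
-- Horizontal split: identical structures, rows routed by the application,
-- e.g. target table = CONCAT('order_', user_id % 2)
CREATE TABLE order_0 LIKE order_template;
CREATE TABLE order_1 LIKE order_template;

-- Vertical split: move the rarely-read large column into a side table
CREATE TABLE article_base (id INT UNSIGNED PRIMARY KEY, title VARCHAR(64) NOT NULL DEFAULT '');
CREATE TABLE article_body (id INT UNSIGNED PRIMARY KEY, body MEDIUMTEXT);
```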
Database splitting generally takes the master-slave form: a database server's master node replicates to one or more slave nodes. The master handles writes and the slaves handle reads, achieving read/write separation, high availability, data backup, and other optimization goals.
Of course, the master-slave mode also has defects, such as master-slave replication lag and problems caused by oversized binlog files, which I won't detail here.
Hot/cold table isolation: historical data that is rarely queried can be moved to a separate cold database that serves queries only, relieving the pressure of the large hot table.
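A hot/cold migration can be as simple as a periodic job along these lines (the database, archive table, and cutoff timestamp are hypothetical):

```sql
-- Copy cold rows into the archive, then remove them from the hot table in batches
INSERT INTO cold_db.user_action_archive
SELECT * FROM user_action WHERE created_at < 1609459200;

DELETE FROM user_action WHERE created_at < 1609459200 LIMIT 1000;  -- repeat until done
```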
Database table primary key design
For the database primary key, I personally recommend a self-incrementing numeric ID with a time attribute (a distributed auto-increment ID generation algorithm):
- Snowflake algorithm
- Baidu's distributed ID algorithm (UidGenerator)
- Meituan's distributed ID algorithm (Leaf)
Why use these algorithms? It is related to MySQL's data storage structure.
From a business perspective
When designing the database, do not agonize over which business field to use as the primary key; such fields are only theoretically unique. For example, a book number used as the primary key is unique in theory but may be duplicated in practice. Therefore, set a business-independent auto-increment ID as the primary key, and add a uniqueness constraint on the book number.
- If the table uses an auto-increment primary key, each new record is appended after the current last index node in order; when a page is full, a new page is opened automatically. In general this improves both query and insert performance.
- For InnoDB, the primary key index stores both the index value and the row data in its leaf nodes; that is, the data file itself is organized as a B+ tree.
- If no primary key is defined, InnoDB uses a non-null unique key as the primary key; if there is no non-null unique key either, it generates a hidden 6-byte row ID. In a clustered index, rows are grouped into pages (a page is usually 16 KB). When data is inserted out of order, frequent page splits and page moves are needed to keep the B+ tree balanced, and inserts become slow. The clustered index key should therefore be a continuously increasing value rather than a random one (do not use random strings or UUIDs).
- Therefore, for an InnoDB primary key, use an incrementing integer wherever possible; it is highly efficient for both storage and queries.
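The book-number example above can be sketched as (names hypothetical):

```sql
-- Business-independent auto-increment primary key,
-- plus a uniqueness constraint on the business key
CREATE TABLE book (
    id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    book_no VARCHAR(32) NOT NULL,
    title   VARCHAR(64) NOT NULL DEFAULT '',
    PRIMARY KEY (id),
    UNIQUE KEY uk_book_no (book_no)
) ENGINE=InnoDB;
```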
MySQL interview questions
Optimization schemes for ten-million-level paged queries in MySQL
The later the page requested with LIMIT, the slower the query. This leads to two conclusions:
1. The query time of a LIMIT statement is directly proportional to the position of the starting row.
2. MySQL's LIMIT is very convenient, but it is not suitable for direct use on tables with many rows.
The example table uses InnoDB as the storage engine, with id as an auto-increment primary key, which by default is the clustered primary key index.
SELECT id FROM test LIMIT 9000000,100;
There are currently two optimization schemes: using the id from a subquery as the query condition, and using a JOIN.
1. The subquery form with id >=:
SELECT * FROM test WHERE id >= (SELECT id FROM test LIMIT 9000000,1) LIMIT 0,100;
2. The JOIN form:
SELECT * FROM test a JOIN (SELECT id FROM test LIMIT 9000000,100) b ON a.id = b.id
The two optimized queries take roughly the same time; they rely on the same principle, so the effect is almost identical. Personally, though, I suggest the JOIN form, to minimize the use of subqueries. Note: the figures above are for ten-million-level queries; at the million level the same queries are faster still.
SELECT * FROM test a JOIN (SELECT id FROM test LIMIT 1000000,100) b ON a.id = b.id
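When pages are visited sequentially (e.g. an infinite-scroll feed), a further option not covered above is keyset pagination: remember the last id of the previous page and seek past it, so the cost stays proportional to the page size no matter how deep the page is:

```sql
-- 9000000 here stands for the last id seen on the previous page
SELECT * FROM test WHERE id > 9000000 ORDER BY id LIMIT 100;
```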
Which MySQL storage engines have you used? What are their characteristics and differences?
This is a question senior developers often ask in interviews, and one we actually meet in everyday development. MySQL has many storage engines, but InnoDB and MyISAM are by far the most common; if the interviewer asks which storage engines MySQL has, those are the ones to name.
- What are their characteristics and differences?
MyISAM: the traditional table type, based on ISAM (Indexed Sequential Access Method, a standard method for storing records and files). It is not transaction-safe and does not support foreign keys. For workloads dominated by SELECT and INSERT, MyISAM is a reasonable fit.
InnoDB: a transaction-safe engine that supports foreign keys, row locks, and transactions. For workloads with many UPDATEs and INSERTs, especially under high concurrency and high QPS, InnoDB is recommended. Note: before MySQL 5.5 the default storage engine was MyISAM; from MySQL 5.5 onward the default is InnoDB.
- Differences between MyISAM and InnoDB:
- InnoDB supports transactions; MyISAM does not. In InnoDB, each SQL statement is wrapped in a transaction and auto-committed by default, which hurts throughput, so it is best to group multiple statements between BEGIN and COMMIT to form one transaction.
- InnoDB supports foreign keys; MyISAM does not.
- InnoDB uses a clustered index with a B+ tree as the index structure: the data file is bound to the (primary key) index (the table data file itself is a B+ tree organized by primary key), a primary key must exist, and lookups through it are very efficient. MyISAM uses a non-clustered index, also a B+ tree, but the index and data files are separate; index entries store pointers into the data file, and primary and secondary indexes are independent.
- InnoDB does not store a table's row count, so SELECT COUNT(*) FROM table scans the whole table. MyISAM keeps the row count in a variable, so the same statement (without a WHERE clause) just reads the variable, which is very fast.
- MyISAM supports full-text indexing and has higher query efficiency for it; InnoDB gained full-text index support in MySQL 5.6.
- InnoDB supports both table-level and row-level locks (row-level by default); MyISAM supports only table-level locks.
- An InnoDB table must have a primary key (if the user does not specify one, InnoDB finds or generates one itself); a MyISAM table can do without.
- InnoDB storage files are .frm and .ibd; MyISAM's are .frm, .MYD and .MYI.
- InnoDB: .frm is the table-definition file and .ibd is the data file.
- MyISAM: .frm is the table-definition file, .MYD is the data file, and .MYI is the index file.
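For reference, checking and switching a table's engine uses standard statements (the table name reuses the earlier example):

```sql
SHOW ENGINES;                      -- engines this server supports
SHOW TABLE STATUS LIKE 'test';     -- the Engine column shows MyISAM or InnoDB
ALTER TABLE test ENGINE = InnoDB;  -- rewrites the table under the new engine
```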
How would you optimize MySQL complex query statements?
When it comes to complex SQL optimization, most complex SQL statements arise from multi-table joins. How do we optimize them? There is in fact a routine; just follow it. Complex SQL optimization scheme:
- Check the SQL with EXPLAIN. EXPLAIN helps you analyze the performance bottleneck of a query or table structure: its output shows how your indexes and primary key are used, how the table is searched and sorted, and whether a full table scan occurs.
- Use indexed fields in query conditions wherever possible. If a table is queried on multiple conditions, prefer a composite index, minding the order of its columns.
- For multi-table association, use JOIN as much as possible and reduce subqueries. If the join field can be the primary key, use it; in general, join on indexed fields, and consider adding an index to a join field that lacks one.
- Paginate with LIMIT in batches; do not fetch everything at once.
- Absolutely avoid SELECT *; select only the specific fields needed, to cut the cost of unnecessary columns.
- Convert OR to UNION ALL where possible.
- Avoid IS NULL and IS NOT NULL where possible.
- Be careful with LIKE: patterns with a leading wildcard or wildcards on both sides do not use indexes.
- Minimize functions on columns in the conditions after WHERE, because a function on an indexed column invalidates the index.
- Avoid not-equal (!=) comparisons, because they do not use indexes.
- Using EXISTS instead of IN, and NOT EXISTS instead of NOT IN, is usually more efficient.
- Avoid the HAVING clause: HAVING filters the result set only after all rows are retrieved, which requires sorting, totaling and other work. If rows can be restricted with WHERE instead, that overhead is avoided.
- Never ORDER BY RAND().
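As a sketch of the last point: instead of ORDER BY RAND(), which sorts the whole table, seek to a random position on the primary key. This assumes ids are roughly continuous, and the resulting distribution is only approximately uniform:

```sql
SELECT t.*
FROM test t
JOIN (SELECT FLOOR(RAND() * (SELECT MAX(id) FROM test)) AS rid) r
  ON t.id >= r.rid
ORDER BY t.id
LIMIT 1;
```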