Database optimization for large data volumes and high concurrency: SQL query optimization

Time: 2020-10-01

1、 Design of database structure

If we cannot design a reasonable database model, it will not only make the client and server programs harder to write and maintain, but will also hurt the actual performance of the system. Therefore, a complete database model must be designed before a system is implemented.

In the system analysis and design stage, the amount of data is small and the load is low, so we often pay attention only to implementing the functions and find it hard to notice performance weaknesses. Only after the system has been in production for a while do we discover that its performance is degrading; by then, improving performance costs far more manpower and material resources, and the whole system inevitably turns into a patchwork project.

Therefore, when designing the whole system, we must consider whether it will behave correctly in extreme situations with high concurrency and large data volumes. (For example, the abnormal data in the external statistics system on July 16 was caused by concurrent access to a large amount of data: the response time of the database could not keep up with the rate at which data were refreshed. Specifically, at the date boundary (00:00:00) the system checks whether the database already contains a record for the current date and inserts one if it does not. Under low concurrency this works; but when the access volume around the date boundary is large, several requests can pass the check at the same time, several records for the current date are inserted, and the data become wrong.) Once the database model is settled, it is necessary to draw a system data-flow diagram and analyze the possible bottlenecks.
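As a sketch of one way to close such a race, assuming a hypothetical daily_stats(stat_date) table (the names are illustrative, not taken from the system described above): a unique constraint makes concurrent duplicate inserts fail instead of silently succeeding, and the check and the insert can be combined into one atomic statement:

-- duplicate dates now raise an error instead of producing a second row
ALTER TABLE daily_stats ADD CONSTRAINT uq_daily_stats_date UNIQUE (stat_date)

-- check-then-insert as a single statement; the lock hints serialize concurrent checkers
INSERT INTO daily_stats (stat_date)
SELECT CONVERT(date, GETDATE())
WHERE NOT EXISTS (
    SELECT 1 FROM daily_stats WITH (UPDLOCK, HOLDLOCK)
    WHERE stat_date = CONVERT(date, GETDATE())
)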

To ensure the consistency and integrity of the database, the logical design tends to decompose the data into many tables so as to reduce redundancy as much as possible. (For example, the region of the user table can be stored in a separate region table.) Low redundancy makes data integrity easy to guarantee, improves data throughput, and expresses the relationships between data elements clearly. However, joins across many tables (especially large ones) perform worse and make the client program harder to write, so the physical design has to compromise. Based on the business rules, the data volume, and the access frequency of the joined tables, data redundancy should be deliberately increased for tables that are frequently joined in queries. Reasonable data redundancy is necessary to improve the system's response time; designers should weigh the type and frequency of system operations in the design phase.

In addition, it is better not to use auto-increment fields as the primary keys that child tables reference: they are inconvenient for system migration and data recovery, and in the external statistics system this caused mapping relationships to be lost.

The original table must be reconstructible from the tables split off from it. The benefit of this rule is that you can be sure you do not introduce extra columns into the separated tables, and that every table structure you create is no larger than it really needs to be. Applying this rule is a good habit, though you will rarely need it unless you are dealing with a very large amount of data. (For example, in a pass system, userid, username and userpassword can be separated into one table, with userid then used as the foreign key of the other tables.)
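A minimal sketch of that split, using the names from the example (the column types and the user_profile table are illustrative):

CREATE TABLE users (
    userid       INT          NOT NULL PRIMARY KEY,  -- stable key referenced by the other tables
    username     VARCHAR(50)  NOT NULL,
    userpassword VARCHAR(64)  NOT NULL
)

CREATE TABLE user_profile (
    userid INT         NOT NULL REFERENCES users(userid),  -- foreign key back to users
    region VARCHAR(50) NULL
    -- ... other profile columns; the original wide table can always be rebuilt by joining back
)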

The design of the table should pay attention to the following problems:

1. The length of a data row should not exceed 8,060 bytes (the in-row limit in SQL Server). Beyond this length the data spans two physical pages, causing storage fragmentation and reducing query efficiency.

2. Wherever a field can be numeric, choose a numeric type rather than a string type (phone numbers, for example). String fields reduce query and join performance and increase storage overhead, because the engine has to compare the two strings character by character, whereas a numeric value needs only one comparison. (A sketch combining points 2 to 4 follows the list.)

3. The fixed-length character type char and the variable-length character type varchar both have a maximum of 8,000 bytes. char is faster to query but consumes more storage; varchar is somewhat slower to query but saves storage. Choose flexibly when designing fields: char suits fields whose length varies little, such as user names and passwords, while varchar suits fields whose length varies widely, such as comments.

4. Field lengths should be set as short as the maximum foreseeable need allows. This improves query efficiency and reduces resource consumption when building indexes.
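A minimal sketch pulling points 2 to 4 together (the table and column names are illustrative):

CREATE TABLE customer (
    customer_id INT          NOT NULL PRIMARY KEY,
    phone       BIGINT       NULL,        -- numeric instead of string: one comparison instead of char-by-char
    username    CHAR(16)     NOT NULL,    -- length barely varies: char queries faster
    remark      VARCHAR(500) NULL         -- length varies widely: varchar saves storage; kept as short as needs allow
)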

 

2、 Query optimization

On the basis of implementing the required functions: reduce the number of database accesses as much as possible; reduce the number of rows visited in each table by using search arguments, and minimize the result set, so as to lighten the network burden; split operations apart where possible so that each one responds quickly; when using SQL in a data window, try to put the index's leading column first in the selection; keep the structure of the algorithm as simple as possible; do not use broad wildcards such as select * from T1 - if you need only a few columns, select just those columns, e.g. select col1, col2 from T1; and where possible limit the number of rows in the result set, e.g. select top 300 col1, col2, col3 from T1, because in many cases users do not need that much data.

Without an index, when the database looks for a piece of data it must scan the whole table, traversing every row to find the qualifying records. With a small amount of data the difference may not be noticeable, but with a large amount of data this situation is extremely bad.

How are SQL statements executed in SQL Server? Some people do not know, and worry that their SQL statements will be misunderstood by SQL Server. For example:

select * from table1 where name = 'zhangsan' and tID > 10000

and:

select * from table1 where tID > 10000 and name = 'zhangsan'

Some people do not know whether these two statements execute with the same efficiency, because read naively they really are different: if tID is a clustered index, the second statement only has to look at the rows with tID beyond 10000, whereas the first statement would first search the whole table for names equal to 'zhangsan' and then filter the result with the condition tID > 10000.

In fact, such concerns are unnecessary. SQL Server contains a query optimizer that evaluates the search conditions in the WHERE clause and decides which indexes can shrink the search space of a table scan; that is, it optimizes automatically. But although the optimizer can rewrite a query based on the WHERE clause, it will not always perform the fast query you intended.

During query analysis, the optimizer looks at each stage of the query and decides whether it can limit the amount of data to be scanned. A stage that can be used as a search argument (SARG) is said to be optimizable, and an index can then be used to fetch the required data quickly.

Definition of a SARG: an operation used to restrict the search, usually a match against a specific value, a match within a range of values, or an AND combination of two or more such conditions. Its form is:

column operator <constant or variable>   or   <constant or variable> operator column

The column name appears on one side of the operator and the constant or variable on the other. For example:

name = 'zhangsan'

price > 5000

5000 < price

name = 'zhangsan' and price > 5000

If an expression does not have the SARG form, it cannot limit the scope of the search: SQL Server has to evaluate every row to determine whether it satisfies the WHERE clause. So an index is useless for expressions that are not SARGs.

Therefore, the key to optimizing a query is to make its statements conform to the optimizer's rules as far as possible, so that indexes are used instead of full table scans.

Points to pay attention to:

1. Try to avoid testing a field for NULL in the WHERE clause, otherwise the engine will give up the index and scan the whole table:

select id from t where num is null

Instead, give num a default value of 0, make sure the column never contains NULL, and query like this:

select id from t where num=0

2. Try to avoid using the != or <> operator in the WHERE clause, otherwise the engine will abandon the index and scan the whole table: the optimizer cannot determine the number of matching rows through the index, so it has to examine every row, as sketched below.
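Where the excluded value splits the column's domain cleanly, the inequality can sometimes be rewritten as two range conditions that are each a SARG. A sketch, reusing t and num from the surrounding examples:

select id from t where num <> 100

-- can often be rewritten as two range predicates that can each use the index:
select id from t where num < 100
union all
select id from t where num > 100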

3. Try to avoid using OR to join conditions in the WHERE clause, otherwise the engine will abandon the index and scan the whole table:

select id from t where num=10 or num=20

You can query as follows:

select id from t where num=10

union all

select id from t where num=20

4. IN and NOT IN should also be used with caution, because IN can prevent the system from using an index and force it to search the table directly. For example:

select id from t where num in(1,2,3)

For continuous values, use between instead of in

select id from t where num between 1 and 3

5. Try to avoid searching indexed character data with a non-leading pattern; this also prevents the engine from using the index.

See the following example:

SELECT * FROM T1 WHERE NAME LIKE '%L%'

SELECT * FROM T1 WHERE SUBSTRING(NAME,2,1) = 'L'

SELECT * FROM T1 WHERE NAME LIKE 'L%'

Even if the NAME field is indexed, the first two queries cannot use the index to speed up the operation; the engine has to process every row in the table one by one. The third query can use the index to speed up the operation.

6. Using local variables in the WHERE clause also causes a full table scan, and it may then be necessary to force the optimizer to use an index. SQL resolves local variables only at run time, but the optimizer must choose an access plan at compile time; at that moment the variable's value is still unknown and cannot be used as an input for index selection. The following statement scans the whole table:

select id from t where num = @num

You can force the query to use the index instead (index_name here stands for the name of the index to force):

select id from t with (index(index_name)) where num = @num

7. Try to avoid applying expressions to a field in the WHERE clause; this causes the engine to abandon the index and scan the whole table. For example:

SELECT * FROM T1 WHERE F1/2=100

It should be replaced by:

SELECT * FROM T1 WHERE F1=100*2

SELECT * FROM RECORD WHERE SUBSTRING(CARD_NO,1,4) = '5378'

It should be replaced by:

SELECT * FROM RECORD WHERE CARD_NO LIKE '5378%'

SELECT member_number, first_name, last_name FROM members

WHERE DATEDIFF(yy, dateofbirth, GETDATE()) > 21

It should be replaced by:

SELECT member_number, first_name, last_name FROM members

WHERE dateofbirth < DATEADD(yy,-21,GETDATE())

In other words, any operation on a column (database functions, calculated expressions, and so on) results in a table scan; when writing a query, move the operation to the right-hand side of the comparison operator wherever possible.

8. Likewise, try to avoid applying functions to a field in the WHERE clause; this causes the engine to abandon the index and scan the whole table. For example:

select id from t where substring(name,1,3) = 'abc'  -- IDs whose name starts with 'abc'

select id from t where datediff(day, createdate, '2005-11-30') = 0  -- IDs created on 2005-11-30

It should be replaced by:

select id from t where name like 'abc%'

select id from t where createdate >= '2005-11-30' and createdate < '2005-12-1'

9. Do not apply functions, arithmetic, or other expressions to the left side of "=" in the WHERE clause, otherwise the system may not be able to use the index correctly.

10. When using an index field as a condition, if the index is a composite index, the query must use the first field of the index as a condition to guarantee that the index is used; otherwise the index will not be used. The order of fields in the query should also match the order of fields in the index as far as possible. A short sketch follows.
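A minimal sketch of the leading-column rule, reusing t, num and name from the earlier examples (the index name is illustrative):

create index ix_t_num_name on t (num, name)

select id from t where num = 10 and name = 'zhangsan'  -- can use the index: leading column num present
select id from t where num = 10                        -- can use the index: still starts with num
select id from t where name = 'zhangsan'               -- cannot seek: leading column num missing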

11. In many cases, using EXISTS is a good choice:

select num from a where num in (select num from b)

Replace with the following statement:

select num from a where exists(select 1 from b where num=a.num)

SELECT SUM(T1.C1) FROM T1
WHERE (SELECT COUNT(*) FROM T2 WHERE T2.C2 = T1.C2) > 0

SELECT SUM(T1.C1) FROM T1
WHERE EXISTS (SELECT * FROM T2 WHERE T2.C2 = T1.C2)

Both produce the same result, but the latter is clearly more efficient, because it does not generate large numbers of locked table scans or index scans.

If you want to check whether a record exists in a table, do not use COUNT(*); it is inefficient and wastes server resources. Use EXISTS instead. For example:

IF (SELECT COUNT(*) FROM table_name WHERE column_name = 'xxx') > 0

It can be written as:

IF EXISTS (SELECT * FROM table_name WHERE column_name = 'xxx')

You often need to write a T-SQL statement that compares a parent result set with a child result set to find records that exist in the parent but not in the child, for example:

SELECT a.hdr_key FROM hdr_tbl a  -- "a" is an alias for hdr_tbl

WHERE NOT EXISTS (SELECT * FROM dtl_tbl b WHERE a.hdr_key = b.hdr_key)

SELECT a.hdr_key FROM hdr_tbl a

LEFT JOIN dtl_tbl b ON a.hdr_key = b.hdr_key WHERE b.hdr_key IS NULL

SELECT hdr_key FROM hdr_tbl

WHERE hdr_key NOT IN (SELECT hdr_key FROM dtl_tbl)

All three methods produce the same correct result, but their efficiency decreases in turn.

12. Try to use table variables instead of temporary tables. Note, however, that indexes on table variables are very limited (only the primary-key index), so be careful when a table variable would contain a large amount of data.

13. Avoid frequent creation and deletion of temporary tables to reduce the consumption of system table resources.

14. Temporary tables are not forbidden; using them appropriately can make some routines more efficient, for example when you need to reference a data set from a large table or a commonly used table repeatedly. For one-off events, however, an export table is better.

15. When creating a temporary table, if a large amount of data is inserted at once, use SELECT INTO instead of CREATE TABLE followed by INSERT, to avoid generating a large volume of log records and improve speed. If the amount of data is small, create the table first and then insert, in order to ease contention on the system tables.

16. If temporary tables are used, explicitly delete them all at the end of the stored procedure: TRUNCATE TABLE first, then DROP TABLE, which avoids long locks on the system tables. A sketch covering points 15 and 16 follows.
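A minimal sketch of points 15 and 16 together, reusing t and num from the earlier examples (#tmp is illustrative):

-- large data volume: SELECT INTO creates and fills the temporary table in one step
select id, num into #tmp from t where num > 100

-- ... work with #tmp ...

-- explicit cleanup at the end of the stored procedure
truncate table #tmp
drop table #tmp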

17. SET NOCOUNT ON at the beginning of every stored procedure and trigger, and SET NOCOUNT OFF at the end. There is no need to send a DONE_IN_PROC message to the client after each statement of a stored procedure or trigger.
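A minimal skeleton of this convention (the procedure name and body are illustrative):

create procedure usp_example
as
begin
    set nocount on   -- suppress the per-statement DONE_IN_PROC messages
    select id from t where num = 10
    set nocount off
end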

18. Try to avoid large transaction operation and improve the concurrent ability of the system.

19. Try to avoid returning large amount of data to the client. If the amount of data is too large, we should consider whether the corresponding demand is reasonable.

20. Avoid using incompatible data types. For example, float and int, char and varchar, binary and varbinary are incompatible. Incompatibility of data types may prevent the optimizer from performing some optimizations that could have been done. For example:

SELECT name FROM employee WHERE salary > 60000

In this statement, if the salary field is of type money, it is hard for the optimizer to optimize it, because 60000 is an integer literal. The integer should be converted to the money type when the program is written, rather than leaving the conversion to run time.
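A sketch of making the conversion explicit at programming time, assuming the employee table from the example and that salary is of type money:

SELECT name FROM employee WHERE salary > CAST(60000 AS money)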

21. Make full use of join conditions. In some cases there may be more than one possible join condition between two tables; writing all of them out in the WHERE clause may greatly improve the query speed.

For example:

SELECT SUM(A.AMOUNT) FROM ACCOUNT A,CARD B WHERE A.CARD_NO = B.CARD_NO

SELECT SUM(A.AMOUNT) FROM ACCOUNT A,CARD B WHERE A.CARD_NO = B.CARD_NO AND A.ACCOUNT_NO=B.ACCOUNT_NO

The second statement will execute much faster than the first.

22. Use views to speed up queries

Sorting a subset of a table and creating a view can sometimes speed up queries: it helps avoid repeated sort operations and also simplifies the optimizer's work in other ways. For example:

SELECT cust.name, rcvbles.balance, ... other columns

FROM cust, rcvbles

WHERE cust.customer_id = rcvbles.customer_id

AND rcvbles.balance > 0

AND cust.postcode > '98000'

ORDER BY cust.name

If this query is to be executed more than once, all the outstanding customers can be found once, placed in a view, and sorted by customer name (the TOP 100 PERCENT is needed for SQL Server to accept an ORDER BY inside a view definition):

CREATE VIEW DBO.V_CUST_RCVLBES

AS

SELECT TOP 100 PERCENT cust.name, cust.postcode, rcvbles.balance, ... other columns

FROM cust, rcvbles

WHERE cust.customer_id = rcvbles.customer_id

AND rcvbles.balance > 0

ORDER BY cust.name

Then query the view as follows:

SELECT * FROM V_CUST_RCVLBES

WHERE postcode > '98000'

The view contains fewer rows than the main table, and its physical order matches the required order, so disk I/O is reduced and the query workload is much smaller.

23. If you can use distinct, you don’t need group by

SELECT OrderID FROM Details WHERE UnitPrice > 10 GROUP BY OrderID

It can be changed to:

SELECT DISTINCT OrderID FROM Details WHERE UnitPrice > 10

24. If UNION ALL will do, do not use UNION

UNION ALL does not perform the implicit SELECT DISTINCT step, which saves a great deal of unnecessary resource consumption

25. Try not to use the SELECT INTO statement.

The SELECT INTO statement locks the table and prevents other users from accessing it while it runs.

The points above are some basic considerations for improving query speed; in many more cases you will need to experiment and compare different statements to find the best solution. The most reliable method is of course to test: among SQL statements that implement the same function, see which takes the least execution time. When the database holds only a small amount of data, however, timing comparisons are meaningless; in that case examine the execution plan instead. Send the candidate statements with the same function to Query Analyzer and press Ctrl+L to view the plan; look at the indexes used and the number of table scans (these two have the greatest impact on performance), and the overall query cost percentage is also worth checking.
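When the data volume is large enough to measure, SET STATISTICS gives hard numbers for such comparisons. A minimal sketch, reusing t and num from the examples above:

set statistics io on    -- report logical reads per statement in the Messages pane
set statistics time on  -- report compile and execution times

select id from t where num in (1,2,3)        -- candidate 1
select id from t where num between 1 and 3   -- candidate 2: compare reads and times

set statistics io off
set statistics time off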

3、 Optimization of algorithm

Try to avoid using cursors, because cursors are inefficient; if a cursor operates on more than 10,000 rows of data, consider rewriting it. Before adopting a cursor-based or temporary-table method, look for a set-based solution first: set-based methods are usually more effective. However, like temporary tables, cursors are not forbidden. For small data sets, a FAST_FORWARD cursor is generally superior to other row-by-row processing methods, especially when several tables must be referenced to obtain the required data. Routines that compute "totals" in the result set are usually faster than cursor-based equivalents. If development time allows, try both the cursor-based and the set-based method and see which works better.

Cursors provide a means of scanning a specific result set row by row; they are generally used to traverse the data and carry out different operations depending on each extracted row. A cursor loop defined over multiple tables or a large table (a large data set) can easily make the program enter a long wait or even crash.

In some cases a cursor is genuinely necessary. Then consider copying the qualifying rows into a temporary table first and defining the cursor over the temporary table; performance improves noticeably.

(e.g. the first edition of the internal statistics system)
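A minimal sketch contrasting the two styles on a hypothetical orders table (all names are illustrative):

-- cursor version: row-by-row discount for one customer
declare @id int
declare cur cursor fast_forward for
    select id from orders where customer_id = 42
open cur
fetch next from cur into @id
while @@fetch_status = 0
begin
    update orders set amount = amount * 0.9 where id = @id
    fetch next from cur into @id
end
close cur
deallocate cur

-- set-based version: one statement does the same work
update orders set amount = amount * 0.9 where customer_id = 42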

Where such logic recurs, encapsulate it in stored procedures.

4、 Building efficient indexes

There are two purposes in creating an index: to maintain the uniqueness of the indexed columns and to provide a fast way to access the data in the table. Large databases have two kinds of index, clustered and nonclustered. A table without a clustered index stores its data as a heap, with new data appended at the end of the table; a table with a clustered index stores its data physically in the order of the clustered index key, and a table may have only one clustered index. Because of the underlying B-tree structure, adding any kind of index improves the speed of queries on the indexed column, but reduces the performance of insert, update, and delete operations, especially when the fill factor is large. For tables with many indexes and frequent inserts, updates, and deletes, set a smaller fill factor when creating the tables and indexes, leaving more free space on each data page and reducing the work of page splits and reorganization.

Indexes are one of the most efficient ways to retrieve data from a database; 95% of database performance problems can be solved with indexing techniques. As a rule, I usually use a unique clustered index for the logical primary key, unique nonclustered indexes for system keys (such as those used by stored procedures), and a nonclustered index for every foreign key column. But indexes are like salt: too much makes the dish inedible. You have to consider how much space the database has, how the tables are accessed, and whether those accesses are mainly reads or writes.

In fact, you can think of an index as a special kind of catalog. Microsoft SQL Server provides two kinds: clustered indexes and nonclustered indexes. Let us illustrate the difference with an example.

The body of a Chinese dictionary is itself a clustered index. For example, to look up the character 安 (an), we naturally open the first few pages of the dictionary: its pinyin is "an", and a dictionary of Chinese characters sorted by pinyin starts with the letter "a" and ends with "z", so 安 falls near the front. If you finish all the parts beginning with "a" and still cannot find the character, the dictionary does not contain it. Likewise, to look up 张 (zhang), you turn to the back of the dictionary, because its pinyin is "zhang". In other words, the body of the dictionary is itself a directory; you do not need a separate table of contents to find what you are looking for.

A body of text that is itself arranged according to certain rules, so that it serves as its own directory, is called a "clustered index".

If you know a character, you can find it quickly from the body directly. But you may encounter a character you do not recognize and cannot pronounce. Then the method above fails: you must look the character up by its radical, and then turn to the page given after the entry. The ordering of characters found this way, by combining the "radical directory" and the "character check list", is not the real ordering of the body text. For example, looking up 张 (zhang) this way, you may see that its page number in the check list after the radicals is 672; the character above it in the list, 驰 (chi), is on page 63, while the character below it, 弩 (nu), is on page 390. Clearly these characters are not really located just above and below 张 in the body; the consecutive sequence "chi, zhang, nu" you see is their ordering in the nonclustered index, a mapping of the dictionary body into the nonclustered index. Finding a character this way takes two steps: first find the entry in the directory, then turn to the page it points to.

An arrangement in which the directory is purely a directory and the body is purely the body is called a "nonclustered index".

Further, it is easy to see why each table can have only one clustered index: the body can be physically sorted in only one way.

(1) When to use clustered or nonclustered indexes

The following table summarizes when to use a clustered or a nonclustered index; it is important.

Action description                          Use clustered index    Use nonclustered index

Columns often grouped and sorted            Should                 Should
Returning data in a given range             Should                 Should not
One or very few distinct values             Should not             Should not
Small number of distinct values             Should                 Should not
Large number of distinct values             Should not             Should
Frequently updated columns                  Should not             Should
Foreign key columns                         Should                 Should
Primary key columns                         Should                 Should
Frequently modified index columns           Should not             Should

 

In fact, the table above can be understood through the definitions of clustered and nonclustered indexes given earlier. Take "returning data in a given range". Suppose a table has a time column with the clustered index built on it. A query for all data between January 1, 2004 and October 1, 2004 will be very fast, because the body of your "dictionary" is sorted by date: the clustered index only needs to find the first and last rows of the range. With a nonclustered index, by contrast, the page number of every item must first be looked up in the directory, and the content then fetched page by page.
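A sketch of such a range query, with illustrative table and column names:

create clustered index ix_events_date on events (event_date)

select * from events
where event_date between '2004-01-01' and '2004-10-01'  -- one seek to the start of the range, then pages read in order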

 

(2) Misunderstandings about index use

Theory exists for the sake of application. Although we have just listed when to use which kind of index, in practice these rules are easily ignored or cannot be weighed comprehensively against the actual situation. Below we discuss common misunderstandings of index use, based on problems met in practice, to help you master the method of building indexes.

1. A primary key is a clustered index

This idea is extremely wrong; it wastes the clustered index, even though SQL Server does build a clustered index on the primary key by default.

Usually we create an ID column in each table to distinguish the rows; the ID auto-increments, generally with a step of 1. This is the case with the column gid in our office automation example. If we make this column the primary key, SQL Server defaults it to the clustered index. This has the advantage that the data are physically sorted by ID in the database, but I do not think that is of much use.

The advantage of a clustered index is obvious, and the rule that each table allows only one makes the clustered index all the more precious.

From the definition given earlier, the biggest advantage of a clustered index is that it quickly narrows the query range according to the query's needs and avoids a full table scan. In practice, because the ID number is generated automatically, we do not know the ID of each record, so the ID is rarely useful as a query condition; making it the clustered index therefore wastes the resource. Secondly, a field in which every ID value is different also violates the rule that "a clustered index should not be built on a column with a large number of distinct values". Admittedly, this drawback matters mainly when record contents, and especially index entries, are frequently modified; it has no impact on query speed.

In the office automation system, whether for the documents awaiting the user's signature shown on the front page, for meetings, or for the user's file queries, the field "date" and the user's own "user name" are indispensable in every data query.

Usually, the home page of the office automation system displays the documents or meetings each user has not yet signed for. Although the WHERE clause can restrict the query to what the current user has not signed, if the system has been running for a long time and holds a large amount of data, then every time a user opens the home page a full table scan is performed. That is of little value: most users have already browsed the files from a month ago, so scanning them only adds to the database's cost. Instead, the system could query only the files the user has not read in the past three months, using the "date" field to limit the table scan and improve query speed. If your office automation system has been running for two years, your home page will then display eight times faster, or even more.

2. As long as an index is built, query speed will improve significantly

In fact, in the example above, statements 2 and 3 are exactly the same and the indexed field is the same; the only difference is that the former builds a nonclustered index on the fariqi field while the latter builds a clustered index on it, yet the query speeds differ enormously. So simply indexing an arbitrary field does not necessarily improve query speed.

From the CREATE TABLE statement we can see that the fariqi field of this table of 10 million rows has 5,003 distinct values, so a clustered index on this field is very appropriate. In reality we send several files every day, and files sent on the same day share a date, which fully conforms to the rule for a clustered index: "neither mostly identical values nor almost entirely distinct values". From this point of view, building the "right" clustered index is very important for query speed.

3. Adding all fields whose queries need speeding up into the clustered index improves query speed

As mentioned above, the fields "date" and "user name" are indispensable in data queries. Since both are so important, we can combine them into a single composite index.

Many people think that adding any field to the clustered index will speed up queries, while others are unsure: if the fields of a composite clustered index are queried separately, does the query slow down? To answer this, consider the following query speeds (the result set is 250,000 rows), with the date column fariqi placed first in the composite clustered index and the user name column neibuyonghu placed last.

We can see that if only the leading column of the clustered index is used as the query condition, the speed is almost the same as when all columns of the composite index are used, even slightly faster than using all of them (when the result set sizes are equal); but if only a non-leading column of the composite index is used as the query condition, the index has no effect at all. Of course, statements 1 and 2 show the same speed because the number of rows queried is the same. If all columns of the composite index are used and the result set is small, "index coverage" is achieved and performance is optimal. Remember: whether or not you often use the other columns of the clustered index, its leading column must be the most frequently used one. A sketch follows.
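A minimal sketch of the composite clustered index described above, reusing the fariqi and neibuyonghu column names from the text (the table name and the 'office' value are illustrative):

create clustered index ix_docs_fariqi_user on docs (fariqi, neibuyonghu)

select gid from docs where fariqi > '2004-05-05'                             -- uses the index: leading column present
select gid from docs where fariqi > '2004-05-05' and neibuyonghu = 'office'  -- uses the index: all columns, possible index coverage
select gid from docs where neibuyonghu = 'office'                            -- index has no effect: leading column missing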

(3) Other considerations

"Water can carry a boat, but it can also capsize it"; indexes are the same. Indexes improve retrieval performance, but too many or improperly chosen indexes cause system inefficiency, because every index the user adds to a table means more work for the database. Too many indexes can even lead to index fragmentation.

Therefore, we should build an "appropriate" set of indexes, and strive to get the clustered index in particular exactly right, so that the database delivers high performance.