How to read and understand SQL Server query plans

Time: 2021-06-24

Query optimization is probably one of the most common tasks in SQL Server work. Since database optimization is itself a very broad topic, this article only discusses how to read SQL Server query plans when optimizing a query. After all, my knowledge of SQL Server is limited; if there are any mistakes, please point them out so they can be corrected in time.

First, open SQL Server Management Studio and enter a query statement to see how SQL Server displays the query plan.
Note: the database demonstrated in this article is one I prepared specially for a demo program; it can be downloaded from this webpage.

select v.OrderID, v.CustomerID, v.CustomerName, v.OrderDate, v.SumMoney, v.Finished
from   OrdersView as v
where  v.OrderDate >= '2010-12-1' and v.OrderDate < '2011-12-1';

Here, OrdersView is a view, defined as follows:

SELECT  dbo.Orders.OrderID, dbo.Orders.CustomerID, dbo.Orders.OrderDate,
        dbo.Orders.SumMoney, dbo.Orders.Finished,
        ISNULL(dbo.Customers.CustomerName, N'') AS CustomerName
FROM    dbo.Orders LEFT OUTER JOIN
        dbo.Customers ON dbo.Orders.CustomerID = dbo.Customers.CustomerID

For the query above, SQL Server gives the following query plan (click the [Display Estimated Execution Plan] button on the toolbar):

From this graph, we can get at least three useful pieces of information:

1. Which steps cost more. Obviously, the two steps on the right have a relatively high cost.
2. Which steps produce more data. The amount of data produced by each step is represented in the execution plan by the [line thickness], so it is easy to compare one step against another.
3. What kind of operation each step performs.

For a slow query, we usually need to know which steps are the most expensive before we can try some improvements. Generally speaking, if the problem cannot be solved by improving hardware performance or adjusting the settings of the OS and SQL Server, the remaining options are usually the following:

1. Add an index on the relevant column for [scan]-type operations.
2. Sometimes rebuilding an index may also be effective; see the details later in this article.
3. Adjust the statement structure to guide SQL Server toward another query plan.
4. Adjust the table structure (split the table or partition it).

Now let's go over some important theory, which is very helpful for understanding execution plans.

How SQL Server finds records

At this point, I have to talk about SQL Server indexes. SQL Server has two kinds of index: the clustered index and the nonclustered index. The difference is that the [clustered index] directly determines where a record is stored, so a record can be fetched directly from the clustered index. A [nonclustered index] stores two pieces of information: 1. the value of the indexed column(s); 2. the clustered index key of the corresponding record (if the table has no clustered index, a record pointer is saved instead). Therefore, finding a record through the clustered index is obviously the fastest. (A minimal T-SQL sketch of both kinds of index appears after the list below.)

SQL Server uses the following methods to find the data records you need:

1. [Table Scan]: traverse the entire table, checking row by row, to find all matching rows. This is obviously the least efficient method.
2. [Index Scan]: use an index to filter out part of the table first, then search for all matching rows within that range. The search range is smaller than in the first method, so it is faster than a [Table Scan].
3. [Index Seek]: use the index to locate the storage position of the records directly, then fetch them. It is therefore faster than the first two methods.
4. [Clustered Index Scan]: the same as a [Table Scan]. Note: do not assume this is something different just because an index appears in the name. It actually means: scan the rows one by one in clustered index order, because the rows are physically stored in the order of the clustered index. A [Table Scan] simply means that the table being scanned has no clustered index; the two operations are essentially the same.
5. [Clustered Index Seek]: fetch the records directly through the clustered index. The fastest!
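
To make these terms concrete, here is a minimal T-SQL sketch of how the two kinds of index are created. The table and index names are my own and are not part of the demo database:

create table dbo.Demo
(
    Id   int          not null,
    Name nvarchar(50) not null
);

-- The clustered index determines the physical storage order of the rows,
-- so a lookup on Id can jump straight to the record: [clustered index seek].
create unique clustered index IX_Demo_Id on dbo.Demo (Id);

-- The nonclustered index stores the Name value plus the clustered index key,
-- so a lookup on Name finds the key first and then fetches the row: [index seek].
create nonclustered index IX_Demo_Name on dbo.Demo (Name);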

Therefore, when you find that a query is slow, first check which steps cost more, and then see whether those steps use [Table Scan] or [Clustered Index Scan] to find records. If so, consider adding an index. However, adding an index also affects modifications to the table, because the index on the corresponding column must be updated whenever the data changes, so too many indexes can hurt performance as well. There is another situation where adding an index does not help: a column that only holds the states 0 or 1. For example, if most rows hold 1, an index on that column is meaningless; in that case you can only consider storing the 0 rows and the 1 rows separately, and splitting the table or partitioning it is a good choice.
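
For instance, the query at the start of this article scans the Orders table because OrderDate has no index. A nonclustered index like the following (the index name is my own invention) would let SQL Server seek on the date range instead:

create nonclustered index IX_Orders_OrderDate
    on dbo.Orders (OrderDate);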

If the problem cannot be solved by adding indexes or adjusting tables, you can try adjusting the statement structure to guide SQL Server toward another query plan. This requires that you: 1. know clearly what the statement is supposed to do; 2. know the structure of the tables being queried; 3. know the relevant business background. Solving the problem this way is of course also a good option. However, SQL Server is sometimes quite clever: even if you adjust the statement structure, the execution plan may not change at all.

How do you compare the performance of two SQL statements with the same function? I suggest two methods:
1. Put the two query statements into the same SQL Server Management Studio window and look at their execution plans; SQL Server shows the cost of each query as a percentage of the batch. This method is simple and usable as a reference, but it is sometimes inaccurate; for the reason, keep reading (the index statistics may be too old).
2. Write test code that mimics the real program's calls (see the sketch below). This method is more troublesome, but it better represents the real call pattern and the results are more valuable, so it is worth it.
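
For the second method, here is a sketch of a simple measurement session, assuming the OrdersView query from earlier and a hand-written equivalent against the base tables:

set statistics time on;   -- report CPU and elapsed time per statement
set statistics io on;     -- report logical/physical reads per table

-- candidate 1: query through the view
select v.OrderID, v.CustomerName
from OrdersView as v
where v.OrderDate >= '2010-12-1' and v.OrderDate < '2011-12-1';

-- candidate 2: query the base tables directly
select o.OrderID, isnull(c.CustomerName, N'') as CustomerName
from dbo.Orders as o
left outer join dbo.Customers as c on o.CustomerID = c.CustomerID
where o.OrderDate >= '2010-12-1' and o.OrderDate < '2011-12-1';

set statistics time off;
set statistics io off;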

SQL Server join methods

In SQL Server, every JOIN is carried out internally by one of three more specific physical methods:

1. [Nested Loops Join]. If one join input is small and the other is large and has an index on its join column, the indexed nested loops join is the fastest join operation, because it requires the least I/O and the fewest comparisons.

A nested loops join, also called nested iteration, uses one join input as the outer input table (shown as the top input in the graphical execution plan) and the other as the inner (bottom) input table. The outer loop processes the outer input table row by row. For each outer row, the inner loop runs and searches the inner input table for matching rows. It can be understood with the following pseudo-code:

foreach (row r1 in outer_table)
    foreach (row r2 in inner_table)
        if (r1 and r2 satisfy the join condition)
            output(r1, r2);

In the simplest case, the whole table or index is scanned during the search; this is called a naive nested loops join. If the search uses an index, it is called an index nested loops join. If the index is built as part of the query plan (and destroyed as soon as the query completes), it is called a temporary index nested loops join. The query optimizer considers all of these variants.

Nested loops joins are especially effective when the outer input is small and the inner input is large and has a pre-created index. In many small transactions, such as those touching only a small set of rows, index nested loops joins are superior to merge joins and hash joins. In large queries, however, nested loops joins are usually not the best choice.

2. [Merge Join]. If the two join inputs are not small but are already sorted on their join columns (for example, because they were obtained by scanning sorted indexes), then a merge join is the fastest join operation. If both inputs are large and roughly the same size, a presorted merge join offers performance similar to a hash join. If the two inputs differ greatly in size, however, a hash join is usually much faster.

A merge join requires both inputs to be sorted on the merge columns, which are defined by the equality (ON) clauses of the join predicate. Typically the query optimizer scans an index, if one exists on the appropriate set of columns, or places a sort operator below the merge join. In rare cases there may be several equality clauses, but only some of them are used to obtain the merge columns.

Since each input is sorted, the merge join operator takes one row from each input and compares them. For an inner join, for example, the rows are returned if they are equal. If they are not equal, the row with the smaller value is discarded and the next row is taken from that input. This process repeats until all rows have been processed.

The merge join operation can be a regular operation or a many-to-many operation. A many-to-many merge join uses a temporary table to store rows (which hurts efficiency). If either input contains duplicate values, then while processing each duplicate in one input, the other input must be rewound to the start of the duplicates. You can create a unique index to tell SQL Server that there will be no duplicate values.
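
A sketch of that last idea, assuming a hypothetical CustomerCode column that the join runs on: declaring it unique lets the optimizer plan a regular merge join instead of the slower many-to-many variant.

-- CustomerCode is a hypothetical column used only for illustration
create unique nonclustered index IX_Customers_CustomerCode
    on dbo.Customers (CustomerCode);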

If there is a residual predicate, every row that satisfies the merge predicate is also evaluated against the residual predicate, and only the rows that satisfy it are returned.

The merge join itself is fast, but if sorting is required first, choosing a merge join can be very time-consuming. However, if the data volume is large and the required data can be obtained presorted from an existing B-tree index, the merge join is usually the fastest join algorithm available.

3. [Hash Join]. Hash joins can efficiently process large, unsorted, non-indexed inputs. They are useful for the intermediate results of complex queries because: 1. intermediate results are not indexed (unless they have been explicitly saved to disk and then indexed) and are usually not sorted suitably for the next operation in the query plan; 2. the query optimizer only estimates the size of intermediate results, and for complex queries the estimate can be far off, so an algorithm that processes intermediate results must not only be efficient but also degrade gracefully when the intermediate result turns out to be much larger than expected.

Hash joins can reduce the need for denormalization. Denormalization usually gains performance by reducing join operations, despite the risk of redundancy (such as inconsistent updates). Hash joins reduce the need to denormalize. They also make vertical partitioning (representing groups of columns of a single table in separate files or indexes) a viable option for physical database design.

A hash join has two inputs: the build input and the probe input. The query optimizer assigns these roles so that the smaller of the two inputs is the build input.

Hash joins are used for many kinds of set-matching operations: inner join; left outer, right outer, and full outer join; left semi-join and right semi-join; intersection; union; and difference. In addition, a variant of the hash join can perform duplicate removal and grouping, such as SUM(Salary) GROUP BY Department. These variants use a single input for both the build and probe roles.
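
For example, the grouping variant can be requested explicitly with the HASH GROUP query hint; a sketch with a hypothetical Employees table:

select DepartmentId, sum(Salary) as TotalSalary
from dbo.Employees
group by DepartmentId
option (hash group);   -- ask the optimizer to use hash aggregation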

Hash joins come in three types: the in-memory hash join, the grace hash join, and the recursive hash join.

In-memory hash join: the hash join first scans or computes the entire build input and then builds a hash table in memory. Each row is inserted into a hash bucket based on the hash value computed for its hash key. If the entire build input is smaller than the available memory, all rows fit into the hash table. The build phase is followed by the probe phase. The entire probe input is scanned or computed one row at a time; for each probe row, the hash key value is computed, the corresponding hash bucket is scanned, and matches are produced.

Grace hash join: if the build input is larger than memory, the hash join proceeds in several steps; this is called a grace hash join. Each step has a build phase and a probe phase. First, the entire build input and probe input are consumed and partitioned (using a hash function on the hash keys) into multiple files. Using a hash function on the hash keys guarantees that any two joining records must fall into the same pair of files. Thus, joining two large inputs is reduced to multiple smaller instances of the same task. The hash join is then applied to each pair of partition files.

Recursive hash join: if the build input is so large that a standard external merge would require multiple merge levels, multiple partitioning steps and multiple partitioning levels are required. If only some of the partitions are large, additional partitioning steps are used only for those partitions. To make all partitioning steps as fast as possible, large asynchronous I/O operations are used so that a single thread can keep multiple disk drives busy.

It is not always possible to determine during optimization which kind of hash join will be used. Therefore, SQL Server starts with an in-memory hash join and gradually switches to a grace hash join or a recursive hash join depending on the size of the build input.
If the optimizer mispredicts which of the two inputs is smaller, and therefore which should be the build input, the build and probe roles are reversed dynamically. A hash join always makes sure the smaller overflow file is used as the build input. This technique is called role reversal. Role reversal occurs in a hash join only after at least one file has spilled to disk.

Note: you can also specify the join method explicitly, and SQL Server will respect your choice as far as possible. For example, you can write: INNER LOOP JOIN, LEFT OUTER MERGE JOIN, INNER HASH JOIN.
However, I still suggest you do not do this, because SQL Server's choice is basically correct; try it yourself if you do not believe it.
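
For the curious, here is a sketch of what such hints look like, written against the Orders and Customers tables from the demo database:

-- force a hash join
select o.OrderID, c.CustomerName
from dbo.Orders as o
inner hash join dbo.Customers as c on o.CustomerID = c.CustomerID;

-- force a merge join on a left outer join
select o.OrderID, c.CustomerName
from dbo.Orders as o
left outer merge join dbo.Customers as c on o.CustomerID = c.CustomerID;

-- force a nested loops join
select o.OrderID, c.CustomerName
from dbo.Orders as o
inner loop join dbo.Customers as c on o.CustomerID = c.CustomerID;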

Well, that was a lot of theory. Let's illustrate it with another practical example.

A more detailed view of the execution process

Earlier I showed a picture reflecting the execution plan SQL Server uses to run a query, but the information it reflects may not be detailed enough. Of course, you can move the mouse pointer over a node, and the following information appears:

It happens that I have the Chinese edition installed, so the tooltip is full of Chinese text and I will not dwell on it. What I want to show instead is another way of displaying the execution process, one that contains more information than this and reflects the actual execution. (Of course, you can also stay in graphical mode: click the [Include Actual Execution Plan] button on the toolbar before running the query.)

Let's return to SQL Server Management Studio, enter the following statements, and execute them.

set statistics profile on

select v.OrderID, v.CustomerID, v.CustomerName, v.OrderDate, v.SumMoney, v.Finished
from   OrdersView as v
where  v.OrderDate >= '2010-12-1' and v.OrderDate < '2011-12-1';

Note: a new statement, "set statistics profile on", has been added. The result is as follows:

As you can see from the picture, executing the query now yields two tables. The upper table shows the query results; the lower table shows the query's execution process. Compared with the first picture in this article, this view may be less intuitively friendly, but it reflects more information, and it can actually be easier to read for complex queries, where the graphical plan has so many steps that the graph becomes too large to take in. Moreover, this execution process table provides two very valuable pieces of data (the first two columns).

Let's look at the execution process table. I will pick out the important columns.
[Rows]: the number of rows actually produced by an execution step (actual data, not an estimate).
[Executes]: the number of times the execution step was executed (actual data, not an estimate).
[Stmt Text]: a description of the step being executed.
[EstimateRows]: the number of rows the step is estimated to return.

In this execution process table, I think the first three columns matter most for query optimization. I explained the first two columns above, and their meaning is clear; their numbers also roughly reflect the cost of each step, so pay attention to them for slow queries. [Stmt Text] tells you what each step does. A table of this kind actually expresses a tree (each row corresponds to one node in the graphical mode), so I suggest reading the rows from the innermost level outward. As an example, let me explain the execution process expressed by this table.

Row 5: [Clustered Index Seek(OBJECT:([MyNorthwind].[dbo].[Customers].[PK_Customers]), SEEK:([MyNorthwind].[dbo].[Customers].[CustomerID]=[MyNorthwind].[dbo].[Orders].[CustomerID]) ORDERED FORWARD)]. It means SQL Server performs a seek on the Customers table using a [clustered index seek]; the index is [PK_Customers], and the seek value comes from [Orders].[CustomerID].

Row 4: [Clustered Index Scan(OBJECT:([MyNorthwind].[dbo].[Orders].[PK_Orders]), WHERE:([MyNorthwind].[dbo].[Orders].[OrderDate]>='2010-12-01 00:00:00.000' AND [MyNorthwind].[dbo].[Orders].[OrderDate]<'2011-12-01 00:00:00.000'))]. It means SQL Server scans the Orders table; this is the worst case, equivalent to a [table scan], because there is no index on the OrderDate column and SQL Server has no other option.

Row 3: [Nested Loops(Left Outer Join, OUTER REFERENCES:([MyNorthwind].[dbo].[Orders].[CustomerID]))]. It means SQL Server joins the data produced by rows 4 and 5 with a [nested loops] join; the outer table is Orders, and the matching operation is the seek shown in row 5.

Row 2: [Compute Scalar(DEFINE:([Expr1006]=isnull([MyNorthwind].[dbo].[Customers].[CustomerName],N'')))]. It means a call to the ISNULL() function is executed; for the reason, refer to the view definition given earlier in this article.

Row 1: [SELECT [v].[OrderID], [v].[CustomerID], [v].[CustomerName], [v].[OrderDate], [v].[SumMoney], [v].[Finished] FROM [OrdersView] [v] WHERE [v].[OrderDate]>=@1 AND [v].[OrderDate]<@2]. The first row is usually the whole query and represents its return value.

Index statistics: the basis for choosing a query plan

We have been talking about execution plans. Since it is a plan, it is an operation scheme that can be determined before actual execution. So how does SQL Server choose an execution plan? How does SQL Server know when to use an index, and which index to use? Whenever a query is about to run, SQL Server first checks whether an execution plan for that query already exists in the cache; if not, it has to generate one. When generating an execution plan, SQL Server does not merely look at which indexes exist (there is no random selection); instead, it consults a kind of data called index statistics. If you look carefully at the earlier execution plan or execution process table, you will notice that SQL Server can estimate the amount of data produced by each step. It is precisely because SQL Server can estimate these amounts that it can choose the most suitable method to execute the query, and it is the index statistics that supply this information. Are you a little curious by now? To give you a concrete feel for index statistics, let's see what they look like. Enter the following statement in SQL Server Management Studio and execute it.


dbcc show_statistics (Products, IX_CategoryID)

The results are as follows:

First, a word about the command: [dbcc show_statistics] displays the index statistics we want to inspect. It takes two parameters: 1. the table name; 2. the index name.

Now let's look at the result of the command. It consists of three tables:
1. The first table lists the main information about the statistics object.

Column: Description
Name: the name of the statistics object.
Updated: the date and time the statistics were last updated.
Rows: the number of rows in the table.
Rows Sampled: the number of rows sampled to build the statistics.
Steps: the number of groups the data is divided into; corresponds to the third table.
Density: the selectivity of the first index column prefix (excluding EQ_ROWS).
Average key length: the average length of all index columns.
String Index: if YES, the statistics contain a string summary index to support estimating the result-set size of LIKE conditions. Applies only to leading columns of the char, varchar, nchar, nvarchar, varchar(max), nvarchar(max), text, and ntext data types.

2. The second table lists the selectivity of various column combinations. The smaller the value, the less duplication there is and the higher the selectivity.

Column: Description
All density: the selectivity of the index column prefix set (including EQ_ROWS). Note: the smaller the value, the higher the selectivity. A value below 0.1 indicates relatively high selectivity; otherwise the selectivity is low.
Average length: the average length of the index column prefix set.
Columns: the names of the index column prefixes for which All density and Average length are displayed.

3. The third table is a histogram of the data distribution; SQL Server uses it to estimate the number of rows for certain execution steps.

Column: Description
RANGE_HI_KEY: the upper bound (highest key value) of each group.
RANGE_ROWS: the estimated number of rows in each group, excluding the upper bound.
EQ_ROWS: the estimated number of rows whose value equals the upper bound.

To help you understand these numbers, especially those in the third table, see the following figure:

When filling in the test data, I deliberately used CategoryId values 1 through 8 (10 was added temporarily later), and each group was filled with 78 rows. So the data in the third table of the index statistics is correct. It is based on these statistics that SQL Server can estimate the amount of data for each execution step, which in turn influences decisions such as the choice of join method. Of course, when choosing a join method, the selectivity figures in the second table are also consulted. When SQL Server generates an execution plan for a query, the query optimizer uses these statistics together with the available indexes to evaluate the cost of each alternative and pick the best plan.

Another example shows the importance of statistics for query planning. First, add some more data; see the following code:

declare @newCategoryId int;
insert into dbo.Categories (CategoryName) values(N'Test statistics');
set @newCategoryId = scope_identity();

declare @count int;
set @count = 0;
while( @count < 100000 )
begin
    insert into Products (ProductName, CategoryID, Unit, UnitPrice, Quantity, Remark)
    values (cast(newid() as nvarchar(50)), @newCategoryId, N'number', 100, @count + 1, N'');
    set @count = @count + 1;
end
go

update statistics Products;
go

Then let's look at the index statistics again:

Now let's look at the same query; because the query parameter values differ, SQL Server chooses different execution plans:

select p.ProductId, t.Quantity
from Products as p
left outer join [Order Details] as t on p.ProductId = t.ProductId
where p.CategoryId = 26;    -- 26 is the newly generated CategoryId, so this query returns 100,000 records

select p.ProductId, t.Quantity
from Products as p
left outer join [Order Details] as t on p.ProductId = t.ProductId
where p.CategoryId = 6;     -- this query returns 95 records

As the figure above shows, SQL Server chooses completely different execution plans for different CategoryId parameter values. The importance of statistics is very clear here.

After statistics are created, the database engine sorts the values of the column on which they were created and builds a histogram from those values (with at most 200 steps, separated by interval boundaries). The histogram specifies how many rows exactly match each boundary value, how many rows fall within each interval, and the density, i.e. the incidence of duplicate values, within each interval.

SQL Server 2005 collects additional information for statistics created on char, varchar, varchar(max), nchar, nvarchar, nvarchar(max), text, and ntext columns. This information, called a string summary, helps the query optimizer estimate the selectivity of query predicates on string patterns. When a query contains a LIKE condition, the string summary allows a more accurate estimate of the result-set size and continual refinement of the query plan. These conditions include forms such as WHERE ProductName LIKE '%Bike' and WHERE Name LIKE '[CS]heryl'.

Since index statistics are so important, when are they generated or updated? In fact, we do not need to maintain index statistics manually; SQL Server maintains them automatically, and there are database options that control how they are updated.

How the automatic statistics features work

When an index is created, the query optimizer automatically stores statistics about the indexed columns. In addition, when the AUTO_CREATE_STATISTICS database option is set to ON (the default), the database engine automatically creates statistics for predicate columns that do not have an index.

As the data in a column changes, the index and column statistics may become outdated, causing the query optimizer to pick a less-than-ideal way to process the query. For example, if you create a table with an indexed column and 1,000 rows of data, all with unique values in that column, the query optimizer treats the indexed column as a good way to collect the query's data. If the data is later updated so that the column contains many duplicates, the column is no longer an ideal candidate for the query. However, the query optimizer still judges the index a good candidate based on the outdated distribution statistics, which reflect the data before the update.

When the AUTO_UPDATE_STATISTICS database option is set to ON (the default), the query optimizer automatically updates statistics periodically as the data in a table changes. A statistics update is initiated whenever statistics used in a query's execution plan fail a test against the current data. Sampling is random across the data pages and is taken from the table or from the smallest nonclustered index on the columns the statistics cover. Once a data page has been read from disk, all rows on that page are used to update the statistics. Roughly speaking, statistics are updated when about 20% of the data rows have changed. However, the query optimizer always tries to sample as few rows as possible; for tables smaller than 8 MB, a full scan is always performed to gather the statistics.
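
If you suspect statistics are stale, you can check when they were last updated with the built-in STATS_DATE() function; a sketch against the Products table from the demo database:

select name, stats_date(object_id, stats_id) as last_updated
from sys.stats
where object_id = object_id('dbo.Products');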

Sampling the data (instead of analyzing all of it) minimizes the cost of automatic statistics updates. In some cases, however, sampling cannot capture the precise characteristics of the data in the table. The SAMPLE and FULLSCAN clauses of the UPDATE STATISTICS statement can be used to control, table by table, how much data is sampled when statistics are updated manually. The FULLSCAN clause scans all data in the table to gather statistics, while the SAMPLE clause specifies either a percentage of rows or a number of rows to sample.
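
A sketch of both clauses against the demo Products table:

-- scan every row to build exact statistics
update statistics Products with fullscan;

-- or sample only part of the table to save time
update statistics Products with sample 25 percent;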

In SQL Server 2005, the database option AUTO_UPDATE_STATISTICS_ASYNC provides asynchronous statistics updates. When this option is ON, queries do not wait for statistics to be updated before compiling: the outdated statistics are placed in a queue and updated by a worker thread in a background process, while the query and any other concurrent queries compile immediately using the existing, outdated statistics. Because there is no delay waiting for updated statistics, query response times are predictable, but the outdated statistics may cause the query optimizer to choose an inefficient plan. Queries that start after the updated statistics are ready use those statistics. This may cause cached plans that depend on the older version of the statistics to be recompiled. Asynchronous statistics cannot be updated if certain data definition language (DDL) statements (such as CREATE, ALTER, and DROP) appear in the same explicit user transaction.

The AUTO_UPDATE_STATISTICS_ASYNC option is set at the database level and determines the update method for all statistics in the database. It applies only to statistics updates; it cannot be used to create statistics asynchronously. Setting it to ON only has an effect when AUTO_UPDATE_STATISTICS is also set to ON. By default, AUTO_UPDATE_STATISTICS_ASYNC is OFF.
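
A sketch of how these options are set, assuming the demo database name MyNorthwind that appears in the plan text above:

alter database MyNorthwind set AUTO_UPDATE_STATISTICS ON;
alter database MyNorthwind set AUTO_UPDATE_STATISTICS_ASYNC ON;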

From the description above, you can see that for large tables the statistics may still not be updated in time, and this can mislead the query optimizer.
Some people have had this experience: for certain slow queries, they think of rebuilding the index, and it helps. This actually makes sense. Sometimes a query suddenly slows down because the statistics were not updated in time, which misled the query optimizer; rebuilding the index at that point lets the query optimizer see the latest data distribution, avoiding the problem. Do you remember the execution process table that I displayed with [set statistics profile on]? It shows both the actual row count and the estimated row count for each step. We can use [set statistics profile on] to decide whether rebuilding an index is warranted: if the actual row counts differ greatly from the estimated ones, consider updating the statistics manually and then trying again.
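
A sketch of the two remedies, using the Orders table from the demo database:

-- refresh the statistics only (the cheaper option)
update statistics dbo.Orders;

-- or rebuild all indexes on the table, which also rebuilds their statistics with a full scan
alter index all on dbo.Orders rebuild;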

Optimizing view queries

Finally, let's talk about optimizing view queries. Although a view is also defined by a query statement and is essentially a query, optimizing it differs from optimizing an ordinary query statement. The main difference is that a view is seldom used exactly as defined: most of the time it is used with an extra WHERE clause, or placed inside another statement's FROM clause. Analyzing only the view's defining query therefore tells us little. Let's take an example. My demo database contains the view OrdersView, whose definition appeared earlier. Let's see what execution plan comes out if the view is used directly:

You can see that SQL Server performs a full scan of the Orders table, which should be quite inefficient. Now look at the following query:

This execution plan differs from the previous one. The previous query located rows in the Orders table with a [clustered index scan]; now a [clustered index seek] is used, and the cost percentages of the two rightmost steps have changed. This is enough to show that when optimizing a view, it is best to apply the filter conditions actually used in practice and only then decide how to optimize.

Next, let's look at this view's execution plan under three different queries.

select * from dbo.OrdersView where OrderId = 1;
select * from dbo.OrdersView where CustomerId = 1;
select * from dbo.OrdersView where OrderDate >= '2010-12-1' and OrderDate < '2011-12-1';

Obviously, for the same view, the execution plans under different filter conditions differ markedly.

Recommended reading – MSDN article

Index statistics
http://msdn.microsoft.com/zh-cn/library/ms190397(SQL.90).aspx

Query optimization suggestions
http://msdn.microsoft.com/zh-cn/library/ms188722(SQL.90).aspx

A checklist for analyzing slow-running queries
http://msdn.microsoft.com/zh-cn/library/ms177500(SQL.90).aspx

Logical and physical operator references
http://msdn.microsoft.com/zh-cn/library/ms191158(SQL.90).aspx