How to use DPA together with Zabbix to analyze and locate a SQL Server tempdb explosion

Time: 2021-2-23

 

This article shares how to use Zabbix and DPA (SolarWinds Database Performance Analyzer) to analyze and locate a tempdb size explosion on SQL Server. In my experience there is no perfect monitoring tool; each one has its strengths and its blind spots. A monitoring scheme should not rely on a single tool: it is better to combine several tools so that they complement one another and cover the system from multiple angles and at multiple levels.

 

As shown below, Zabbix raised two alarms: a disk space alarm and a "could not allocate a new page" alarm:

 

G:: Disk space is low (used > 90%)

 

SQL Instance MSSQLSERVER: Could not allocate a new page

 

[screenshot: Zabbix alarm messages]

 

 

As shown in the following screenshot, disk G (a dedicated volume holding only tempdb's data files) grew by more than 100 GB in just one hour.

 

[screenshot: disk G space usage trend]

 

Note: the growth actually starts at about 2:55 and ends near 4:00, and is concentrated in the 3:00-4:00 window, so the analysis below focuses on that period.
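As a side note, when this kind of growth is observed, a quick way to see what kind of allocations are filling tempdb, independent of Zabbix or DPA, is the sys.dm_db_file_space_usage DMV. A minimal sketch:

-- Break tempdb usage down into user objects, internal objects
-- (sort/hash spills, worktables) and the version store; page counts
-- are converted to MB (8 KB pages).
USE tempdb;
SELECT
    SUM(user_object_reserved_page_count)     * 8 / 1024 AS user_objects_mb,
    SUM(internal_object_reserved_page_count) * 8 / 1024 AS internal_objects_mb,
    SUM(version_store_reserved_page_count)   * 8 / 1024 AS version_store_mb,
    SUM(unallocated_extent_page_count)       * 8 / 1024 AS free_space_mb
FROM sys.dm_db_file_space_usage;

A sort or hash spill from a runaway query shows up under internal objects, which matches what we eventually find in this analysis.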

 

 

When Zabbix raised its alarms, DPA alerted as well, indicating that performance in this period had degraded severely: the cumulative wait time of events kept climbing and the indicator turned from green to red. At the time I was dealing with another problem, did not notice the alert, and missed the best window to act; by the time I looked into it, I could no longer capture the problem SQL in real time. Fortunately, DPA had already captured the relevant SQL statements and some metric data.
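For future reference, if the problem is caught while it is still running, the tempdb allocation DMVs can identify the offending session directly. The following is a generic sketch (not taken from this incident) that sums tempdb page allocations per active session and attaches the statement each session is currently running:

-- Sum tempdb pages allocated per session (user + internal objects),
-- then attach the text of the currently running request.
;WITH alloc AS (
    SELECT session_id,
           SUM(user_objects_alloc_page_count
             + internal_objects_alloc_page_count) * 8 / 1024 AS tempdb_alloc_mb
    FROM sys.dm_db_task_space_usage
    WHERE session_id > 50          -- skip system sessions
    GROUP BY session_id
)
SELECT a.session_id, a.tempdb_alloc_mb, st.text AS running_sql
FROM alloc AS a
JOIN sys.dm_exec_requests AS r
  ON r.session_id = a.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
ORDER BY a.tempdb_alloc_mb DESC;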

 

First, we analyze the metric data for this period and drill down. We find that disk read/write latency is very high in this window, as shown in the following screenshot.

 

[screenshot: DPA disk read/write latency]

 

 

Next, add metrics such as total I/O wait time to the analysis. By default only part of the metric data is displayed; to observe other metrics you need to switch them on, as shown in the following screenshot.

 

[screenshot: adding metrics in DPA]

 

 

As shown below, the total I/O wait time is also abnormally high.

 

[screenshot: total I/O wait time]

 

 

In fact, by comparing this window against adjacent periods and against the same period on previous days, we find that under normal circumstances the database's read/write latency is basically below 10 ms. One of the screenshots is shown below:

 

[screenshot: normal read/write latency below 10 ms]
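Outside of DPA, a similar read/write latency figure can be pulled from SQL Server's own cumulative I/O statistics. A rough sketch (these are averages since instance startup, not an hourly trend like the DPA chart):

-- Average read/write latency per database file, from cumulative counters.
SELECT
    DB_NAME(vfs.database_id) AS database_name,
    mf.physical_name,
    1.0 * vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_latency_ms,
    1.0 * vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
ORDER BY avg_write_latency_ms DESC;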

 

 

Slicing by time, we see that the following two SQL statements (DPA SQL hash values 5444075766 and 3439056218) accumulated the most wait time between 3:00 and 4:00, as shown in the following screenshot:

 

[screenshot: top SQL by wait time, 3:00-4:00]
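Note that these hash values are DPA's own statement identifiers, not SQL Server query hashes, so they cannot be looked up inside SQL Server directly. A roughly comparable "top statements" view can be sketched from the plan cache (figures are cumulative since each plan was cached, not restricted to the 3:00-4:00 window):

-- Top 10 cached statements by total elapsed time.
SELECT TOP (10)
    qs.execution_count,
    qs.total_elapsed_time / 1000 AS total_elapsed_ms,
    qs.total_worker_time  / 1000 AS total_cpu_ms,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
            WHEN -1 THEN DATALENGTH(st.text)
            ELSE qs.statement_end_offset END
          - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_elapsed_time DESC;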

 

After drilling down, we rule out the SQL with hash value 5444075766 and focus on the SQL statement with hash value 3439056218, shown in the following screenshot. This statement generates a large number of wait events (listed below). Analyzing these waits, the main ones are Memory/CPU waits (the later analysis will explain why Memory/CPU waits account for the majority) and PREEMPTIVE_OS_FILEOPS waits. Based on experience, this SQL is very likely the culprit, because it produces a large number of preemptive waits and its PREEMPTIVE_OS_FILEOPS wait time is very high. Generally, when many PREEMPTIVE_OS_FILEOPS waits occur, a large amount of I/O is probably being performed, most likely because tempdb space is being allocated and consumed. Of course, this is not absolute and cannot be treated as strict causal reasoning.

 

PREEMPTIVE_OS_GETDISKFREESPACE

PREEMPTIVE_OS_FILEOPS

PREEMPTIVE_OS_WRITEFILEGATHER

Memory/CPU

 

 

 

[screenshot: wait events of SQL hash 3439056218]

 

The PREEMPTIVE_OS_FILEOPS wait event occurs when a thread calls one of several Windows functions related to the file system; it is a fairly generic wait type. Drilling down from the "Waits" dimension, we find that this wait event comes mainly from the SQL statement with hash value 3439056218 rather than from other statements, which further strengthens our preliminary judgment.

 

[screenshot: drill down by the Waits dimension]
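For reference, the cumulative weight of these preemptive file-system waits at the instance level can also be checked with sys.dm_os_wait_stats (the counters accumulate from instance startup, or from the last manual clear):

-- Cumulative wait statistics for the file-system related preemptive waits.
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type IN (N'PREEMPTIVE_OS_FILEOPS',
                    N'PREEMPTIVE_OS_GETDISKFREESPACE',
                    N'PREEMPTIVE_OS_WRITEFILEGATHER')
ORDER BY wait_time_ms DESC;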

 

Analysis from another dimension (Files) again points to the SQL statement with hash value 3439056218, as shown in the following screenshots:

 

[screenshot: drill down by the Files dimension]

 

 

[screenshot: drill down by the Files dimension, continued]
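Independently of DPA's Files dimension, the current size and autogrowth settings of the tempdb files can be listed with a simple query; a sketch:

-- Size and growth settings of each tempdb file.
SELECT name, physical_name, type_desc,
       size * 8 / 1024 AS size_mb,
       CASE WHEN is_percent_growth = 1
            THEN CAST(growth AS varchar(10)) + ' %'
            ELSE CAST(growth * 8 / 1024 AS varchar(10)) + ' MB'
       END AS growth_setting
FROM tempdb.sys.database_files;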

 

 

Careful analysis shows that the SQL joins a table and a view with CROSS JOIN, but the WHERE clause contains no join condition between them; in other words, the statement degenerates into a Cartesian product. The SQL also performs sort operations, so it consumes a huge amount of tempdb space, and the Cartesian product explains why Memory/CPU waits account for the bulk of the wait time. At this point the analysis is complete: forward and reverse analysis both confirm that this SQL statement is the cause of the tempdb explosion.
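To illustrate the pattern (the table and view names below are made up, not the actual statement from the incident): a CROSS JOIN with no join condition pairs every row of one input with every row of the other, and an ORDER BY over that product forces a sort that spills into tempdb once it exceeds the query's memory grant.

-- Hypothetical illustration of the problematic pattern, not the real SQL.
SELECT t.col1, v.col2
FROM dbo.SomeTable AS t
CROSS JOIN dbo.SomeView AS v      -- no join condition: Cartesian product
ORDER BY t.col1;                  -- sorting the product spills to tempdb

-- The fix is to join on the intended condition, for example:
-- SELECT t.col1, v.col2
-- FROM dbo.SomeTable AS t
-- JOIN dbo.SomeView AS v ON v.some_key = t.some_key
-- ORDER BY t.col1;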