Using SUM with CASE WHEN to optimize statistical queries in MySQL

Time: 2021-12-28

Recently I worked on a project at my company that involved developing statistical reports. Because the data volume was relatively large, the query I originally wrote took about ten seconds over roughly 500,000 rows. Later, with guidance from my boss, I rewrote the SQL using SUM with CASE WHEN, and the query time dropped to about one second. To explain the problem and the solution concisely, I have simplified the requirements model here.

Now the database has an order table (simplified intermediate table). The table structure is as follows:

CREATE TABLE `statistic_order` (
 `oid` bigint(20) NOT NULL,
 `o_source` varchar(25) DEFAULT NULL COMMENT 'source number',
 `o_actno` varchar(30) DEFAULT NULL COMMENT 'activity number',
 `o_actname` varchar(100) DEFAULT NULL COMMENT 'participating activity name',
 `o_n_channel` int(2) DEFAULT NULL COMMENT 'mall platform',
 `o_clue` varchar(25) DEFAULT NULL COMMENT 'clue classification',
 `o_star_level` varchar(25) DEFAULT NULL COMMENT 'order star',
 `o_saledep` varchar(30) DEFAULT NULL COMMENT 'marketing department',
 `o_style` varchar(30) DEFAULT NULL COMMENT 'vehicle type',
 `o_status` int(2) DEFAULT NULL COMMENT 'order status',
 `syctime_day` varchar(15) DEFAULT NULL COMMENT 'format date by day',
 PRIMARY KEY (`oid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
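
If you want to reproduce the comparison locally, a few made-up sample rows are enough. The values below are purely illustrative and are not from the original project; only the columns used by the queries in this article are filled in:

-- Hypothetical test data for local experiments only
INSERT INTO `statistic_order` (`oid`, `o_source`, `syctime_day`) VALUES
 (1, 'CDE', '2016-05-02'),
 (2, 'SDE', '2016-05-02'),
 (3, 'CDE', '2016-05-03'),
 (4, 'PDE', '2016-05-03'),
 (5, 'SSE', '2016-05-03');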

The project requirements are as follows:

Count, for each day within a certain time period, the number of orders for each source number, where the source number corresponds to the o_source field in the table. Its possible values are CDE, SDE, PDE, CSE, and SSE.

That is, show how each source classification flows over time.

At first, I wrote the following SQL:


select S.syctime_day,
 (select count(*) from statistic_order SS where SS.syctime_day = S.syctime_day and SS.o_source = 'CDE') as 'CDE',
 (select count(*) from statistic_order SS where SS.syctime_day = S.syctime_day and SS.o_source = 'SDE') as 'SDE',
 (select count(*) from statistic_order SS where SS.syctime_day = S.syctime_day and SS.o_source = 'PDE') as 'PDE',
 (select count(*) from statistic_order SS where SS.syctime_day = S.syctime_day and SS.o_source = 'CSE') as 'CSE',
 (select count(*) from statistic_order SS where SS.syctime_day = S.syctime_day and SS.o_source = 'SSE') as 'SSE'
 from statistic_order S where S.syctime_day > '2016-05-01' and S.syctime_day < '2016-08-01' 
 GROUP BY S.syctime_day order by S.syctime_day asc;

This approach uses correlated subqueries. Without any indexes, running this SQL over roughly 550,000 rows took nearly ten minutes in the workbench before finally failing with a connection interruption. The execution plan shown by the EXPLAIN statement looks like this:

Every query performs a full table scan, and the five subqueries are marked DEPENDENT SUBQUERY, meaning they depend on the outer query. With this query mechanism, MySQL first executes the outer query to get the dates after GROUP BY, and then each subquery separately counts the CDE, SDE, and other source values for the corresponding date. You can imagine how inefficient this is.
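
To regenerate the plan, just prefix the statement with EXPLAIN. Here is a trimmed-down sketch that keeps only the CDE column for brevity; the other four subqueries produce the same pattern, with select_type = DEPENDENT SUBQUERY and, before any index exists, type = ALL:

explain
select S.syctime_day,
 (select count(*) from statistic_order SS
  where SS.syctime_day = S.syctime_day and SS.o_source = 'CDE') as 'CDE'
 from statistic_order S
 where S.syctime_day > '2016-05-01' and S.syctime_day < '2016-08-01'
 GROUP BY S.syctime_day order by S.syctime_day asc;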

After adding indexes on o_source and syctime_day, the efficiency improves greatly and the results come back in about five seconds:
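
The article does not show the exact index DDL; a minimal sketch would be the following (the index names here are my own):

-- Hypothetical index names; the article only says indexes were added on these two columns
ALTER TABLE `statistic_order`
 ADD INDEX `idx_o_source` (`o_source`),
 ADD INDEX `idx_syctime_day` (`syctime_day`);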

Looking at the execution plan, the number of rows scanned is greatly reduced and full table scans are no longer performed:

Of course, this is still not fast enough; if the data volume reached the millions, the query speed would surely be intolerable. I kept wondering whether there was a way to get all the results in a single pass, similar to traversing a List in Java and incrementing a counter whenever a row meets a certain condition, so that one full table scan would produce the entire result set. The efficiency of that should be very high. Under my boss's guidance, the "strange" combination of the SUM aggregate function with CASE WHEN ... THEN ... solves this problem effectively.
The details are as follows:


 select S.syctime_day,
 sum(case when S.o_source = 'CDE' then 1 else 0 end) as 'CDE',
 sum(case when S.o_source = 'SDE' then 1 else 0 end) as 'SDE',
 sum(case when S.o_source = 'PDE' then 1 else 0 end) as 'PDE',
 sum(case when S.o_source = 'CSE' then 1 else 0 end) as 'CSE',
 sum(case when S.o_source = 'SSE' then 1 else 0 end) as 'SSE'
 from statistic_order S where S.syctime_day > '2015-05-01' and S.syctime_day < '2016-08-01' 
 GROUP BY S.syctime_day order by S.syctime_day asc;

I will not go into detail about the usage of CASE WHEN ... THEN in MySQL here; the SQL itself is easy to follow. The rows are traversed one by one, GROUP BY groups them by date, and the SUM aggregate function adds up the values within each date. The key point is that CASE WHEN ... THEN attaches a condition to the sum: when o_source = 'CDE' the row contributes 1, otherwise 0; the same applies to 'SDE' and the other source values.
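
As a side note that is not in the original article: because COUNT ignores NULL values, the same conditional counting can also be written with COUNT and no ELSE branch; the result is identical, so which form to use is a matter of taste:

 select S.syctime_day,
 count(case when S.o_source = 'CDE' then 1 end) as 'CDE',
 count(case when S.o_source = 'SDE' then 1 end) as 'SDE',
 count(case when S.o_source = 'PDE' then 1 end) as 'PDE',
 count(case when S.o_source = 'CSE' then 1 end) as 'CSE',
 count(case when S.o_source = 'SSE' then 1 end) as 'SSE'
 from statistic_order S where S.syctime_day > '2015-05-01' and S.syctime_day < '2016-08-01'
 GROUP BY S.syctime_day order by S.syctime_day asc;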

Executing this statement took just over one second, which is quite acceptable for statistics of this dimension over more than 500,000 rows.

The execution plan shows that although the number of rows scanned has increased, only one full table scan is performed and it is a simple query, so the execution efficiency is naturally much higher:

If you have a better solution or idea for this problem, please leave a comment.

Summary

This concludes this article on using SUM and CASE WHEN to optimize statistical queries in MySQL. For more on optimizing statistical queries in MySQL, please search developeppaer's previous articles or continue browsing the related articles below. I hope you will continue to support developeppaer!