This code made the program run 200 times more efficiently, and it is worth seeing!

Time: 2021-04-30


The source code has been uploaded to GitHub.

Preface

A few days ago, the business system department complained about our data platform: during working hours, the business systems could not find the data they needed. The problem may look big or small, but it did affect normal business operation, and all technology ultimately serves the business, so no matter how difficult the fix is, it has to be made. Beyond that, in the spirit of Internet craftsmanship, we should not only make a feature work, but make it run in its optimal state.

System introduction

The whole system can be divided into three parts:

  1. Business systems: there are many business systems upstream. Their operation produces a large amount of data, scattered across many databases, most of them MySQL.
  2. Data intelligence platform: a middle-platform system that mainly provides data support services to the business systems; its lower layer connects to the data warehouse.
  3. Data warehouse: manages all the data in a unified, centralized way. Every day it extracts, transforms and loads the data generated by the business systems into the warehouse for storage.

Every day, each business system produces a large amount of data, which scheduled tasks extract, process and store into the data warehouse. While you are still asleep in the middle of the night, these scheduled tasks are running silently.

The daily data is normally required to be processed before working hours, after which the business systems can query it through the data intelligence platform. This time the data could not be queried because it still had not been processed at 10 a.m. the next day. Normally, no matter how long the chain of scheduled tasks is, it should never drag on until 10 o'clock the next morning with the data still not ready. So the next step was to find the problem and then optimize.

Task optimization

The task log showed that the data extraction for one upstream system took 3 hours, even though the data volume was only 1 million rows. Of course, this alone does not tell us whether the task can be optimized.

Looking at the task code, the logic is relatively simple: there is a source table that records the commodity information together with its category (this example is fictitious and simplified; the actual situation is more complicated, but it is easier to understand this way). In the data warehouse, the category and the commodity are stored in separate tables, with the general structure shown below.

(Figure: structure of the source table and the corresponding data warehouse tables)
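As an illustration only, the entities involved might look roughly like the sketch below. The class and field names are assumptions for this demo, not the real schema; getters and setters are omitted for brevity.

import java.util.Date;

// Hypothetical entity sketch for the demo; names and fields are assumptions.
class Source {               // one row per commodity in the source system
    Integer id;
    String name;             // commodity name
    String categoryName;     // the category is embedded as plain text
}

class Target {               // commodity row in the data warehouse
    Integer id;              // auto-increment primary key
    String name;
    Date createDate;
}

class Category {             // fixed category dictionary, cached in memory
    Integer id;
    String name;
}

class TargetCategory {       // association table: target <-> category
    Integer targetId;
    Integer categoryId;
}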

Why is such a transformation necessary? Because across a large system, generally only some basic specifications can be defined, and the details cannot be fully constrained. For example, the ID card field in system A may be named card_no while in system B it is named crdt_no (this happens all the time). Likewise, entity relationships are modeled differently: a one-to-one relationship may be stored in two associated tables or in a single table, so multiple systems end up inconsistent. This is unavoidable, because from each business system's point of view its own design keeps that system running normally.

When the data warehouse ingests raw data from multiple systems, it has to deal with these compatibility issues, which is why the conversion process shown above exists.

The reason this task takes 3 hours is that each record in the source table is converted into rows in three data warehouse tables, and those tables are associated by ID. The whole code flow is as follows.

(Figure: code flow of the conversion task)

Here is the problem: 1 million rows took 3 hours to run. So I started trying to optimize the program's execution, beginning with the following points:

  1. The categories are fixed in the system and will not change, so caching them can reduce the number of database queries (see the sketch after this list)
  2. Read more rows from the source table per batch, raising the page size from 500 to 2000
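A minimal sketch of the category cache, assuming a categoryService.selectAll() method and the Category class from the demo (both names are my assumptions, not the original code):

import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Load all categories once and keep them in memory, so the per-record lookup
// becomes a Map.get() instead of a database query.
private Map<String, Category> allCategory;

private void loadCategoryCache() {
    allCategory = categoryService.selectAll().stream()
            .collect(Collectors.toMap(Category::getName, Function.identity()));
}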

After these optimizations, efficiency improved a little, but not dramatically. (Some readers may ask: these are very basic, why weren't they done in the first place? Cough... well, for historical reasons. There was not much data at the beginning, so however the job was executed there was little difference; 10 minutes versus 20 minutes looks like a 2x gap, but since it did not affect the business systems and had always run normally, nobody considered it a problem.)

Here the data needs to be associated: we have to insert a record, get its auto-increment ID, and then insert that ID into the association table. The table structure is essentially untouchable (changing it would ripple through everything, and we would certainly be called in for "tea" the next day).

So let's first analyze why this implementation is so slow.

  1. The source table contains 1 million rows and we read 2000 per query, so the total number of queries is 1,000,000 / 2,000 = 500, which does not take much time; there is basically no room for optimization here. Even if everything were fetched in a single query, only 499 round trips would be saved (and querying that much data at once is not realistic anyway)
  2. For each batch of 2000 rows, the data is converted and then inserted row by row into the target table and the association table, i.e. 4000 single-row inserts per batch. This is without doubt the most time-consuming part. The conversion itself is necessary and happens in memory, so it costs little; what remains is roughly 1,000,000 * 2 = 2,000,000 database inserts. Can we optimize that?

The first idea is batch insertion, which effectively reduces the number of database round trips. But we can't simply batch insert here, because we need each row's auto-increment ID, so it feels like a dead end.
After a workout that evening, I put my worries aside and relaxed.
Then it hit me: the auto-increment ID is a value controlled by the database, and we know the increment step. For example, if the step is 1 and the current auto-increment ID is 1, you can be sure the ID of the next inserted record will be 2.

So can I insert one record, get its auto-increment ID, and then calculate the IDs of all the following records myself? Then I no longer need to fetch the generated ID for every single record.

There is one constraint: while the transformation and import are running, no other program may insert into the table, otherwise the IDs the program calculates will no longer match. In our case this constraint is naturally satisfied, because the warehouse table is populated only from the source table and no other task writes to it, so we can do this safely.
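The trick also relies on knowing the increment step. As a precaution, a quick check like the following could be added before the import; this is my own sketch, assuming a Spring JdbcTemplate is available, and is not part of the original task code.

// Confirm the MySQL auto-increment step is 1 before computing IDs arithmetically.
Integer step = jdbcTemplate.queryForObject(
        "SELECT @@auto_increment_increment", Integer.class);
if (step == null || step != 1) {
    throw new IllegalStateException("Unexpected auto_increment_increment: " + step);
}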

I abstracted this part of the logic into a demo and filled it with 1 million rows. The core code before optimization is as follows:

private void exportSource(){
    List<Source> sources;
    //Refresh the date. I use a field here; it would be cleaner to pass a local variable as a parameter. Forgive my laziness.
    date = new Date();
    int pageNum = 1;
    do{
        sources = sourceService.selectList(pageNum++, pageSize);
        System.out.println(sources);
        for (Source source : sources) {
            //Convert the source row into the target entity
            Target transfer = transfer(source);
            //Insert the row and get back its auto-increment ID
            targetService.insert(transfer);
            TargetCategory targetCategory = new TargetCategory();
            Category category = allCategory.get(source.getCategoryName());
            if(category != null){
                targetCategory.setCategoryId(category.getId());
            }
            targetCategory.setTargetId(transfer.getId());
            //Insert the association row
            targetCategoryService.insert(targetCategory);
        }
    }while(sources.size() > 0);
}
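
For reference, a single-row insert that returns the generated key might be mapped in MyBatis along these lines; the table and column names are assumptions from the demo, not the real schema:

import org.apache.ibatis.annotations.Insert;
import org.apache.ibatis.annotations.Options;

public interface TargetMapper {
    // Hypothetical mapping; useGeneratedKeys writes the auto-increment ID back into transfer.getId()
    @Insert("INSERT INTO target (name, create_date) VALUES (#{name}, #{createDate})")
    @Options(useGeneratedKeys = true, keyProperty = "id")
    int insert(Target target);
}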

The efficiency speaks for itself: after running for an hour it had processed only about 200,000 rows (the total run time would have been more than 5 hours), so I stopped it and optimized on this basis.

private void exportSourcev2(){
    List<Source> sources;
    //Refresh the date. I use a field here; it would be cleaner to pass a local variable as a parameter. Forgive my laziness.
    date = new Date();
    int pageNum = 1;
    Integer startId = 0;
    do{
        sources = sourceService.selectList(pageNum++, pageSize);
        List<Target> targetList = new ArrayList<>();
        List<TargetCategory> targetCategoryList = new ArrayList<>();
        for (Source source : sources) {
            //Convert the source row into the target entity
            Target transfer = transfer(source);
            //Only the very first row is inserted individually to obtain the starting auto-increment ID;
            //every following ID is calculated from it (increment step assumed to be 1)
            if(startId == 0){
                //Insert the row and get back its auto-increment ID
                targetService.insert(transfer);
                startId = transfer.getId();
            }else{
                startId++;
                //Assign the calculated ID so the association row below points to the right record
                transfer.setId(startId);
                targetList.add(transfer);
            }
            TargetCategory targetCategory = new TargetCategory();
            Category category = allCategory.get(source.getCategoryName());
            if(category != null){
                targetCategory.setCategoryId(category.getId());
            }
            targetCategory.setTargetId(transfer.getId());
            targetCategoryList.add(targetCategory);
        }
        if(targetList.size() > 0){
            targetService.insertBatch(targetList);
        }
        if(targetCategoryList.size() > 0){
            targetCategoryService.insertBatch(targetCategoryList);
        }
    }while(sources.size() > 0);
}
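
The insertBatch calls above could be backed by a MyBatis foreach statement roughly like the following; again, the table and column names are assumptions for the demo:

import java.util.List;
import org.apache.ibatis.annotations.Insert;

public interface TargetBatchMapper {
    // Hypothetical multi-row insert: one statement per batch instead of one per row,
    // writing the IDs we calculated ourselves.
    @Insert("<script>" +
            "INSERT INTO target (id, name, create_date) VALUES " +
            "<foreach collection='list' item='item' separator=','>" +
            "(#{item.id}, #{item.name}, #{item.createDate})" +
            "</foreach>" +
            "</script>")
    int insertBatch(List<Target> list);
}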

From the test results, the execution time dropped dramatically, from an estimated 5+ hours to less than 12 minutes.


Only eleven minutes? How can we be satisfied with that? Not enough!

So this poor blogger keeps racking his brain to satisfy you. You can actually feel the problem from the speed at which the SQL statements are printed: at the beginning they scroll past in a blur, but towards the end they come out one... by... one, like a car running out of fuel. Let's start there and see whether it can be optimized.

public List<Source> selectList(Integer pageNum, Integer pageSize) {
    //Paging query
    PageHelper.startPage(pageNum,pageSize);
    List<Source> sources = sourceMapper.selectList();
    return sources;
}
//Printed query SQL
==>  Preparing: SELECT * FROM source LIMIT ?, ?
==> Parameters: 998000(Long), 2000(Integer)

Let's run this SQL in Navicat to see how long it takes.
This single query took about a second. Now let's look at another one.
That one takes 70 ms, which is quite normal. The reason is that a LIMIT query with a large offset has to read all the rows up to the offset and then discard them, which is close to a full table scan, so it is slow. The query can be optimized by first locating the IDs through the primary key index and then joining back to the table, as below.

    @Select(" SELECT  t.* " +
            " FROM    ( " +
            "        SELECT  id " +
            "        FROM    source " +
            "        ORDER BY " +
            "                id " +
            "        LIMIT #{offset}, #{size} " +
            "        ) q " +
            "JOIN    source t " +
            "ON      t.id = q.id")
    List<Source> selectList(PageParam page);
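
PageParam here is just a small holder for the offset and page size; a possible shape (my assumption, since the original does not show it) is:

// Hypothetical parameter object assumed by the mapper above.
public class PageParam {
    private final int offset;
    private final int size;

    public PageParam(int pageNum, int pageSize) {
        this.offset = (pageNum - 1) * pageSize; // convert 1-based page number to row offset
        this.size = pageSize;
    }

    public int getOffset() { return offset; }
    public int getSize() { return size; }
}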

With this change, the query itself becomes much faster.
From the measured results, overall efficiency more than doubled again, from about 12 minutes down to 5 minutes.

Summary

This article improved program execution efficiency through batch insertion and by optimizing the paginated query. Comments and corrections are welcome.