The springboot program starts very slowly, and analyzes the process in detail

Time:2020-10-9

Problem description

A common springboot program, native and test environment are OK. After going online to the company’s cloud environment, it has been safe and sound.

Today, I restart it. I found that the restart time is not right. It is very slow. Now my palms are sweating a little. This program is mostly written by me, and it is also online.

Fortunately, it has not been promoted in the whole line. I quickly calmed down and looked back at the previous startup log. Only then did I find that the startup was very slow, but I didn’t pay attention to him when I started it.

problem analysis

When starting, there will be some business logic, which will be executed successively through @ order. After looking at the development environment and test environment of the machine, the startup is very fast, and the info log printing of each line is within one second.

The logs in the cloud environment are very abnormal, especially the log printing of a certain line is stuck for a long time, which takes 82 seconds.

The scope of problem positioning has been narrowed down, that is, there are problems here.

This logic is about 100 lines of code: the main job is to query the database several times, have some nested loops, and put some data into memory. That’s it.

Problem orientation

Because the code, database data and operating system of each environment are basically similar, there is no clue at the moment.

What’s more troublesome, as we all know, online faults are not easy to locate and check breakpoints. It’s just that there is no problem with the test environment. We can’t debug idea, and we can’t add log to check online. We are most afraid of this kind of fault.

Analyze them one by one.

Nested loops cause?

First of all, I wonder if nested loops are exponentially increased or dead loops. So the data in the database is consistent, so the loop must be the same. Since there is no dead loop in the test environment, the cloud environment will certainly not be dead loop.

Network induced?

I eliminated the loop problem, and then I focused on suspecting the connection to the database.
If the environment is different, the network must be different. Is there a problem in the connection between the application host and the database host in the cloud environment? For more than a minute?
It was tested right away, and it wasn’t.

Poor host performance?

I then wondered if the performance of the host was too poor? However, the idea of a negative, how can the online environment be worse than the test environment.

Then I verified that the online configuration is still very good.

SQL causes?

Then, I can only doubt whether the SQL query is too slow. Although the data are the same, this is unlikely. However, there is not much SQL in this code. I took it out and tried it one by one. Who knows that the first SQL will be stuck! It took more than a minute to run in Navicat.

The culprit has been found. This is the SQL.

select * from td_b_noderelation a where a.exroadnodeid not in (
    select enroadnodeid from td_b_noderelation b 
    where b.version=a.version
);

In fact, with so many years of work experience, 99% of the reasons for the slowdown are SQL. I should have doubted it for a long time.
But then again, data is the same, how can an environment be slow and an environment fast?

Implementation plan?

SQL is slow. Of course, the first thing I think about is the execution plan.

explain select * from td_b_noderelation a where a.exroadnodeid not in (
    select enroadnodeid from td_b_noderelation b 
    where b.version=a.version
);
  • Cloud environment
id    select_type    table    type    possible_keys    key    key_len    ref    rows    Extra
1    PRIMARY    a    ALL                    17598    Using where
2    DEPENDENT SUBQUERY    b    ALL                    17598    Using where
id    select_type    table    partitions    type    possible_keys    key    key_len    ref    rows    filtered    Extra
1    SIMPLE    a        ALL                    17382    100    
1    SIMPLE    <subquery2>        eq_ref    <auto_distinct_key>    <auto_distinct_key>    93    billing.a.exRoadNodeId,billing.a.version    1    100    Using where; Not exists
2    MATERIALIZED    b        index    PRIMARY    PRIMARY    141        17382    100    Using index

Strange, the same data, how to use the cloud environmentALLWhat about it?

Explain – select_type

select_ Type: the type of each select clause in the query

  • (1) Simple (simple select, no union or subquery, etc.)
  • (2) Primary (the outermost query in the subquery. If the query contains any complex sub parts, the outermost select is marked as primary)
  • (3) Union (the second or subsequent select statement in a union)
  • (4) Dependent Union (the second or subsequent select statement in the union, depending on the external query)
  • (5) Union result (the result of union, all the selections after the beginning of the second select in the union statement)
  • (6) Subquery (first select in subquery, result independent of external query)
  • (7) Dependent subquery (first select in subquery, dependent on external query)
  • (8) Derived (sub query of select, from clause of derived table)
  • (9) Uncacheable subquery (the result of a subquery cannot be cached, the first row of the outer link must be reevaluated)

Explain – type

Type: table access mode, which indicates the way MySQL finds the required row in the table, also known as “access type”.
Common types are: all, index, range, ref, Eq_ Ref, const, system, null (from left to right, performance from poor to good)

  • All: full table scan, MySQL will traverse the whole table to find matching rows
  • Index: full index scan. The difference between index and all is that index type only traverses the index tree
  • Range: retrieve only the rows in the given range, and select the rows with one index
  • Ref: represents the join matching criteria of the above table, that is, which columns or constants are used to find values on index columns
  • Eq_ Ref: similar to ref, the difference is that the index used is a unique index. For each index key value, there is only one record matching in the table. In short, primary key or unique key is used as the association condition in multi table join
  • Const, system: when MySQL optimizes a part of the query and converts it to a constant, these types of access are used. If the primary key is placed in the where list, MySQL can convert the query into a constant. System is a special case of const type. When the query table has only one row, use system
  • Null: MySQL decomposes statements in the optimization process, and does not even need to access the table or index when executing. For example, selecting the minimum value from an index column can be completed through a single index lookup.

Database version problem?

When I was puzzled, I thought that the cloud environment was built by someone before it went online, while the test environment was built last year. Would the database versions of the two environments be different at all?

Do it now!

  • Cloud version

    • select @@version
    • 5.6.30-log
  • Native / test environment

    • select version()
    • 8.0.18

I’m dizzy, how did you get a 5.6 so backward version? If you want to be stable, you should use at least 5.7.

Let’s take a look at the history of these two versions

  • 5.6.30 is in 2016
  • 8.0.18 October 2019

PS: the latest is 8.0.19 in January 2020

With such a long history in 2016, could it not have beeneq_refThis type? This is my first doubt. However, if you look at the MSYQL official website, it is not true that this type has existed for a long time.

Not in?

Can that be an optimizer problem? 5.6 wasn’t that smart?
Before 5.5, association query was always better than subquery, but after 5.5, it was not.
In 5.6 and 8.0, I knownot inIn most casesnot existsAlmost or better, but I tried onenot exists

explain select * from td_b_noderelation a where not exists (
    select 1 from td_b_noderelation b 
    where b.version=a.version and a.exRoadNodeId=b.enRoadNodeId)

However, there has been no change in the implementation plan. It is proved that the current version is different from the 5.5 version at that time, and the optimizer will optimize in most casesnot inandnot existsYes.

What is the problem?

I rearranged my thinking and looked back at the previous execution plan. I found that I missed a very important point. The type of the test environment isindex

It’s not the same as creating a table statement, is it?
With this question, Ishow create tablesee.

CREATE TABLE `td_b_noderelation` (
  `enRoadNodeId` varchar(16) COLLATE utf8_bin NOT NULL ,
  `exRoadNodeId` varchar(16) COLLATE utf8_bin NOT NULL ,
  `miles` decimal(12,0) DEFAULT NULL ,
  `version` varchar(13) COLLATE utf8_bin NOT NULL ,
  `validTime` varchar(19) COLLATE utf8_bin DEFAULT NULL ,
  PRIMARY KEY (`enRoadNodeId`,`exRoadNodeId`,`version`),
  KEY `idx_td_b_noderelation_01` (`version`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin 
CREATE TABLE `td_b_noderelation` (
  `enRoadNodeId` varchar(16) COLLATE utf8_bin NOT NULL ,
  `exRoadNodeId` varchar(16) COLLATE utf8_bin NOT NULL ,
  `miles` decimal(12,0) DEFAULT NULL ,
  `version` varchar(13) COLLATE utf8_bin NOT NULL ,
  `validTime` varchar(19) COLLATE utf8_bin DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin 

I’m dizzy. It’s the person who built the table that forgot to create the primary key and index. It’s too bad.
With the establishment of the primary key, the problem can be solved easily.

last

doubt

This table is only 80000 data, small table, no index search is not slow. Why is it so slow?

The reason is that the autocorrelation itself is very small, Descartes has hundreds of millions, how can it not be slow.

summary

Although this table is a small table, it is rarely checked (all put into memory), so it can not be found in general. But the database standard and working attitude is really important. Fortunately, it’s a startup problem. If it happens in the calculation of fees or transactions, it’s over.

My own level is also limited, can not see the problem at the beginning, analysis for half a day.

Through this summary, we should be alert to future generations.

Danni 3 was published in segment fault. Please indicate the source of reprint.
https://juejin.im/post/5eaeb1…

Written at home on May 3, 2020 after working overtime at night

Recommended Today

The course of using Chinese software of poedit Pro

Poedit pro (formerly known as poedit) is a free (professional version charge), open source and cross platform gettext class (. Po format, gettext is used in the application program for program internationalization) International Translation editor. It is also one of the most widely used software of the same type. At present, it is available in […]