A common springboot program, native and test environment are OK. After going online to the company’s cloud environment, it has been safe and sound.
Today, I restart it. I found that the restart time is not right. It is very slow. Now my palms are sweating a little. This program is mostly written by me, and it is also online.
Fortunately, it has not been promoted in the whole line. I quickly calmed down and looked back at the previous startup log. Only then did I find that the startup was very slow, but I didn’t pay attention to him when I started it.
When starting, there will be some business logic, which will be executed successively through @ order. After looking at the development environment and test environment of the machine, the startup is very fast, and the info log printing of each line is within one second.
The logs in the cloud environment are very abnormal, especially the log printing of a certain line is stuck for a long time, which takes 82 seconds.
The scope of problem positioning has been narrowed down, that is, there are problems here.
This logic is about 100 lines of code: the main job is to query the database several times, have some nested loops, and put some data into memory. That’s it.
Because the code, database data and operating system of each environment are basically similar, there is no clue at the moment.
What’s more troublesome, as we all know, online faults are not easy to locate and check breakpoints. It’s just that there is no problem with the test environment. We can’t debug idea, and we can’t add log to check online. We are most afraid of this kind of fault.
Analyze them one by one.
Nested loops cause?
First of all, I wonder if nested loops are exponentially increased or dead loops. So the data in the database is consistent, so the loop must be the same. Since there is no dead loop in the test environment, the cloud environment will certainly not be dead loop.
I eliminated the loop problem, and then I focused on suspecting the connection to the database.
If the environment is different, the network must be different. Is there a problem in the connection between the application host and the database host in the cloud environment? For more than a minute?
It was tested right away, and it wasn’t.
Poor host performance?
I then wondered if the performance of the host was too poor? However, the idea of a negative, how can the online environment be worse than the test environment.
Then I verified that the online configuration is still very good.
Then, I can only doubt whether the SQL query is too slow. Although the data are the same, this is unlikely. However, there is not much SQL in this code. I took it out and tried it one by one. Who knows that the first SQL will be stuck! It took more than a minute to run in Navicat.
The culprit has been found. This is the SQL.
select * from td_b_noderelation a where a.exroadnodeid not in ( select enroadnodeid from td_b_noderelation b where b.version=a.version );
In fact, with so many years of work experience, 99% of the reasons for the slowdown are SQL. I should have doubted it for a long time.
But then again, data is the same, how can an environment be slow and an environment fast?
SQL is slow. Of course, the first thing I think about is the execution plan.
explain select * from td_b_noderelation a where a.exroadnodeid not in ( select enroadnodeid from td_b_noderelation b where b.version=a.version );
- Cloud environment
id select_type table type possible_keys key key_len ref rows Extra 1 PRIMARY a ALL 17598 Using where 2 DEPENDENT SUBQUERY b ALL 17598 Using where
id select_type table partitions type possible_keys key key_len ref rows filtered Extra 1 SIMPLE a ALL 17382 100 1 SIMPLE <subquery2> eq_ref <auto_distinct_key> <auto_distinct_key> 93 billing.a.exRoadNodeId,billing.a.version 1 100 Using where; Not exists 2 MATERIALIZED b index PRIMARY PRIMARY 141 17382 100 Using index
Strange, the same data, how to use the cloud environment
ALLWhat about it?
Explain – select_type
select_ Type: the type of each select clause in the query
- (1) Simple (simple select, no union or subquery, etc.)
- (2) Primary (the outermost query in the subquery. If the query contains any complex sub parts, the outermost select is marked as primary)
- (3) Union (the second or subsequent select statement in a union)
- (4) Dependent Union (the second or subsequent select statement in the union, depending on the external query)
- (5) Union result (the result of union, all the selections after the beginning of the second select in the union statement)
- (6) Subquery (first select in subquery, result independent of external query)
- (7) Dependent subquery (first select in subquery, dependent on external query)
- (8) Derived (sub query of select, from clause of derived table)
- (9) Uncacheable subquery (the result of a subquery cannot be cached, the first row of the outer link must be reevaluated)
Explain – type
Type: table access mode, which indicates the way MySQL finds the required row in the table, also known as “access type”.
Common types are: all, index, range, ref, Eq_ Ref, const, system, null (from left to right, performance from poor to good)
- All: full table scan, MySQL will traverse the whole table to find matching rows
- Index: full index scan. The difference between index and all is that index type only traverses the index tree
- Range: retrieve only the rows in the given range, and select the rows with one index
- Ref: represents the join matching criteria of the above table, that is, which columns or constants are used to find values on index columns
- Eq_ Ref: similar to ref, the difference is that the index used is a unique index. For each index key value, there is only one record matching in the table. In short, primary key or unique key is used as the association condition in multi table join
- Const, system: when MySQL optimizes a part of the query and converts it to a constant, these types of access are used. If the primary key is placed in the where list, MySQL can convert the query into a constant. System is a special case of const type. When the query table has only one row, use system
- Null: MySQL decomposes statements in the optimization process, and does not even need to access the table or index when executing. For example, selecting the minimum value from an index column can be completed through a single index lookup.
Database version problem?
When I was puzzled, I thought that the cloud environment was built by someone before it went online, while the test environment was built last year. Would the database versions of the two environments be different at all?
Do it now!
Native / test environment
I’m dizzy, how did you get a 5.6 so backward version? If you want to be stable, you should use at least 5.7.
Let’s take a look at the history of these two versions
- 5.6.30 is in 2016
- 8.0.18 October 2019
PS: the latest is 8.0.19 in January 2020
With such a long history in 2016, could it not have been
eq_refThis type? This is my first doubt. However, if you look at the MSYQL official website, it is not true that this type has existed for a long time.
Can that be an optimizer problem? 5.6 wasn’t that smart?
Before 5.5, association query was always better than subquery, but after 5.5, it was not.
In 5.6 and 8.0, I know
not inIn most cases
not existsAlmost or better, but I tried one
explain select * from td_b_noderelation a where not exists ( select 1 from td_b_noderelation b where b.version=a.version and a.exRoadNodeId=b.enRoadNodeId)
However, there has been no change in the implementation plan. It is proved that the current version is different from the 5.5 version at that time, and the optimizer will optimize in most cases
What is the problem?
I rearranged my thinking and looked back at the previous execution plan. I found that I missed a very important point. The type of the test environment is
It’s not the same as creating a table statement, is it?
With this question, I
show create tablesee.
CREATE TABLE `td_b_noderelation` ( `enRoadNodeId` varchar(16) COLLATE utf8_bin NOT NULL , `exRoadNodeId` varchar(16) COLLATE utf8_bin NOT NULL , `miles` decimal(12,0) DEFAULT NULL , `version` varchar(13) COLLATE utf8_bin NOT NULL , `validTime` varchar(19) COLLATE utf8_bin DEFAULT NULL , PRIMARY KEY (`enRoadNodeId`,`exRoadNodeId`,`version`), KEY `idx_td_b_noderelation_01` (`version`) USING BTREE ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
CREATE TABLE `td_b_noderelation` ( `enRoadNodeId` varchar(16) COLLATE utf8_bin NOT NULL , `exRoadNodeId` varchar(16) COLLATE utf8_bin NOT NULL , `miles` decimal(12,0) DEFAULT NULL , `version` varchar(13) COLLATE utf8_bin NOT NULL , `validTime` varchar(19) COLLATE utf8_bin DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
I’m dizzy. It’s the person who built the table that forgot to create the primary key and index. It’s too bad.
With the establishment of the primary key, the problem can be solved easily.
This table is only 80000 data, small table, no index search is not slow. Why is it so slow?
The reason is that the autocorrelation itself is very small, Descartes has hundreds of millions, how can it not be slow.
Although this table is a small table, it is rarely checked (all put into memory), so it can not be found in general. But the database standard and working attitude is really important. Fortunately, it’s a startup problem. If it happens in the calculation of fees or transactions, it’s over.
My own level is also limited, can not see the problem at the beginning, analysis for half a day.
Through this summary, we should be alert to future generations.
Danni 3 was published in segment fault. Please indicate the source of reprint.
Written at home on May 3, 2020 after working overtime at night