Around 4:00 p.m. one day, I received a message from a partner team: in production, an HTTP service I maintain had suddenly started timing out heavily (the caller's timeout is set to 300ms). I went straight to the Eagle Eye monitoring platform to sample the traffic and found that the service's average QPS had climbed to about 120, the average RT was 2 to 3 seconds, and some spikes reached 5 to 6 seconds (normally it is about 60ms).
The service is an internal operations-platform service (only two Docker instances are deployed) with an expected QPS in the single digits, and nothing had been released recently. Its core job is aggregating database queries: a single request involves up to about 40 DB queries, and the final result is a multi-layer tree structure with a response body of about 50KB. I had previously made a verbal agreement with the caller to cache the results. A QPS of about 120 proved there was no cache, so I asked the caller to add one and bring the QPS down. After that, QPS dropped below 80 and RT returned to normal (about 60ms on average), and eventually QPS fell to 40 (the caller still needs to be pushed to cache properly so that QPS stays in the single digits).
Since the service's core operation is querying the database, with about 40 DB queries per request, the first suspect was slow SQL. DB performance monitoring showed the average DB RT was within 0.3ms, which puts the total DB time per request at only about 12ms (40 × 0.3ms). Slow SQL was therefore ruled out as the cause of the high RT.
Next I wondered whether requests were queuing for DB connections under high concurrency. The default tddl connection pool size is 10, but monitoring showed the number of occupied connections never exceeded 7, so an undersized connection pool was also ruled out.
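As a sanity check, Little's Law predicts how many connections this load should hold at once. A small sketch using the figures above (QPS 120, 40 queries per request, 0.3ms per query); the class and method names are just for illustration:

```java
// Little's Law sketch: average connections in use =
// query arrival rate (queries/s) * average hold time (s).
public class PoolCheck {
    static double connectionsInUse(double qps, int queriesPerRequest, double queryRtMs) {
        double queriesPerSecond = qps * queriesPerRequest;   // 120 * 40 = 4800 queries/s
        return queriesPerSecond * (queryRtMs / 1000.0);      // 4800 * 0.0003s ≈ 1.44
    }

    public static void main(String[] args) {
        // Even at QPS 120, only ~1.5 connections are busy on average,
        // well under the tddl default pool size of 10.
        System.out.println(connectionsInUse(120, 40, 0.3));
    }
}
```

This agrees with the monitoring: at sub-millisecond query RT, even 4800 queries per second occupy very few connections, consistent with the observed peak of 7.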
At this point, the database level had been ruled out as the cause of the high RT.
Next I walked through each execution point on the sampled call chains to see where the time was going. Many execution points shared the same pattern: the local call was slow (hundreds of milliseconds), while the actual remote call (e.g., the DB query) was fast (0ms means the execution took under 1ms, which indirectly confirms the sub-0.3ms average DB RT):
- Local call time: 267ms
- Client sent request: 0ms
- Server processed request: 0ms
- Client received response: 1ms
- Total time: 1ms
Now the picture sharpened: local method execution was taking too long. But a fresh review of all the service's code turned up no long-running local logic, so I moved on to the CPU load.
The load had been hovering around 4 for a long time. Our Docker containers run on 4c8g hosts, but a container does not get the 4 cores to itself, so sustaining that load is abnormal.
Chasing the high CPU load, I checked the GC log and found a large number of allocation failures: ParNew was running more than 100 times per minute, which is clearly abnormal. A sample GC log line:
```
2020-03-25T16:16:18.390+0800: 1294233.934: [GC (Allocation Failure) 2020-03-25T16:16:18.391+0800: 1294233.935: [ParNew: 1770060K->25950K(1922432K), 0.0317141 secs] 2105763K->361653K(4019584K), 0.0323010 secs] [Times: user=0.12 sys=0.00, real=0.04 secs]
```
Each ParNew collection costs only about 0.04s of real time, but at over 100 collections per minute the cumulative CPU time is still considerable.
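Plugging the numbers from the log line into a quick estimate (over 100 collections per minute, real ≈ 0.04s and user ≈ 0.12s per collection; the class name is illustrative):

```java
// Rough GC overhead estimate from the GC log figures above.
public class GcOverhead {
    // Fraction of wall-clock time the application is paused by ParNew.
    static double pauseFraction(double gcPerMinute, double realSecPerGc) {
        return (gcPerMinute * realSecPerGc) / 60.0;   // 100 * 0.04 / 60 ≈ 6.7%
    }

    // CPU-seconds burned by GC per minute ("user" time is summed
    // across the parallel ParNew threads, hence 0.12s vs 0.04s real).
    static double cpuSecondsPerMinute(double gcPerMinute, double userSecPerGc) {
        return gcPerMinute * userSecPerGc;            // 100 * 0.12 = 12 s/min ≈ 0.2 cores
    }

    public static void main(String[] args) {
        System.out.println(pauseFraction(100, 0.04));
        System.out.println(cpuSecondsPerMinute(100, 0.12));
    }
}
```

So even at "only" 0.04s per pause, the application is stopped roughly 6 to 7% of the time, and the GC threads burn about 12 CPU-seconds per minute, i.e. a fifth of a core, on a container that does not own its 4 cores to begin with.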
Next, the JVM memory parameters:
```
Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 4294967296 (4096.0MB)
   NewSize                  = 2147483648 (2048.0MB)
   MaxNewSize               = 2147483648 (2048.0MB)
   OldSize                  = 2147483648 (2048.0MB)
   NewRatio                 = 2
   SurvivorRatio            = 10
   MetaspaceSize            = 268435456 (256.0MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 536870912 (512.0MB)
   G1HeapRegionSize         = 0 (0.0MB)
```
The young generation is allocated 2GB, of which the Eden area is about 1.7GB.
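The 1.7GB figure follows from the parameters above: with SurvivorRatio = 10, the young generation is split Eden : S0 : S1 = 10 : 1 : 1, so Eden gets 10/12 of the 2048MB. A quick check (class name illustrative):

```java
// Eden = NewSize * SurvivorRatio / (SurvivorRatio + 2),
// since the young gen is divided Eden : S0 : S1 = ratio : 1 : 1.
public class EdenSize {
    static double edenMb(double newSizeMb, int survivorRatio) {
        return newSizeMb * survivorRatio / (survivorRatio + 2);
    }

    public static void main(String[] args) {
        System.out.println(edenMb(2048, 10));  // ≈ 1706.7MB, i.e. about 1.7GB
    }
}
```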
Using jmap to inspect what was occupying the young generation, the top-ranked objects were classes under the org.apache.tomcat.util.buf package, such as ByteChunk, CharChunk, and MessageBytes, plus a number of temporary object lists created while building the response. ByteChunk and its siblings are Tomcat's response-output classes.
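Before drawing the conclusion, one back-of-the-envelope number ties the GC frequency to the memory picture: every ParNew collection means Eden filled up once, so a ~1.7GB Eden collecting 100+ times a minute implies an allocation rate on the order of 3GB/s (figures taken from the sections above; the class name is illustrative):

```java
// Allocation-rate estimate: each ParNew collection means Eden filled once.
public class AllocRate {
    static double allocGbPerSecond(double edenGb, double gcPerMinute) {
        return edenGb * gcPerMinute / 60.0;   // 1.7 * 100 / 60 ≈ 2.8 GB/s
    }

    public static void main(String[] args) {
        System.out.println(allocGbPerSecond(1.7, 100));
    }
}
```

For scale: at ~120 QPS with 50KB responses, the response bodies themselves account for only ~6MB/s, which suggests most of the churn comes from the temporary objects and Tomcat output buffers that the jmap histogram surfaced.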
At this point the cause was clear. Sending the oversized response body (50KB) out requires copying it from user-space buffers to kernel space before it reaches the network card (zero-copy techniques, such as Netty's, exist precisely to avoid this kind of copy), and assembling the response creates a large number of temporary object lists. Under high concurrency, the young generation fills up quickly and GC fires constantly (the interface will be load-tested later at a suitable time). One more point: many people believe ParNew GC does not stop the world, but it does. Frequent ParNew collections block the user threads and force them to give up their CPU time slices, so connection handling queues up and the interface's RT soars. Throughout the troubleshooting, visual monitoring platforms such as Eagle Eye and the DB performance monitor helped enormously; without them you would be groping blindly through logs. Some lessons learned:
- Interface design: avoid oversized response bodies. Divide and conquer; split one large interface into several smaller ones.
- Cache design: when one request fans out into nearly 40 DB queries, as here, server-side caching should be considered (I was lazy at the time and pushed caching onto the caller).
- Performance awareness: know your own system's performance like the back of your hand. Load testing tells you the system's ceiling; otherwise a single overwhelmed interface can drag down the availability of the whole application.
- Traffic isolation: internal applications need to be isolated from external traffic. Even with a cache in front, cache breakdown remains a risk.
- Verbal agreements are unreliable. Put them in writing, and verify the implementation.
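As a concrete illustration of the server-side caching point, here is a minimal sketch using only the JDK. The loadTree loader stands in for the ~40-query DB fan-out and is hypothetical, the TTL is fixed, and a production version would want size bounds and real eviction (e.g. a caching library):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal TTL cache sketch: collapses repeated identical requests into
// one DB fan-out per TTL window instead of ~40 queries per call.
public class TreeCache {
    private record Entry(Object value, long expiresAtMillis) {}

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public TreeCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    public Object get(String key, Function<String, Object> loadTree) {
        long now = System.currentTimeMillis();
        Entry e = cache.compute(key, (k, old) ->
            (old != null && old.expiresAtMillis() > now)
                ? old                                              // still fresh: reuse
                : new Entry(loadTree.apply(k), now + ttlMillis));  // missing/expired: rebuild
        return e.value();
    }
}
```

Because ConcurrentHashMap.compute runs the mapping function atomically per key, concurrent requests for the same key collapse into a single load, which also mitigates the cache-breakdown risk mentioned under traffic isolation.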
Author: Eden baby