A Java Process That Was Running Fine: Why Did It Suddenly Seize Up?

Time: 2020-10-17

Garbage collection has always been a pain point in Java

You can't build a product like Redis in Java. Java's garbage collection means we don't have to think about reclaiming objects as we write code, but it also adds the cost of reclamation: mark-copy has to copy memory, and mark-sweep needs to stop the world. So when we need a cache of any real size, we lean on middleware such as Redis to do it for us. As Java developers we enjoy the comfort of automatic memory reclamation, but we also need to understand how to optimize memory use.

Why the FGC never stops

When does GC happen

To understand why a system cannot stop doing FGC, we first need to know when the JVM triggers GC. At the JVM level, when we create a new object, the JVM allocates the memory the object needs on the heap; if there is not enough memory available, GC is triggered, and the allocation returns the object's address only once space has been freed. The JVM first performs a YGC, the mark-copy collection we usually talk about. If there is still not enough space after the YGC, an FGC is performed. Likewise, if there is still not enough space after the FGC, FGC is run again and again until enough space can be allocated.
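A minimal sketch of that escalation, assuming a small fixed heap so the behaviour is easy to reproduce (the class name and heap size are illustrative, not from the article):

```java
// Run with a small heap to reproduce the escalation quickly, e.g.
//   java -Xmx64m -Xms64m -XX:+PrintGCDetails AllocationPressure
// (-XX:+PrintGCDetails is the JDK 8 flag; on JDK 9+ use -Xlog:gc*)
import java.util.ArrayList;
import java.util.List;

public class AllocationPressure {
    public static void main(String[] args) {
        List<byte[]> retained = new ArrayList<>();
        while (true) {
            // Each allocation may trigger a YGC; once the old generation is
            // also full the JVM falls back to FGC, and when repeated FGCs
            // free nothing it finally throws OutOfMemoryError.
            retained.add(new byte[1024 * 1024]); // 1 MB that stays reachable
        }
    }
}
```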

Causes of non-stop FGC

As mentioned above, FGC can be triggered by any line of code that allocates. If there is still not enough space after an FGC, the JVM keeps running FGC until the allocation succeeds. The JVM does cap the proportion of time spent in GC before throwing an OutOfMemoryError: with the HotSpot parallel collector, spending more than 98% of total time in GC while recovering less than 2% of the heap raises "GC overhead limit exceeded". There are five common causes of frequent FGC:

  • Memory leaks
  • Slow request processing, leaving too many threads allocating memory at the same time
  • Metaspace exhausted
  • The constant pool filling up the heap
  • Off-heap (direct) memory exhausted

In a high-concurrency system, most FGCs are caused by slowed-down request processing. Suppose a single machine handles 10,000 TPS and a request normally takes 1 ms; then on average only about 10 requests are in flight at any moment. If performance jitter pushes each request up to 100 ms, the number of in-flight requests rises to about 1,000. Each thread creates new objects while handling its request, and requests that stay alive that long have the same effect as a memory leak: they exhaust the system's memory. FGC then adds its own performance cost on top, making the system slower still and producing an avalanche.
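A quick back-of-the-envelope check of those numbers with Little's law (in-flight requests ≈ arrival rate × latency), using the article's example figures:

```
in-flight ≈ arrival rate × latency
normal:  10,000 req/s × 0.001 s ≈ 10 requests in flight
jitter:  10,000 req/s × 0.100 s ≈ 1,000 requests in flight
```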

How to keep the system alive through FGC

Eliminate memory leaks

There is plenty of material online about the causes of memory leaks and how to fix them, so I won't repeat it here. Outages caused by memory leaks are very common. Some systems periodically pull configuration from the database and cache it in a collection; if the collection that was meant to be a set is accidentally declared as a list, it keeps growing on every refresh until the heap overflows. Good programming habits and attention to detail avoid a lot of these mystery problems.
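A minimal sketch of that leak pattern, assuming the refresh is driven by a scheduled task; the class name and limits are illustrative, not from the article:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ConfigCache {
    // Bug: the list never deduplicates, so every scheduled refresh appends
    // another full copy of the configuration and heap usage grows forever.
    private final List<String> leakyCache = new ArrayList<>();

    // Fix: a Set keeps only distinct entries (or simply replace the whole
    // collection on each refresh instead of appending to it).
    private final Set<String> boundedCache = new HashSet<>();

    // Called from a scheduled task, e.g. once a minute.
    public void refresh(List<String> latestConfigRows) {
        leakyCache.addAll(latestConfigRows);   // unbounded growth -> FGC loop
        boundedCache.clear();
        boundedCache.addAll(latestConfigRows); // stays the size of the config
    }
}
```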

Concurrency limiting: keep the system from being crushed

Every server has an upper limit on how many requests it can process in parallel. No matter how fast requests are handled, once that limit is exceeded they pile up. Under high concurrency you have to cap the number of requests being processed at the same time to keep the system stable. Note that some systems also run degradation (fallback) logic when rejecting excess requests; that fallback logic has its own performance cost and needs its own concurrency cap. If even the degraded path exceeds its limit, skip the fallback and throw an exception directly, as in the sketch below.
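A minimal sketch of that idea (not any particular framework's implementation): a per-server limit with a separately capped fallback path. The permit counts are illustrative, not recommendations.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public class ConcurrencyLimiter {
    // Illustrative limits; in practice they are tuned per server.
    private final Semaphore requestPermits  = new Semaphore(200);
    private final Semaphore fallbackPermits = new Semaphore(50);

    public String handle(Supplier<String> business, Supplier<String> fallback) {
        if (requestPermits.tryAcquire()) {
            try {
                return business.get();
            } finally {
                requestPermits.release();
            }
        }
        // Too many requests in flight: run the cheaper degradation logic,
        // which must also be capped or it can crush the server itself.
        if (fallbackPermits.tryAcquire()) {
            try {
                return fallback.get();
            } finally {
                fallbackPermits.release();
            }
        }
        // Even the fallback is saturated: fail fast instead of queueing.
        throw new IllegalStateException("rejected: concurrency limit reached");
    }
}
```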

Adaptive rate limiting: keep the system from being "touched to death"

There are two reasons why we need adaptive rate limiting:

  • Each server is in a different environment

Some servers share their host with offline-computing VMs, some are deployed on physical machines, and some run on newer or older hardware, so the QPS each server can sustain is not the same. If every server in a distributed system is given the same rate-limit threshold, the limit either fails to do its job or lets the slower servers get knocked over under high QPS. That is why the individual server is the most appropriate granularity for rate limiting.

  • Even with the correct rate-limit threshold set, the system can still be touched to death

When the QPS hitting a single machine is 6 to 20 times its rate limit, the cost of rejecting requests can no longer be ignored. During the Spring Festival Gala, for example, some systems with correctly configured limits were still brought down by traffic 6 to 20 times the limit; this way of dying is what we call being "touched to death". What we can do about it is to dynamically lower the rate-limit threshold when traffic surges that far past the limit, as sketched below. For example, suppose a system can initially accept 1,000 QPS, but accepting those plus rejecting another 5,000 QPS would kill it. At that point we lower the limit to 100 QPS, which raises the volume of rejections the system can survive, so even with 6,000 requests per second coming in, the system stays alive.
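A minimal sketch of dropping to a lower "survival" limit when arrivals run several times above the configured limit. This is only an illustration, not the Noah algorithm mentioned below; the thresholds, the one-second window, and the simplified (not strictly thread-safe) window rotation are all assumptions.

```java
import java.util.concurrent.atomic.AtomicLong;

public class AdaptiveRateLimiter {
    private final long normalLimit;   // e.g. 1_000 QPS under normal load
    private final long survivalLimit; // e.g. 100 QPS when being "touched"
    private final AtomicLong arrivedThisSecond = new AtomicLong();
    private volatile long currentLimit;
    private volatile long windowStartMillis = System.currentTimeMillis();

    public AdaptiveRateLimiter(long normalLimit, long survivalLimit) {
        this.normalLimit = normalLimit;
        this.survivalLimit = survivalLimit;
        this.currentLimit = normalLimit;
    }

    public boolean tryAcquire() {
        long now = System.currentTimeMillis();
        if (now - windowStartMillis >= 1000) {
            // Once per second: if arrivals were several times the limit,
            // drop to the survival limit; otherwise return to normal.
            long arrived = arrivedThisSecond.getAndSet(0);
            currentLimit = (arrived >= 6 * normalLimit) ? survivalLimit : normalLimit;
            windowStartMillis = now;
        }
        long n = arrivedThisSecond.incrementAndGet();
        return n <= currentLimit; // everything over the limit is rejected cheaply
    }
}
```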

Alibaba has a product that uses an algorithm to adjust the per-machine threshold dynamically, and it has been described publicly. Interested readers can search for the article on Noah adaptive rate limiting on Alibaba's technical WeChat public account.

Monitor abnormal traffic: keep long-tail requests from dragging down the system

When monitoring a system we usually look at 99th-percentile metrics. But if a reasonable rate limit is in place and the system still goes down under load, it is time to look at the remaining 1%: the long-tail data. Some long-tail requests have an outsized impact on the system. Imagine a PUT request carrying tens of megabytes of data; that is extremely unfriendly to a Java heap and is likely to trigger FGC, slow down every other request, and set off a chain of problems.
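One concrete guard against that kind of long-tail request is to reject oversized payloads before they land on the heap. A minimal sketch using the standard javax.servlet Filter API (Servlet 3.1+ for getContentLengthLong, Servlet 4.0+ so the init/destroy defaults apply); the 2 MB threshold is illustrative:

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

public class PayloadSizeFilter implements Filter {
    private static final long MAX_BODY_BYTES = 2 * 1024 * 1024; // illustrative

    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        // getContentLengthLong() returns -1 when the length is unknown
        // (e.g. chunked transfer); a production version would also need to
        // cap how much of a streamed body it is willing to read.
        if (req.getContentLengthLong() > MAX_BODY_BYTES) {
            ((HttpServletResponse) resp).sendError(413, "payload too large");
            return;
        }
        chain.doFilter(req, resp);
    }
}
```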

In short, when our system is being restarted again and again because of FGC, few of us feel like taking the time to understand the root cause of the performance problem. We would rather pull out whichever needle is causing the pain, sleep soundly at night, and dig new pits with renewed energy during the day. I hope every programmer gets to run a stable system.

Author information: Tongmu, GitHub account zhdd99, senior development engineer in Alibaba's infrastructure business unit, currently responsible for Alibaba's IDC monitoring system.



This article is original content from the Yunqi Community and may not be reproduced without permission.
