Causes of and solutions to abnormal memory usage in a .NET program

Time:2022-9-20

1. Summary

Around March of this year, I was suddenly transferred to another project team to solve an abnormal memory problem in production. After two weeks of hard work, I finally solved it. Here I will share my thinking and approach, in the hope that it gives some ideas to anyone who is currently facing the same kind of problem.

2. The scene

When the head of the department came to me, this is how he described the situation:

"The service currently has a problem with abnormal committed memory. The analysis so far suggests that a large number of log messages may be piling up in the log component, filling the memory and causing the service to crash. A customer in a certain region of China has 15,000 IoT devices on its server, and they are not working properly. The problem is very urgent and needs to be resolved immediately."

That was the whole problem description; no other information was available. My first reaction was to panic, but when a task like this lands on you, you can't say no. And if I could resolve such a major incident, it would be a chance to prove myself in front of the department head.

3. Ideas

(1) Analysis

Part1, analyze the cause of log accumulation

  • Get the server address, pull the log files and check their content; the entries were mostly errors where the xxx object was null and errors where object conversion failed.
  • The log component itself was also poorly implemented: every calling class created its own new Log object.

Solution:

  • Fix the null-object problem by adding a null check. The probable cause is that a null value comes in when the JSON value is converted, which then triggers both kinds of error as a chain reaction. It is worth noting that JSON conversions are usually wrapped in a try block to catch exceptions. Entering a try block is cheap, but throwing and catching an exception is expensive in a .NET program, so a check up front that keeps the exception from being thrown at all is preferable. If exception handling drags performance down, everything else is processed more slowly as a side effect, which would seem to explain the data accumulation.
  • Refactor the log component into a thread-safe singleton. The data structure written to the log was a class; I changed it to a struct. The considerations were that a reference type brings shared-reference problems, that value types and reference types occupy memory differently, and that a small value type avoids a separate heap allocation per entry and so can be faster to process.
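The original log component is not shown, so the following is only a minimal sketch of what a thread-safe singleton logger with a struct entry could look like. The names `LogEntry`, `Logger` and `Write`, and the choice of `Console.Error` as the output, are assumptions made for illustration. It uses `Lazy<T>`, which is available from .NET Framework 4 onward, and avoids newer C# syntax.

```csharp
using System;
using System.IO;
using System.Threading;

// Hypothetical log entry, declared as a struct as described above.
public struct LogEntry
{
    public DateTime Time;
    public string Level;
    public string Message;
}

public sealed class Logger
{
    // Lazy<T> creates the single instance exactly once, even when
    // several threads ask for it at the same time.
    private static readonly Lazy<Logger> LazyInstance =
        new Lazy<Logger>(() => new Logger(),
                         LazyThreadSafetyMode.ExecutionAndPublication);

    public static Logger Instance
    {
        get { return LazyInstance.Value; }
    }

    private readonly object _sync = new object();
    private readonly TextWriter _output;

    // Private constructor: callers cannot "new" their own Logger.
    private Logger()
    {
        _output = Console.Error; // assumption: a real component would write to a file
    }

    public void Write(LogEntry entry)
    {
        // Null check up front instead of relying on a catch block.
        string message = entry.Message ?? "(null)";
        string line = string.Format("{0:o} [{1}] {2}", entry.Time, entry.Level, message);

        // One writer at a time, so lines from different threads do not interleave.
        lock (_sync)
        {
            _output.WriteLine(line);
        }
    }
}
```

Every calling class would then use `Logger.Instance.Write(...)` instead of creating its own Log object.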

Do you think this is the end? No. After the modified program had run on the test server overnight, a colleague from the test department came to me the next morning and said that the error reports were gone, but the memory leak was still there.

Part2, find the root cause of memory leaks

It turned out that Part1 had only fixed a small bug; the problem was not as simple as I had thought. While reading the logs I had also found "tcp service refused to connect XXX exception" entries. I was in a bad mood when I saw this….

1. Early in the morning, I ran the service program under the Profiler and found the following:

(1) Several message queues were taking up a lot of space. Checking the code showed that all the data exchanged between the server program and the 15,000 IoT devices accumulates in this queue first. If the queue is full (the upper limit of the Queue is set to 20,000), a new Queue is created and the overflow is moved into it. Worst of all, the data is still taken off the queue by a single thread.

(2) There are also many disk I/O operations: socket communication packets, content that needs to be forwarded and so on are all written to the application server's disk.
(3) Step-by-step debugging showed that most methods are implemented synchronously, and that the framework version is actually .NET Framework 4.

Solution:

(1)

[Remove the mechanism that creates new queues, delete the upper limit on the Main Queue and process the Queue with multiple threads. The essence of any data accumulation is that the data cannot be processed fast enough, so opening up more memory space only makes for a slow death.]
[Visit the IoT hardware department and ask how often the IoT devices send data, how many devices there are, and how large a single message from a single device is in KB. Why do you need to know? First, these figures can be recorded in the program and turned into trend charts, which makes changes in the queue directly observable; in meetings they give the leaders convincing evidence of when the data volume spikes, how large the data is and so on. Second, because the packet data has to be stored locally on the application server, they make it possible to calculate whether the volume written exceeds the write I/O limit of an ordinary hard disk, and how much network bandwidth is used.]
[Visit the IoT hardware department a second time and ask whether the IoT devices' sockets follow the normal TCP close handshake (the "tcp wave") when they finish transmitting data. Why? Because a TCP socket is a duplex channel: if one end is suddenly disconnected, the other end goes into a "wait" state and does not release the TCP connection resources in time. Imagine 15,000 devices making short-lived connections at high frequency: the server's connection queue resources are likely to be overwhelmed. The server then needs to actively close the "invalid" connections and reclaim their resources promptly ("tear down the duplex channel"), and the size of the socket connection queue needs to be adjusted.]
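The first point above, replacing the single-threaded reader with several workers on one queue, can be sketched roughly as follows. This is not the project's actual code: `PacketPump`, `StartConsumers` and the `handle` callback are hypothetical names, and `BlockingCollection<T>` (available in .NET Framework 4) is one possible choice of queue. Created without a capacity argument it is unbounded, which matches the removal of the upper limit described above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class PacketPump
{
    // Starts workerCount consumers that drain the queue in parallel.
    // Each worker exits once CompleteAdding() has been called and the queue is empty.
    public static Task[] StartConsumers(
        BlockingCollection<byte[]> queue, int workerCount, Action<byte[]> handle)
    {
        if (queue == null) throw new ArgumentNullException("queue");
        if (handle == null) throw new ArgumentNullException("handle");
        if (workerCount <= 0) throw new ArgumentOutOfRangeException("workerCount");

        var workers = new Task[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = Task.Factory.StartNew(() =>
            {
                // GetConsumingEnumerable blocks while the queue is empty and
                // hands each packet to exactly one worker.
                foreach (byte[] packet in queue.GetConsumingEnumerable())
                {
                    handle(packet);
                }
            }, TaskCreationOptions.LongRunning);
        }
        return workers;
    }
}
```

The receiving side would call `queue.Add(packet)` for each incoming packet, and `queue.CompleteAdding()` followed by `Task.WaitAll(workers)` on shutdown. The `handle` callback must itself be thread-safe, since several workers call it concurrently.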

(2) As for writing message data to disk, I used all my powers of persuasion to get the project manager to cut it, in order to save CPU and reduce disk I/O. Imagine having to do a disk I/O operation for every single socket send and receive; that is a terrible thing. In the end, the project manager agreed to cut the disk-write function of some modules. So the question became what to do with the rest, and how to extend the gain further. Continuing through the project code, it turned out that every "receive" and every "send" in the socket communication performs one write. What needs to be done is to accumulate packets up to a certain number before writing. For example, if 1000 packets are accumulated and then written in one go, the frequency of disk I/O operations drops by roughly a factor of 1000.
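The batching idea can be illustrated with a small sketch. The class name `BatchedPacketWriter` and its members are hypothetical, and the real project may have buffered differently; the point is only that packets are collected in memory and the target stream is written once per batch rather than once per packet.

```csharp
using System;
using System.IO;

public sealed class BatchedPacketWriter
{
    private readonly Stream _target;
    private readonly int _batchSize;
    private readonly MemoryStream _buffer = new MemoryStream();
    private readonly object _sync = new object();
    private int _pendingCount;

    // Number of times the target stream has actually been written to.
    public int FlushCount { get; private set; }

    public BatchedPacketWriter(Stream target, int batchSize)
    {
        if (target == null) throw new ArgumentNullException("target");
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");
        _target = target;
        _batchSize = batchSize;
    }

    // Buffers one packet; writes to the target only when the batch is full.
    public void Add(byte[] packet)
    {
        if (packet == null) return;
        lock (_sync)
        {
            _buffer.Write(packet, 0, packet.Length);
            _pendingCount++;
            if (_pendingCount >= _batchSize) FlushLocked();
        }
    }

    // Writes whatever is buffered, e.g. on a timer or at shutdown.
    public void Flush()
    {
        lock (_sync) { FlushLocked(); }
    }

    private void FlushLocked()
    {
        if (_pendingCount == 0) return;
        _buffer.WriteTo(_target);   // one write for the whole batch
        _target.Flush();
        _buffer.SetLength(0);       // reuse the buffer for the next batch
        _pendingCount = 0;
        FlushCount++;
    }
}
```

With a batch size of 1000, adding 2500 packets leads to two writes, and a final `Flush()` writes the remaining 500. The trade-off is that packets still in the buffer are lost if the process crashes before the next flush.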

(3) The last item is to change all methods to asynchronous methods. This is where Task, Async and Await come in. But the project is based on .NET Framework 4. I consulted the MSDN documentation and found that these features can still be used on the ancient .NET Framework 4; the usage is a little awkward, but the code can still be optimized. Remember that server development requires a "server" mindset. If every method is synchronous, each call blocks in a "waiting for the result" state, and the concurrency of the server cannot increase.
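The article does not show how the methods were converted. One way that works on plain .NET Framework 4, without the async/await keywords, is to wrap the existing Begin/End pair of a stream in a Task with `TaskFactory.FromAsync` and chain the follow-up work with `ContinueWith`. The helper names `AsyncStreamIo`, `ReadAsync` and `ReadTextAsync` below are hypothetical, and the sketch assumes the payload is UTF-8 text purely for the sake of the example.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;

public static class AsyncStreamIo
{
    // Wraps BeginRead/EndRead in a Task, so no thread blocks while waiting for data.
    public static Task<int> ReadAsync(Stream stream, byte[] buffer, int offset, int count)
    {
        return Task<int>.Factory.FromAsync(
            stream.BeginRead, stream.EndRead, buffer, offset, count, null);
    }

    // Performs one read and decodes the bytes received once the read completes.
    public static Task<string> ReadTextAsync(Stream stream, int maxBytes)
    {
        var buffer = new byte[maxBytes];
        return ReadAsync(stream, buffer, 0, maxBytes)
            .ContinueWith(t => Encoding.UTF8.GetString(buffer, 0, t.Result));
    }
}
```

The same pattern applies to a `NetworkStream` obtained from a socket. The caller receives a Task immediately and attaches further continuations instead of blocking on the result, which is the awkward part compared with writing `await`.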

Although it was not much use here, I will also share the "comment-out method": comment out the places most likely to be at fault and check them one by one to find the problem. It is very time-consuming; I was working 12 hours a day. A company's ancient projects in particular tend to have "bad code", "basically no design", "a low .NET Framework version" and a whole pile of other unpleasant things.

(2) Tools

  • The Profiler that comes with Visual Studio. [Can analyze CPU usage, memory usage, etc.; this is recommended]
  • VMMap [analyzes a process's virtual and physical memory usage]
  • ANTS Performance Profiler [This tool is more powerful; it can analyze the call chain and show you step by step where memory is used and how much]
  • The Resource Monitor that comes with the Windows operating system; it goes without saying that everyone knows how to use it.

Part3, summary

With the above modifications, memory stayed stable at about 2.9 GB over 3 weeks of running on the test server.

Be sure to remember:

  • "Don't complain about anything difficult."
  • "A good software engineer is hired to solve problems, not create them."
  • "When a task is assigned, masters always state a deadline for solving the problem and deliver on time, instead of hedging and backing away."
  • "Think calmly when you encounter problems, and believe that you can do it; even if you fail, try it out."
  • "Don't say anything while the problem is unsolved; it just sounds like making excuses. Keep your mouth shut and think of a solution."

In fact, a lot of interesting things happened while I was solving this problem, but in the end we still have to prove ourselves by solving hard problems. Development and learning are a process of continuous improvement. Don't look down on colleagues whose skills are weaker, and always keep an apprentice's heart.

Part4, Easter eggs

After this problem was solved, my standing among colleagues in the department improved (especially with the testers, because they no longer had to go and check the server every day). And because the major incident was finally resolved, the department head offered me a lateral transfer to the R&D center in another province as project manager, with a 10% raise. You can see how worthwhile it is to master emergency troubleshooting skills.
