The interaction between TCP delayed ACK and the Nagle algorithm can delay transaction processing, and it is one of the most common network faults. Not long ago, the operations department of a financial institution upgraded its payment platform and then ran into slow transactions caused by exactly this interaction. Let's walk through the case using a network fault analysis approach.
After the operations staff completed the upgrade of the payment platform, they found that the new system's processing capacity had not improved noticeably: during peak payment periods, transactions slowed down and transaction volume could not rise, which seriously affected users' online purchase and payment experience. Repeated checks of the network and application configuration turned up no obvious abnormality, so the team decided to capture packets for analysis.
After obtaining the capture, the operations staff opened it in an analysis tool and found no packet loss, retransmissions, or slow server responses. Sorting by TCP time interval, however, revealed something abnormal: many intervals were about 200 ms long, and the ACKs that ended them carried almost no payload data. Our first thought was that these delays might be caused by the interaction between delayed ACK and the Nagle algorithm.
We then opened the capture in the flow-based Dali tool and set the time-interval threshold so that intervals over 190 ms were flagged in red. The session's data exchange showed that the server took more than 200 ms to reply with an ACK, which greatly reduced transmission efficiency and slowed business processing.
Analyzing each 200 ms delay showed that none of the delayed ACKs carried a payload and that each one acknowledged just one packet, and only after the 200 ms delay.
In short, the ACKs without payload are delayed by about 200 ms, and each delayed ACK acknowledges only a single request packet. Other sessions showed the same pattern, so we concluded that the delays are caused by the interaction between TCP delayed ACK and the Nagle algorithm.
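The check described above (look for payload-free ACKs that arrive roughly 200 ms after the previous packet) can be sketched in a few lines. This is only an illustration: the `Packet` record, the `find_delayed_pure_acks` helper, and the sample trace are hypothetical and do not come from the capture in this case; the 190 ms threshold mirrors the one set in the analysis tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Packet:
    ts: float          # capture timestamp in seconds
    src: str           # sender label, e.g. "client" or "server"
    payload_len: int   # TCP payload bytes (0 for a pure ACK)
    is_ack: bool       # True if the ACK flag is set


def find_delayed_pure_acks(
    packets: List[Packet], threshold_s: float = 0.19
) -> List[Tuple[int, float]]:
    """Return (index, gap) for pure ACKs sent at least threshold_s after the peer's previous packet."""
    hits = []
    for i in range(1, len(packets)):
        prev, cur = packets[i - 1], packets[i]
        gap = cur.ts - prev.ts
        is_pure_ack = cur.is_ack and cur.payload_len == 0
        # Only count ACKs that answer a packet from the other side.
        if is_pure_ack and cur.src != prev.src and gap >= threshold_s:
            hits.append((i, round(gap, 3)))
    return hits


# Made-up trace of one session, for illustration only.
trace = [
    Packet(0.000, "client", 120, True),  # small request segment
    Packet(0.201, "server", 0, True),    # pure ACK after about 200 ms
    Packet(0.202, "client", 80, True),   # rest of the request
    Packet(0.203, "server", 300, True),  # response, ACK piggybacked
]
suspects = find_delayed_pure_acks(trace)
print(suspects)
```

In a real investigation the records would be filled from the capture file rather than typed in by hand, and the same function would be run per session.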
Delayed ACK is a TCP mechanism for reducing the number of ACK-only packets. If the receiver has response data to send, the ACK is piggybacked on that data; if there is no response data yet, the ACK is held back for a while to see whether response data turns up that it can travel with.
The Nagle algorithm helps prevent network congestion by reducing the number of small packets (smaller than the MSS) on a connection: at any time there can be at most one unacknowledged small packet. In short, before the sender may send a packet smaller than the MSS, all previously sent packets must have been acknowledged. With delayed ACK, the receiver holds back its ACK, so the sender waits for the ACK while the receiver waits for more data, and the result is the ACK delay seen in this case.
Delayed ack + Nagle algorithm
Neither side waits indefinitely: the ACK delay is capped at 500 ms. In practice, the timer started by the system kernel typically has a default period of 200 ms. It counts from 0 to 200 ms in a cycle and checks every 200 ms whether an ACK is waiting to be sent, so the actual ACK delay can be any value up to 200 ms.
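To make the interaction concrete, here is a small model of the two rules: the Nagle send rule on the sender and a free-running 200 ms delayed-ACK timer on the receiver. It is a simplified sketch, not kernel behavior: `MSS`, `ACK_TICK_MS`, both function names, and the sample numbers are assumptions chosen for illustration, and network latency is ignored.

```python
import math

MSS = 1460          # assumed maximum segment size in bytes
ACK_TICK_MS = 200   # assumed period of the delayed-ACK timer


def nagle_can_send(segment_len: int, unacked_bytes: int) -> bool:
    """Nagle rule: a full-sized segment goes out at once; a small one only if nothing is unacknowledged."""
    return segment_len >= MSS or unacked_bytes == 0


def delayed_ack_time_ms(arrival_ms: float) -> int:
    """Time at which a timer ticking every ACK_TICK_MS flushes an ACK pending since arrival_ms."""
    # Ticks happen at 200, 400, 600, ...; the ACK leaves at the first tick after arrival.
    return (math.floor(arrival_ms / ACK_TICK_MS) + 1) * ACK_TICK_MS


# The sender writes one request as two small pieces (hypothetical sizes).
first_len, second_len = 100, 60
arrival_ms = 130.0  # hypothetical time at which the first piece reaches the receiver

first_sent = nagle_can_send(first_len, unacked_bytes=0)             # nothing outstanding
second_held = not nagle_can_send(second_len, unacked_bytes=first_len)

# The receiver has only part of the request, so it has no response to piggyback
# the ACK on and waits for the timer; the sender stalls until that ACK arrives.
ack_ms = delayed_ack_time_ms(arrival_ms)
stall_ms = ack_ms - arrival_ms
print(first_sent, second_held, ack_ms, stall_ms)
```

Because the timer in this model runs independently of packet arrival, the stall depends on where in the 200 ms cycle the first piece lands, which is why the delay can take any value up to 200 ms.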
Solving the problem
The Nagle algorithm suits only specific scenarios. It is often said that the Nagle algorithm introduces unnecessary delay, and that holds for most scenarios today, including this case. After the payment platform was upgraded, the Nagle algorithm was enabled by default, which led to ACK delays and affected the business. Once the Nagle algorithm was turned off, the problem was solved.
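The article does not say how the payment platform disabled the algorithm. On most systems the usual way is the per-socket `TCP_NODELAY` option; the sketch below shows it with Python's standard `socket` module, where `make_nodelay_socket` is a hypothetical helper name.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with the Nagle algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 tells the stack to send small segments immediately
    # instead of holding them until earlier data has been acknowledged.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_nodelay_socket()
# Read the option back to confirm that it took effect.
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
print(nodelay_enabled)
```

The option is set per socket, so it has to be applied to every connection the application opens, and it trades a few more small packets on the wire for lower latency.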
Automatic service path monitoring as a solution
Solving this kind of problem requires an understanding of TCP congestion control and TCP/IP fundamentals, and the troubleshooting process is relatively complex. Service path monitoring offers an alternative: it monitors continuously and automatically, and when a problem occurs it can analyze it and raise an alarm automatically, improving operations efficiency and keeping the business stable.
Tiandan's network performance management (NPM) product simplifies complex network performance management. NPM uses network packets and IPFIX flow data to build monitoring views covering the important links, key equipment, and core service paths of the data center. Complex network analysis techniques are built into the product's core, shortening service path mapping from months to weeks. In addition, based on mature data acquisition and processing technology, it can collect network data in full, refresh key network monitoring metrics in real time, raise accurate alarms, and monitor, audit, and backtrack the network behavior of key services, improving the security assurance of business systems.
Copyright notice: This is an original article by CSDN blogger “Tiandan netis”, licensed under the CC 4.0 BY-SA agreement. Please include a link to the original source and this notice when reposting.
Link to the original text: https://blog.csdn.net/Netis20…