This article was first published on the WeChat public account "The Blackboard Newspaper of Preserved Egg".
The author works at Jingdong (JD.com) and has a deep understanding of stability assurance, agile development, advanced Java and microservice architecture.
The R&D team reported that a host was seeing heavy TCP retransmission. Retransmission is the most basic error-recovery mechanism in TCP; its purpose is to recover from lost packets.
Packets can be lost for many reasons.
1. Network equipment or line failure
Case: CRC data validation errors occur frequently on a device interface.
Characteristics: the problem persists. All data passing through the node is affected, so a large number of servers are affected.
2. Link congestion caused by traffic bursts on the data path
Case: packet loss caused by a saturated private line.
Characteristics: sudden and short-lived, and often periodic. All data passing through the node is affected, so a large number of servers are affected.
3. Client-side server failure
Case: a server's network card fails or its performance degrades.
Characteristics: the failure lasts a long time and affects only one device.
4. Server-side server failure
Case: a server's network card fails.
Characteristics: the failure lasts a long time. All requests to the node are affected, so many servers are affected.
5. Server-side performance degradation
Case: during a promotional campaign the server receives too many requests and its performance degrades.
Characteristics: sudden, and periodic if the server regularly receives a large volume of requests. All requests to this device (or cluster) may be affected, so many servers are affected.
6. Performance degradation of proxy nodes or VIPs
Case: a load-balancing cluster fails or its performance degrades.
Characteristics: sudden and periodic. All requests through this node are affected, so a large number of servers are affected.
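The six causes above differ mainly in how long the problem lasts and how many servers it touches, so a first triage can be written as a simple lookup. The following is only a sketch: the Cause class, the two boolean fields and the candidate_causes function are names invented here, and the table is a simplified transcription of the list above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Cause:
    name: str
    persistent: bool    # True: lasts a long time; False: sudden or short-lived
    many_servers: bool  # True: many servers affected; False: a single device


# Simplified transcription of the six causes listed above.
CAUSES = [
    Cause("network equipment or line failure", True, True),
    Cause("link congestion from traffic bursts", False, True),
    Cause("client-side server failure", True, False),
    Cause("server-side server failure", True, True),
    Cause("server-side performance degradation", False, True),
    Cause("proxy node or VIP performance degradation", False, True),
]


def candidate_causes(persistent: bool, many_servers: bool) -> List[str]:
    """Return the causes whose listed characteristics match the observation."""
    return [
        c.name
        for c in CAUSES
        if c.persistent == persistent and c.many_servers == many_servers
    ]


if __name__ == "__main__":
    # A long-lasting problem seen on only one machine.
    print(candidate_causes(persistent=True, many_servers=False))
```

The lookup only narrows the search; several causes share the same pattern, so the checks that follow are still needed to tell them apart.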
First, capture packets into a pcap file: tcpdump -i nsdb475e5d-86 -vvv -w tcp_retry.pcap. Preserving the evidence is important. At the same time, check whether the on-duty group and the network emergency response group have received similar reports. If others report the same problem, confirm the affected scope promptly and look for what the affected servers have in common, for example whether they are concentrated in one data center, one POD or one physical machine.
The following command shows, in real time, the number of TCP segments retransmitted per second on the system. For online monitoring, tsar (Taobao System Activity Reporter), developed by Alibaba, is recommended.
nstat -z -t 1 | grep -e TcpExtTCPSynRetrans -e TcpRetransSegs -e TcpOutSegs -e TcpInSegs
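The raw counters are easier to judge as a ratio of retransmitted segments to segments sent. The sketch below is an illustration, not part of the original procedure: it assumes a Linux host and reads the same kernel counters from /proc/net/snmp, and the function names parse_tcp_counters, retrans_ratio and sample_ratio are made up for this example.

```python
import time


def parse_tcp_counters(snmp_text):
    """Parse the 'Tcp:' header/value line pair of /proc/net/snmp into a dict."""
    tcp_lines = [
        line.split() for line in snmp_text.splitlines() if line.startswith("Tcp:")
    ]
    if len(tcp_lines) < 2:
        raise ValueError("no Tcp: header/value pair found")
    header, values = tcp_lines[0], tcp_lines[1]
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


def retrans_ratio(before, after):
    """Retransmitted segments divided by segments sent between two snapshots."""
    sent = after["OutSegs"] - before["OutSegs"]
    retrans = after["RetransSegs"] - before["RetransSegs"]
    return retrans / sent if sent > 0 else 0.0


def sample_ratio(interval=1.0, path="/proc/net/snmp"):
    """Take two snapshots `interval` seconds apart (Linux only) and return the ratio."""
    with open(path) as f:
        before = parse_tcp_counters(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_tcp_counters(f.read())
    return retrans_ratio(before, after)
```

Calling sample_ratio() in a loop gives a rough per-second view similar to the nstat command above. What ratio counts as "serious" depends on the environment and is not specified here.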
Use netstat -s to view the overall statistics, grouped by protocol.
Use ss -anti | grep -B 1 retrans to see retransmission statistics for individual connections, down to the IP + port. For ease of display, ss -tan is used for the demonstration here.
1. LISTEN status:
In this state the two queue columns (Recv-Q and Send-Q) relate to the listen backlog. They are shown as 0 here; the maximum backlog actually in effect is taken from the kernel parameter net.core.somaxconn.
2. Other states:
(1) Recv-Q: the network receive queue. It shows how much received data has been buffered locally but not yet read by the process. If it is briefly non-zero, the connection may be in a half-open state. If Recv-Q stays backed up, the host may be under a denial-of-service attack.
(2) Send-Q: the network send queue. It holds data that the peer has not yet received or acknowledged and that is still in the local buffer. If Send-Q does not drain quickly, either the application is sending data too fast or the peer is not receiving it fast enough.
For non-LISTEN states both values should normally be zero. Packets should not pile up in either queue; a briefly non-zero value is acceptable, but a persistent one suggests a problem.
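The rule above (non-LISTEN sockets should have empty queues) can be checked mechanically against saved ss -tan output. This is a minimal sketch that assumes the usual five-column layout of ss -tan; the function name nonzero_queues and the addresses in SAMPLE are invented for illustration.

```python
def nonzero_queues(ss_output):
    """Return (state, recv_q, send_q, local, peer) for non-LISTEN sockets
    whose Recv-Q or Send-Q is not zero."""
    flagged = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Expected columns: State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
        if len(fields) < 5 or not fields[1].isdigit() or not fields[2].isdigit():
            continue  # header line or a line in an unexpected format
        state, recv_q, send_q = fields[0], int(fields[1]), int(fields[2])
        if state != "LISTEN" and (recv_q > 0 or send_q > 0):
            flagged.append((state, recv_q, send_q, fields[3], fields[4]))
    return flagged


# Hypothetical output, not taken from the incident described in this article.
SAMPLE = """State      Recv-Q Send-Q Local Address:Port   Peer Address:Port
LISTEN     0      128    0.0.0.0:22000        0.0.0.0:*
ESTAB      0      0      10.0.0.5:22000       10.0.0.9:51234
ESTAB      0      4344   10.0.0.5:40112       10.0.0.7:22000
"""

if __name__ == "__main__":
    for entry in nonzero_queues(SAMPLE):
        print(entry)
```

A single snapshot only shows that a queue was non-zero at that instant. Since brief non-zero values are acceptable, the check is more meaningful when run several times and the same connection keeps showing up.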
Use ulimit -a to check the upper limit on file handles the service can open. A limit above 100,000 is normally sufficient.
Use ifconfig to check whether the network card shows persistent drops and errors.
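"Persistent" here means the drop and error counters keep growing, which takes two readings to see. The sketch below is one possible way to compare two readings; it assumes a Linux host and parses the kernel's /proc/net/dev text rather than ifconfig output. The function names parse_net_dev and growing_counters are hypothetical.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev text into {iface: {counter_name: value}}
    keeping only the error and drop counters."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        iface, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # unexpected layout, skip
        # Receive columns 0-7: bytes packets errs drop fifo frame compressed multicast
        # Transmit columns 8-15: bytes packets errs drop fifo colls carrier compressed
        stats[iface.strip()] = {
            "rx_errs": int(fields[2]),
            "rx_drop": int(fields[3]),
            "tx_errs": int(fields[10]),
            "tx_drop": int(fields[11]),
        }
    return stats


def growing_counters(before, after):
    """Return {iface: {counter_name: delta}} for counters that increased."""
    result = {}
    for iface, counters in after.items():
        if iface not in before:
            continue  # interface appeared between snapshots, nothing to compare
        deltas = {
            name: value - before[iface][name]
            for name, value in counters.items()
            if value > before[iface][name]
        }
        if deltas:
            result[iface] = deltas
    return result
```

Reading /proc/net/dev twice a few seconds apart and passing both snapshots to growing_counters shows which interfaces are still accumulating drops or errors, as opposed to carrying an old non-zero total.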
The container itself looked healthy, so the next step was to analyze the capture file with Wireshark.
Check the IO graph to confirm that the link is not busy: on a link that is not saturated, the IO curve rises and falls a lot, with peaks and idle gaps.
Open Analyze -> Expert Info to view the notes at each severity level under the different tabs, such as retransmission statistics and connection establishment and reset statistics.
Filtering on retransmissions showed that they were concentrated on 22000 and 22000, the two communication ports of JSF, the intranet service framework.
The first guess was a service exception or communication exception on an upstream interface. Click on an entry to see the details, or go back to the main window, enter the filter tcp.analysis.retransmission, and then click on a packet to see the details.
Most of the retransmissions happened while data was being transferred. A PSH, ACK packet indicates that data has started to be sent to the server.
Many upstream interfaces and different kinds of dependencies (such as JMQ) all showed retransmissions, which indicates that the problem is in the network rather than in any one interface. We used MTR (which combines the functions of traceroute, ping and nslookup) to check latency and packet loss toward the peer address along the path, and found that one hop in the middle had a loss rate of 16.7%. We then asked colleagues in the network group to investigate.
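When reading an MTR report it helps to check whether loss at a middle hop carries through to the hops after it, because a router that merely rate-limits its own replies shows loss that disappears at later hops. The sketch below is a rough illustration of that check. It assumes the text layout printed by mtr --report; the hosts and numbers in SAMPLE are invented (only the 16.7% figure echoes the article), and the function names are hypothetical.

```python
import re

# Matches report lines such as "  2.|-- 10.0.1.1   16.7%   6   1.2 ..."
HOP_RE = re.compile(r"^\s*(\d+)\.\|--\s+(\S+)\s+([\d.]+)%?\s+(\d+)")


def parse_mtr_report(text):
    """Return a list of (hop, host, loss_percent, sent) tuples."""
    hops = []
    for line in text.splitlines():
        m = HOP_RE.match(line)
        if m:
            hops.append(
                (int(m.group(1)), m.group(2), float(m.group(3)), int(m.group(4)))
            )
    return hops


def first_persistent_loss(hops, threshold=1.0):
    """Return the first hop from which every later hop also shows at least
    `threshold` percent loss, or None if there is no such hop."""
    for i, hop in enumerate(hops):
        if all(later[2] >= threshold for later in hops[i:]):
            return hop
    return None


# Invented sample report for illustration only.
SAMPLE = """HOST: app-host                    Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 10.0.0.1                   0.0%     6    0.3   0.3   0.2   0.4   0.1
  2.|-- 10.0.1.1                  16.7%     6    1.2   1.4   1.1   2.0   0.3
  3.|-- 10.0.2.9                  16.7%     6    1.5   1.6   1.4   2.1   0.2
"""

if __name__ == "__main__":
    print(first_persistent_loss(parse_mtr_report(SAMPLE)))
```

The threshold of 1% is an arbitrary default chosen for the example, not a recommended value.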
Supplement 1. Common Wireshark operations
1. Statistics -> Conversations: session statistics. It counts the packets and bytes sent and received in each conversation. With this tool you can find which conversation (IP address or port number) uses the most bandwidth on the network and adjust network policy accordingly.
2. Statistics -> Flow Graph: a graphical view of the conversation. It also shows whether there are TCP delays such as Delayed ACK, and whether the server has the Nagle algorithm enabled.
Supplement 2. Common Info notes in Wireshark
1、Packet size limited during capture
The marked packet was not captured in full. This is usually caused by how the capture was taken: on some operating systems only the first 96 bytes of each frame are captured by default.
2、TCP Previous segment not captured
If Wireshark finds that a packet's Seq is larger than the previous packet's Seq + Len, it knows that a segment is missing in between. If the missing segment does not appear anywhere in the capture (so reordering is ruled out), it shows this note.
3、TCP ACKed unseen segment
When Wireshark sees an ACK for a packet that does not appear in the capture, it shows this note.
4、TCP Out-Of-Order
When Wireshark finds that a packet's Seq is smaller than the previous packet's Seq + Len, it treats the packet as out of order and shows this note.
5、TCP Dup ACK
When reordering or packet loss occurs, the receiver gets packets whose Seq is larger than expected. Each time it receives such a packet, it ACKs the expected Seq value again, which is how it alerts the sender.
6、TCP Fast Retransmission
When the sender receives three or more [TCP Dup ACK] packets, it infers that an earlier packet has probably been lost and retransmits it right away.
7、TCP Retransmission
If a packet is lost and no subsequent packets arrive to trigger [Dup ACK] at the receiver, fast retransmission does not happen. In this case the sender has to wait for the timeout before retransmitting.
8、TCP zerowindow
The “win” value in a packet is the size of the receive window. When Wireshark finds “win = 0” in a packet, it shows this note.
9、TCP window Full
This note indicates that the sender of this packet has used up the receive window advertised by the other side.
10、Time-to-live exceeded（Fragment reassembly time exceeded）
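Several of the notes above come down to two small rules: comparing a packet's Seq with the previous packet's Seq + Len, and counting duplicate ACKs. The sketch below illustrates both rules in a simplified form. It is not how Wireshark is implemented; it ignores sequence-number wraparound, SACK and timing, and the function names and return labels are invented for this example.

```python
def classify_segment(prev_seq, prev_len, seq):
    """Compare a packet's Seq with the previous packet's Seq + Len."""
    expected = prev_seq + prev_len
    if seq > expected:
        return "gap"       # a segment in between is missing from the capture
    if seq < expected:
        return "behind"    # out of order, or a retransmission
    return "in-order"


def fast_retransmit_points(acks, dup_threshold=3):
    """Given the ACK numbers a sender receives, in order, return the ACK
    values at which the duplicate count reaches `dup_threshold`."""
    triggers = []
    last_ack, dups = None, 0
    for ack in acks:
        if ack == last_ack:
            dups += 1
            if dups == dup_threshold:
                triggers.append(ack)  # sender would retransmit from this Seq
        else:
            last_ack, dups = ack, 0   # new ACK value resets the counter
    return triggers


if __name__ == "__main__":
    print(classify_segment(prev_seq=1000, prev_len=500, seq=2000))
    print(fast_retransmit_points([1000, 1500, 1500, 1500, 1500, 3000]))
```

In the second call the receiver keeps asking for Seq 1500. The first ACK carrying 1500 is the original and the next three are duplicates, so the sender would retransmit the segment starting at 1500 without waiting for a timeout.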
About the author: Liang Songhua, a senior engineer, has long focused on stability assurance, agile development, advanced Java and microservice architecture.