By Ma Yingying, web front-end development engineer at NetEase Smart Enterprise
In a mature instant messaging application, WebSocket is a key link: it provides a full-duplex communication channel between the client and server of a web application. However, because both WebSocket itself and the underlying TCP connection can be unstable, developers have to design a complete keep-alive, liveness-detection, and reconnection scheme around it to ensure timeliness and high availability in practice. Reconnection speed in particular directly affects the “immediacy” and user experience of upper-layer applications. If WeChat still could not send or receive messages a full minute after the network came back, wouldn’t that drive you crazy?
Quickly restoring WebSocket availability after a network change is therefore particularly important.
A quick introduction to WebSocket
WebSocket was born in 2008 and became an international standard in 2011. All major browsers now support it. It is an application-layer protocol: a true full-duplex communication protocol designed specifically for web clients and servers.
You can understand the WebSocket protocol by analogy with HTTP. They differ as follows:
- The URL scheme of HTTP is http, and that of WebSocket is ws
- HTTP requests can only be initiated by the client, and the server cannot actively push messages to the client; with WebSocket it can
- HTTP requests are subject to the same-origin policy, so communication between different origins requires cross-origin handling; WebSocket has no same-origin restriction
They also have a lot in common:
- Both are application-layer communication protocols
- They use the same default ports: 80, or 443 over TLS
- Both can be used for communication between browser and server
- Both are based on TCP
Figure: The relationship between HTTP, WebSocket, and TCP
Breaking down the reconnection process
First, consider the question: when do you need to reconnect?
The most obvious case is that the WebSocket connection is broken: to send and receive messages, we need to initiate another connection. In many scenarios, however, the connection is not closed but is effectively unusable, for example when the device switches networks, an intermediate router on the link crashes, or the server is too heavily loaded to respond. The WebSocket in these scenarios is not disconnected, but the upper layer cannot send or receive data normally. So before reconnecting, we need a mechanism that detects whether the connection and the service are available, and detects it quickly, so that we can recover quickly from the unavailable state.
Once the connection is found to be unusable, you can disconnect and discard the old connection and then initiate a new one. These two steps seem simple, but doing them quickly is not so easy.
The first is disconnecting the old connection. How can the client do this quickly? According to the protocol, the client must negotiate with the server to close a WebSocket connection. But when the client cannot reach the server to negotiate, how can it disconnect and recover quickly?
The second is initiating the new connection quickly. “Quickly” here does not mean connecting immediately, which could have an unpredictable impact on the server. Reconnection usually uses a backoff algorithm, so the attempt is made after a delay. But how do you trade off the reconnection interval against performance cost? How do you initiate a connection quickly at the “right point in time”?
With these questions in mind, let’s take a closer look at the three steps: detecting, disconnecting, and reconnecting.
Quickly detecting when reconnection is needed
The scenarios that require reconnection fall into three types: first, the connection is disconnected; second, the connection is not broken but is unusable; third, the service at the other end is unavailable.
The first scenario is very simple. The connection is already gone, so it must be re-established.
For the latter two, whether the connection or the service is unavailable, the impact on the upper-layer application is the same: instant messages can no longer be sent or received. From this perspective, a simple and crude way to detect when to reconnect is to send a heartbeat packet. If a heartbeat is sent and no reply arrives from the server within a certain period, the service is considered unavailable. This method is the most direct. If you want rapid detection, the only option is to send heartbeats more often. However, a heartbeat that is too frequent consumes too much traffic and power on mobile devices. This method therefore cannot provide fast detection on its own, but it works well as a fallback mechanism for checking the connection and the service.
To detect an unusable connection, you can also watch the network status in addition to sending heartbeats. Network disconnection, Wi-Fi switching, and network switching are the most direct causes of an unusable connection, so when the network state changes from offline to online you need to reconnect in most cases. Not always, though: WebSocket is built on TCP, and a TCP connection does not sharply perceive network changes seen at the application layer, so a brief network interruption sometimes leaves the WebSocket connection unaffected, and it can still communicate normally once the network is restored. Therefore, when the network goes from offline to online, you can immediately send a heartbeat to check whether the current connection is still usable. If the server's heartbeat reply arrives normally, the connection is still available. If no reply arrives before the timeout, the connection needs to be re-established, as shown on the right side of the figure above. The advantage of this method is speed: it can tell whether the connection is usable as soon as the network is restored, and recovery can start quickly if it is not. However, it only covers cases where WebSocket becomes unavailable because of a network change visible at the application layer.
To sum up, periodic heartbeat detection is stable and covers all scenarios, but it is not fast; checking the network status is fast and more sensitive because it does not wait for the heartbeat interval, but it covers fewer scenarios. We can therefore combine the two. Send periodic heartbeats at a slow rate, such as once every 40s or 60s, chosen according to the application scenario. Then, when the network state changes from offline to online, send a heartbeat immediately to check whether the current connection is usable, and recover immediately if it is not. In most cases, upper-layer communication then recovers quickly from the unavailable state. For the remaining few scenarios, the periodic heartbeat acts as a fallback, and recovery happens within one heartbeat cycle.
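The combined scheme can be sketched roughly as follows. This is a minimal illustration rather than a production implementation: `HeartbeatMonitor`, `HeartbeatTransport`, `sendHeartbeat`, and the 40s / 5s defaults are all assumptions made for the example, and the heartbeat is assumed to be an application-level message that the server answers.

```typescript
// Hypothetical minimal transport: anything that can send one heartbeat message.
interface HeartbeatTransport {
  sendHeartbeat(): void;
}

class HeartbeatMonitor {
  private transport: HeartbeatTransport;
  private onUnavailable: () => void;
  private intervalMs: number;
  private timeoutMs: number;
  private intervalTimer: ReturnType<typeof setInterval> | undefined;
  private replyTimer: ReturnType<typeof setTimeout> | undefined;

  constructor(
    transport: HeartbeatTransport,
    onUnavailable: () => void,
    intervalMs: number = 40000, // slow periodic heartbeat (assumed value)
    timeoutMs: number = 5000,   // how long to wait for a reply (assumed value)
  ) {
    this.transport = transport;
    this.onUnavailable = onUnavailable;
    this.intervalMs = intervalMs;
    this.timeoutMs = timeoutMs;
  }

  // Start the slow periodic heartbeat, the fallback that covers every scenario.
  start(): void {
    this.stop();
    this.intervalTimer = setInterval(() => this.probe(), this.intervalMs);
  }

  // Send one heartbeat now. Also call this when the network goes offline -> online.
  probe(): void {
    if (this.replyTimer !== undefined) return; // a probe is already waiting for its reply
    this.transport.sendHeartbeat();
    this.replyTimer = setTimeout(() => {
      this.replyTimer = undefined;
      this.onUnavailable(); // no reply in time: treat the connection as unusable
    }, this.timeoutMs);
  }

  // Call when the server's heartbeat reply arrives.
  handleReply(): void {
    if (this.replyTimer !== undefined) {
      clearTimeout(this.replyTimer);
      this.replyTimer = undefined;
    }
  }

  get waitingForReply(): boolean {
    return this.replyTimer !== undefined;
  }

  // Stop the periodic heartbeat and cancel any pending reply timeout.
  stop(): void {
    if (this.intervalTimer !== undefined) clearInterval(this.intervalTimer);
    this.intervalTimer = undefined;
    this.handleReply();
  }
}
```

In a browser, one way to wire this up would be to call `start()` once the socket opens, `handleReply()` whenever the server's heartbeat reply arrives, and `probe()` from a listener for the `online` event on `window`.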
Quickly disconnecting the old connection
Usually, if the old connection still exists, it should be disconnected before the next connection is initiated. This releases resources on both the client and the server, and prevents data from being sent or received over the old connection by mistake.
We know that WebSocket transfers data over TCP, with the server at one end of the connection and the client at the other. The side that closes a TCP connection first is the one that holds the TIME_WAIT state, and that should be the server rather than the client, so in most normal cases the server should initiate the close of the underlying TCP connection. In other words, when the server is instructed to close the WebSocket, it should immediately start closing the TCP connection; when the client is instructed to close the WebSocket, it should send a close signal to the server and then wait for the server to close the underlying TCP connection, or for a timeout.
When the client wants to disconnect the old WebSocket, there are two cases: the connection is usable, or it is not. When the old connection is usable, the client can send the close signal directly to the server, and the server then initiates the disconnection. When the old connection is not usable, for example because the client has switched Wi-Fi, the client sends the close signal but the server never receives it, and the client can only wait for the timeout before the connection is considered closed. Disconnection by timeout takes a relatively long time. Is there a way to disconnect quickly?
The upper-layer application cannot change the protocol-level rule that the server should initiate the disconnection, so the only place to start is the application logic. For example, the upper layer can use business logic to guarantee that the old connection is completely dead, simulate its disconnection, and then initiate a new connection to restore communication. This amounts to attempting to disconnect the old connection and, if that is not possible, simply discarding it so that the next step can start right away. When using this approach, you must make sure that the old connection has completely failed as far as the business logic is concerned. For example, all data received from the old connection is dropped, the old connection cannot block the establishment of the new one, the old connection cannot affect the new connection or the upper-layer business logic, and so on.
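A minimal sketch of “try to close, otherwise discard” might look like the following. `SocketLike` is a hypothetical subset modeled on the browser `WebSocket` interface, and `discardSocket` is an illustrative helper name, not part of any library.

```typescript
// Hypothetical subset of a WebSocket-like object; only what this sketch needs.
interface SocketLike {
  onopen: ((event: any) => void) | null;
  onmessage: ((event: any) => void) | null;
  onerror: ((event: any) => void) | null;
  onclose: ((event: any) => void) | null;
  close(code?: number, reason?: string): void;
}

// Make the old connection dead from the business logic's point of view,
// without waiting for the close handshake or a timeout to finish.
function discardSocket(socket: SocketLike): void {
  // Detach every handler first, so late events or data from the old
  // connection can no longer reach the upper layer.
  socket.onopen = null;
  socket.onmessage = null;
  socket.onerror = null;
  socket.onclose = null;
  try {
    // Best effort: ask for a normal close. If the server is unreachable,
    // the handshake simply never completes, and we do not wait for it.
    socket.close();
  } catch {
    // Ignore: the socket is being thrown away either way.
  }
}
```

After `discardSocket(oldSocket)` returns, the caller can drop its reference to the old socket and move straight on to creating the new connection.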
Quickly initiating a new connection
Readers with IM development experience will know that when reconnecting for network reasons, you must never initiate a new connection immediately. Otherwise, whenever there is network jitter, all devices will connect to the server at the same moment. That is no different from a denial-of-service attack in which a hacker exhausts network bandwidth by launching a large number of requests, and for the server it is a disaster. Reconnection therefore usually uses a backoff algorithm and is initiated after a delay, as shown in the flow chart on the left.
What if you want to connect quickly? The most direct way is to shorten the interval between retries. The shorter the interval, the sooner communication is restored once the network is back. However, retrying too frequently wastes performance, bandwidth, and power. How can the two be balanced?
A more reasonable approach is to increase the retry interval gradually as the number of retries grows. On the other hand, when the network state changes from offline to online, a reconnection is more likely to succeed, so the reconnection interval can be reduced accordingly, as shown on the right side of the figure above (this interval also grows as the number of retries increases).
In addition, the interval can be adjusted according to how likely a reconnection is to succeed, in combination with the business logic. For example, when the network is offline or the application is in the background, the interval can be lengthened, and when success is more likely it can be shortened, so as to speed up reconnection.
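The interval rules in this section could be expressed as a small pure function, sketched below. The names (`reconnectDelay`, `BackoffOptions`) and all numeric defaults are assumptions chosen for illustration. The random jitter is also an addition that the text does not spell out; it is included here as one common way to keep many clients from retrying at the same instant.

```typescript
// Illustrative tuning knobs; every default below is an assumed value.
interface BackoffOptions {
  baseMs: number;       // delay before the first retry
  maxMs: number;        // upper bound on the delay
  onlineFactor: number; // multiplier applied right after offline -> online
}

const DEFAULT_BACKOFF: BackoffOptions = {
  baseMs: 1000,
  maxMs: 30000,
  onlineFactor: 0.25,
};

// Delay in milliseconds before retry number `attempt` (0-based).
function reconnectDelay(
  attempt: number,
  justCameOnline: boolean,
  options: BackoffOptions = DEFAULT_BACKOFF,
  random: () => number = Math.random,
): number {
  // Exponential growth with the number of retries, capped at maxMs.
  const exponential = Math.min(
    options.maxMs,
    options.baseMs * 2 ** Math.max(0, attempt),
  );
  // A reconnection is more likely to succeed right after the network
  // comes back, so the interval is shortened in that case.
  const scaled = justCameOnline ? exponential * options.onlineFactor : exponential;
  // Jitter in [0.5, 1) spreads out clients that lost the network together.
  const jitter = 0.5 + random() * 0.5;
  return Math.round(scaled * jitter);
}
```

A caller would typically pass the result to `setTimeout` before opening the next connection, incrementing `attempt` after each failure and resetting it to 0 once a connection succeeds.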
This article first broke WebSocket disconnection and reconnection down into three steps: determining when reconnection is needed, disconnecting the old connection, and initiating a new connection. It then analyzed how to complete each step quickly under different WebSocket and network states. First, detect whether the current connection is usable by sending heartbeat packets periodically, and at the same time listen for network recovery events and send a heartbeat immediately after recovery, so that the current state is detected quickly and the need to reconnect can be judged. Second, under normal conditions the server disconnects the old connection; when the client has lost contact with the server, it discards the old connection directly and the upper layer simulates the disconnection, which makes the disconnect fast. Finally, when initiating a new connection, use a backoff algorithm to delay the attempt. To balance wasted resources against reconnection speed, the reconnection interval can be increased while the network is offline and reduced when the network is normal or has just changed from offline to online, so that the connection is restored as quickly as possible.