Introduction: when a product's user base keeps doubling, demand will eventually force you to optimize the HTTP protocol. So, to push HTTP performance to the limit, which dimensions should we start from? In this article, TVP expert Tao Hui shares four dimensions with you. The "TVP" column condenses the thinking of leading experts and gathers their shared insights; we welcome your long-term attention. (Editor: Yunjia community)
About the author: Tao Hui is a Tencent Cloud Most Valuable Professional (TVP), and CTO and co-founder of Hangzhou Zhilianda Data Co., Ltd. He previously worked at Alibaba Cloud, Tencent, Huawei, Cisco and other companies, wrote the best-selling book "In-depth Understanding of Nginx: Module Development and Architecture Analysis", and produced, with Geek Time, the popular video courses "Web Protocols Explained with Packet-Capture Practice" and "100 Lectures on Core Nginx Knowledge".
Whether you work on the front end, the back end, or operations, HTTP is a network protocol you cannot avoid: it is the most widely used application-layer protocol. Optimizing it not only improves user experience by reducing latency, but also raises concurrency by reducing resource consumption.
However, someone who has just learned HTTP will find it hard to enumerate all of the protocol's optimization points. If you are preparing for an interview at a major company, or about to join a fast-growing project, it is worth understanding this area, because when a product's user base keeps doubling, demand will force you to optimize the HTTP protocol.
This article summarizes Tao Hui's talk at the 2019 GOPS Global Operations Conference in Shanghai. It aims to cover most HTTP optimization techniques from four dimensions, so that even if you do not need these methods to solve a current performance bottleneck, you will know the directions optimization can take, and when the need arises you will know what to search for.
1. Encoding efficiency optimization
The first dimension is encoding efficiency: turning messages into a shorter byte stream, faster. This is the most direct performance optimization.
If you have ever captured and analyzed HTTP/1.1 traffic, you will have noticed that it is a whitespace-delimited text encoding. Spaces and line breaks were chosen because, at its birth, HTTP pursued human readability, which helped its adoption.
Today, however, this encoding seriously hurts performance, so in 2009 Google launched the binary SPDY protocol, which greatly improved encoding efficiency. In 2015, with minor improvements, it was standardized as HTTP/2, and more than 50% of websites now use it.
This is the general direction of encoding optimization, and it includes the upcoming HTTP/3.
But how exactly do these new technologies improve performance? We need to start with data compression.
What a packet capture shows you is data, which is not the same thing as information. Data is the sum of information and redundancy, and compression removes as much redundancy as possible.
Compression comes in lossless and lossy forms. For images, audio and video we deal with lossy compression every day: when a browser only needs a thumbnail, there is no point wasting bandwidth on a high-definition original.
With lossy compression, HD video can be shrunk by a factor of thousands before the naked eye notices, because both audio and video can be compressed incrementally, frame against frame.
Remember VCDs? When a disc was scratched, everything after the scratch became unplayable. That was because the video was incrementally compressed with very few keyframes: once a keyframe was damaged, the delta frames that depended on it could no longer be decoded.
Now for lossless compression. You have certainly used gzip, which lets the HTTP body be compressed losslessly. The compressed message looks like garbage to the naked eye, but the receiver can decompress it back to the sender's exact original. Gzip is not especially efficient, though; compared with Google's brotli its weaknesses show:
When evaluating a compression algorithm we care about two metrics: compression ratio and compression speed. As the figure above shows, at every one of gzip's 9 compression levels, its compression ratio is lower than brotli's (whose level can additionally be configured higher, up to 11) and its speed is slower.
So if you can, you should upgrade from gzip to a newer compression algorithm as soon as possible.
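The level/ratio/speed tradeoff is easy to feel in a short experiment. The sketch below uses Python's standard zlib (DEFLATE, the algorithm behind gzip), since brotli requires a third-party package; the payload is made-up repetitive markup standing in for typical HTML:

```python
import zlib

# A compressible payload: repetitive text, like typical HTML or JSON.
payload = b"<div class='item'>hello world</div>\n" * 200

fast = zlib.compress(payload, 1)   # level 1: fastest, worst ratio
best = zlib.compress(payload, 9)   # level 9: slowest, best ratio

assert zlib.decompress(best) == payload       # lossless round trip
assert len(best) <= len(fast) < len(payload)  # higher level, smaller output
print(len(payload), len(fast), len(best))
```

The same shape of experiment, run with brotli against gzip, is what the benchmark figure above summarizes.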
With body compression covered, let's look at compressing HTTP headers. For HTTP/1.x the header is a performance killer, especially in today's era of cookie proliferation: every request carries several kilobytes of headers, wasting bandwidth, CPU and memory!
Through HPACK, HTTP/2 dramatically shrinks the encoded size of headers, and HTTP/3 evolves in the same direction (with QPACK). How does HPACK compress headers?
HPACK compresses headers with three tools: a static table, a dynamic table, and Huffman coding. For example, in the figure above, ":method: GET" exists in the static table and can be encoded as the integer 2, carried in a single byte. The very long Mozilla user-agent header, when it appears a second time, can be represented as the integer 62 in two bytes. Even on its first appearance, Huffman coding can compress that long browser identifier down to as little as 5/8 of its size.
The static table holds only the most common headers, some with a name only and some with both name and value. It is deliberately small: currently just 61 entries.
The dynamic table applies the idea of incremental encoding: a header is added to the dynamic table the first time it appears, and only its index in the table is transmitted from the second time on.
Huffman coding is widely used in compression software such as WinRAR, but HPACK's Huffman is different: it uses static Huffman coding.
Its code table was built from statistics gathered over years of real-world HTTP headers, with the Huffman tree shaped by each character's frequency. Under these rules, the most frequent characters, such as a, c, e or 0, 1, 2, take only 5 bits, while rarely seen characters can take tens of bits.
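The static-table/dynamic-table idea can be sketched in a few lines. This is a toy illustration, not the real RFC 7541 wire format (real HPACK, for instance, re-indexes the dynamic table from the newest entry and also applies Huffman coding to literals):

```python
# Toy sketch of HPACK-style header indexing: a small static table plus a
# dynamic table that grows as headers are seen, so a repeated header
# shrinks to a single small integer index.
STATIC_TABLE = {(":method", "GET"): 2, (":method", "POST"): 3}

class HpackLikeEncoder:
    def __init__(self):
        self.dynamic = {}        # (name, value) -> index
        self.next_index = 62     # dynamic entries start after the static table

    def encode(self, name, value):
        entry = (name, value)
        if entry in STATIC_TABLE:
            return ("indexed", STATIC_TABLE[entry])
        if entry in self.dynamic:
            return ("indexed", self.dynamic[entry])
        self.dynamic[entry] = self.next_index
        self.next_index += 1
        return ("literal", entry)   # first occurrence: sent literally

enc = HpackLikeEncoder()
print(enc.encode(":method", "GET"))             # ('indexed', 2)
print(enc.encode("user-agent", "Mozilla/5.0"))  # literal on first sight
print(enc.encode("user-agent", "Mozilla/5.0"))  # ('indexed', 62)
```

The second occurrence of the long user-agent costs two bytes on the wire instead of dozens, which is exactly the saving described above.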
With headers done, let's look at encoding the HTTP body. Here are three examples:
First, a small icon of a few dozen bytes does not deserve its own HTTP request. According to RFC 2397 it can be embedded directly in an HTML or CSS file as a data URI, and the browser recognizes it while parsing, as in the figure below:
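Such an inline icon can be generated programmatically. In this sketch the icon bytes are a placeholder, not a real image:

```python
import base64

# Build an RFC 2397 data URI so a tiny icon can be inlined in HTML/CSS
# instead of costing a separate HTTP request. These bytes merely stand
# in for a real icon file.
icon_bytes = b"\x89PNG\r\n\x1a\n...fake-icon..."
data_uri = "data:image/png;base64," + base64.b64encode(icon_bytes).decode("ascii")

img_tag = f'<img src="{data_uri}" alt="icon">'
print(img_tag[:60])
```

The tradeoff: base64 inflates the bytes by about a third, so this only pays off for resources small enough that a request's overhead dominates.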
Second, a project's JavaScript source may consist of many small files full of blank lines and comments. A tool such as webpack can bundle them into a single file on the server side and strip the redundant characters; the encoding payoff is considerable.
Third, a form can submit multiple elements at once, such as several check boxes or files, which reduces the number of HTTP requests.
As you can see, from header to body, HTTP offers many encoding techniques that make the transmitted message shorter, saving bandwidth and reducing latency.
2. Channel utilization optimization
With encoding efficiency covered, let's look at the "channel". Although this is a term from the communications field, it summarizes HTTP's optimization points so well that I will borrow it here.
Channel utilization includes three optimization points, the first of which is multiplexing: running many low-speed, higher-level channels over one high-speed, lower-level channel.
For example, a host has only one network card, yet the browser, WeChat and DingTalk can all send and receive messages at the same time; one process can serve tens of thousands of TCP connections; one TCP connection can carry multiple HTTP/2 stream messages simultaneously.
Second, to keep the channel efficient, errors must be recovered from promptly. A large part of TCP's job is to detect lost and out-of-order packets quickly and handle them.
Finally, as economics teaches, resources are always scarce. With limited bandwidth, how do we treat different connections, users and objects fairly?
For example, when downloading a page, giving CSS and images the same priority causes problems: it does not matter much if an image renders late, but if the CSS is late the page cannot render at all.
In addition, the control information that accompanies every message, headers that carry no user data yet are indispensable, costs bandwidth too. How do we reduce its share?
Let's start with multiplexing. Broadly speaking, multithreading and coroutines are forms of multiplexing too, but here I mainly mean HTTP/2 streams. HTTP was originally designed so that the client sends a request first and only then can the server reply with a response; sending and receiving this way can never fill the available bandwidth.
The most efficient pattern is for the sender to emit requests continuously while the receiver emits responses continuously, which is especially effective on long fat networks:
This is how HTTP/2 streams reuse a connection. We know Chrome will open at most six parallel connections to one site; with HTTP/2, a single connection can efficiently carry the hundreds of objects on a page.
I configured my personal website, www.taohui.pub, to support both HTTP/1.1 and HTTP/2. The figure below shows the difference between them from the connection's point of view.
Anyone familiar with Chrome's Network panel knows the waterfall view. It helps you analyze where an HTTP request is slow: was the request sent slowly, was the response received slowly, or did parsing take too long? Below is a comparison of my site from the waterfall's point of view.
From these two figures you can see that HTTP/2 beats HTTP/1.1 in every respect.
Now for recovering from network errors. At the application layer, lingering close delays closing the connection so that an RST does not make the browser discard an HTTP response it has already received, while timeouts use timers to detect errors and release resources promptly.
At the transport layer, enabling TCP timestamps (tcp_timestamps = 1) lets TCP measure the retransmission timeout RTO more accurately. Timestamps have a second purpose as well: preventing sequence numbers from wrapping around on long fat networks.
What is sequence number wraparound? Every TCP segment carries a sequence number, which is not an ordinal of the segment but a count of bytes sent. Because it is a 32-bit integer, it can distinguish at most 2^32 bytes, about 4.2 GB, of in-flight data.
As the figure above shows, when the segment covering bytes 1 G-2 G lingers in the network long enough, its sequence numbers collide with those of the later 5 G-6 G segment, causing errors.
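The arithmetic behind wraparound is easy to check: the sequence space is 2^32 bytes, so the faster the link, the sooner the counter wraps. A quick back-of-the-envelope calculation:

```python
# TCP sequence numbers count bytes modulo 2**32; on fast links the
# counter wraps quickly, which is why TCP timestamps (PAWS) are needed
# to tell old duplicate segments from current ones.
SEQ_SPACE = 2 ** 32               # ~4.29 GB of sequence space

def seconds_to_wrap(link_bits_per_sec):
    bytes_per_sec = link_bits_per_sec / 8
    return SEQ_SPACE / bytes_per_sec

print(round(seconds_to_wrap(100e6)))   # 100 Mbps: ~344 s to wrap
print(round(seconds_to_wrap(10e9)))    # 10 Gbps: ~3 s to wrap
```

At 10 Gbps the counter wraps in about three seconds, shorter than a segment may survive in a long fat network, hence the collisions described above.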
There are many kinds of network errors; for example, packet ordering is not guaranteed. Enabling tcp_sack (selective acknowledgment) reduces the number of retransmitted segments and the bandwidth they consume.
When you download a large file directly in Chrome on a poor network, any error forces you to restart the whole transfer, a terrible experience.
Downloading with Xunlei (Thunder) is much faster, because it splits the large file into many small pieces downloaded by multiple threads; when one piece fails, only that piece is downloaded again. Very efficient.
This resumable, multi-threaded download technique is HTTP's Range mechanism. If your service does caching, you can use Range requests too, for example with nginx's slice module.
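The splitting step such a downloader performs can be sketched as follows. This is a minimal illustration; real clients also check Accept-Ranges, validate with ETags, and retry failed pieces:

```python
# Split a file of known size into byte ranges for HTTP Range requests.
# The header form is "Range: bytes=start-end", with both ends inclusive.
def split_ranges(total_size, chunk_size):
    ranges = []
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

# A 10 MB file in 4 MB chunks -> three Range requests.
for start, end in split_ranges(10 * 2**20, 4 * 2**20):
    print(f"Range: bytes={start}-{end}")
```

Each range can then be fetched on its own connection or thread, and a failed piece is simply re-requested, which is the resumable behavior described above.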
In fact, congestion control is the most important mechanism for recovering from network errors, and it can raise overall network performance. Some readers will ask: TCP already has flow control, so why does congestion happen at all? Because the routers along a TCP path have mismatched capacities.
As the figure above shows, R1 peaks at 700 Mbps and R2 at 600 Mbps; both must pass through R3 to reach R4, but R3's maximum bandwidth is only 1000 Mbps! When TCP on R1 and R2 runs at full speed, R3 drops packets. Congestion control exists to deal with this packet loss.
Since TCP's birth in 1982, it has used the traditional loss-based congestion control algorithms: detect packet loss, then hit the brakes. The results are poor.
Why? Look at the figure below. Routers keep a buffer queue. When the queue is empty, ping latency is at its minimum; as the queue fills, ping latency grows even though no packet has been lost yet; only when the queue is full do drops begin.
So while the queue has a backlog, there is no packet loss and peak bandwidth is unaffected, but latency grows, which is exactly what we want to avoid.
Measurement-driven congestion control algorithms begin braking at the point where the queue starts to back up. In an era of cheap memory and ever larger queues, these new algorithms are especially effective.
Since Linux kernel 4.9, Google's BBR algorithm has been available as an alternative to the default CUBIC congestion control.
As the figure below shows, at a 0.01% packet loss rate CUBIC's throughput has already collapsed while BBR is unaffected; BBR's bandwidth only drops sharply once loss reaches about 5%.
Now for the fair allocation of resources. To treat connections and users fairly, servers apply rate limiting. For example, the leaky bucket algorithm in the figure below smooths out traffic bursts and allocates bandwidth more fairly.
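A minimal leaky-bucket sketch (my own toy illustration, not any particular server's implementation) looks like this:

```python
# Leaky-bucket rate limiter: the bucket drains at a fixed rate, and a
# request is admitted only if the bucket has room, which smooths bursts
# into a steady outflow.
class LeakyBucket:
    def __init__(self, capacity, leak_rate):
        self.capacity = capacity      # max queued units (e.g. requests)
        self.leak_rate = leak_rate    # units drained per second
        self.level = 0.0
        self.last = 0.0

    def allow(self, now):
        # Drain whatever leaked since the last check, then try to add one unit.
        self.level = max(0.0, self.level - (now - self.last) * self.leak_rate)
        self.last = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False

bucket = LeakyBucket(capacity=3, leak_rate=1.0)    # 1 request/s sustained
burst = [bucket.allow(0.0) for _ in range(5)]
print(burst)              # first 3 admitted, the burst's tail rejected
print(bucket.allow(2.0))  # after 2 s the bucket has drained enough again
```

nginx's limit_req module implements this same idea in production form.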
Another example is the priority feature in HTTP/2. A page has hundreds of objects of differing importance, and some depend on others; for instance, some JS files depend on jquery.js, and if everything is treated equally, those files are useless even once downloaded until jquery.js arrives.
HTTP/2 lets the browser assign each object's stream a weight according to its parsing rules (a weight between 1 and 256), and every proxy and origin server allocates memory and bandwidth by priority, improving network efficiency.
Finally, look at TCP's packetization efficiency, which also affects HTTP performance. For example, enabling the Nagle algorithm greatly reduces the number of small packets in the network; given the roughly 40 bytes of TCP/IP headers on every packet, the payload's share of each packet rises.
The cork algorithm is similar to Nagle but suppresses small packets even more aggressively. Cork and Nagle act on the sender, while quickack controls the number of pure-ACK small packets on the receiver side, again raising the proportion of useful information.
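These switches are plain socket options. A sketch in Python: TCP_NODELAY (which disables Nagle) is portable, while TCP_CORK and TCP_QUICKACK are Linux-only, so they are guarded here:

```python
import socket

# Tuning TCP packetization from user space via socket options.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Disable Nagle: send small writes immediately (latency over efficiency).
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))  # nonzero when set

if hasattr(socket, "TCP_CORK"):      # Linux: batch small writes into full packets
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)
if hasattr(socket, "TCP_QUICKACK"):  # Linux: control delayed-ACK behaviour
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)

sock.close()
```

Note the direction of the tradeoff: Nagle and cork favor bandwidth efficiency at the cost of latency, so latency-sensitive services often set TCP_NODELAY instead.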
3. Transmission path optimization
Having covered the channel at the micro level, let's zoom out to the third, macro, optimization point: the transmission path.
The first optimization on the path is caching, which is everywhere: in browsers, CDNs, load balancers and other components.
You are probably familiar with the basics of caching, so here let's talk about using expired cache entries. Throwing an expired entry away is wasteful, because "expired" is decided by the client's timer and does not mean the resource is actually invalid.
Instead, you can send the entry's identifier to the origin server and let the server judge whether the cache is still valid. If it is, the server returns 304 with an empty body, which saves a great deal of bandwidth.
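The revalidation logic can be sketched like this. It is a toy server-side check of my own; real servers also handle If-Modified-Since, weak ETags, and Vary:

```python
import hashlib

# Conditional revalidation: the server derives an ETag from the body;
# when the client's If-None-Match matches, it answers 304 with an empty
# body instead of resending everything.
def make_etag(body):
    return '"%s"' % hashlib.sha1(body).hexdigest()[:16]

def respond(body, if_none_match=None):
    etag = make_etag(body)
    if if_none_match == etag:
        return 304, b""           # cache still valid: headers only
    return 200, body              # first request, or content changed

body = b"<html>hello</html>"
status, payload = respond(body)                   # cold cache
print(status, len(payload))                       # 200 18
status, payload = respond(body, make_etag(body))  # revalidation
print(status, len(payload))                       # 304 0
```

The second exchange moves a handful of header bytes instead of the whole body, which is exactly the saving described above.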
For a load balancer, expired cache entries can also protect the origin by limiting back-to-origin requests; and when the origin server is down, serving stale content gives users a degraded but working experience, which is far better than returning 503.
The second optimization on the path concerns slow start. To avoid overflowing the bottleneck router, the system's TCP stack ramps its sending rate up gradually; the starting rate is governed by the initial congestion window.
In the early days the initial congestion window was 1 MSS (usually 576 bytes); it was later raised to 3 MSS (Linux 2.5.32) and then, at Google's suggestion, to 10 MSS (Linux 3.0).
The window kept growing because, as the Internet developed, web pages became ever richer and larger. Too small a start window means the first page takes longer to download, and the experience suffers.
Modifying the initial window is easy; the figure below shows how to adjust it on Linux.
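For reference, on Linux the adjustment is typically made per route with iproute2. This is a hypothetical sketch: the gateway address and device name below are placeholders for your own default route, and the change requires root:

```shell
# Inspect the current default route first.
ip route show default
# e.g.: default via 192.168.1.1 dev eth0

# Raise the initial congestion/receive windows to 10 segments
# (placeholder gateway and device; substitute your own).
ip route change default via 192.168.1.1 dev eth0 initcwnd 10 initrwnd 10
```

The setting is per route and does not survive a route flush, so it is usually applied from boot-time network configuration.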
Changing the initial window is a common performance tuning measure; CDN vendors, for example, all adjust it. The figure below shows the initial window sizes of mainstream CDN vendors in 2014 and in 2017.
You can see that some windows were too large in 2014 and had been scaled back by 2017. So a bigger initial window is not always better: too large, and it puts pressure on the bottleneck router.
Next, let's see how the transmission path can be upgraded from pull mode to push mode.
For example, an index.html file may contain <link href="some.css">, which the browser can only request after receiving and parsing index.html. With HTTP/2 server push, the server can transmit index.html and some.css in two parallel streams, saving half the time.
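If you serve with nginx, a push configuration could look like the hypothetical sketch below. The http2_push directive is real (available from nginx 1.13.9), though note that recent nginx releases have removed server push again, and HTTP/3 dropped push in favor of other mechanisms:

```nginx
# Hypothetical sketch: push some.css alongside index.html so both
# travel to the browser in parallel HTTP/2 streams.
server {
    listen 443 ssl http2;

    location = /index.html {
        http2_push /some.css;
    }
}
```

An alternative with similar intent is the 103 Early Hints / preload Link header, which tells the browser what to fetch without the server guessing its cache state.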
In practice, however, HTTP/2's parallel streams degrade badly when packets are lost, because TCP's head-of-line blocking problem remains unsolved.
In the figure above, SPDY is equivalent to HTTP/2. When the red, green and blue streams are transmitted concurrently, the TCP layer still serializes them. Suppose the red stream was sent first and a red segment is lost: even if the receiver already holds the complete green and blue streams, TCP will not deliver them to HTTP/2, because TCP must preserve byte order. Concurrency is thus lost; this is head-of-line blocking.
The way out is to bypass TCP and implement HTTP over UDP. Google's gQUIC protocol does exactly that, and Bilibili used it in production several years ago.
UDP itself does not guarantee reliable delivery, so gQUIC must reimplement on top of UDP much of what TCP already does. This is the direction HTTP is heading: the HTTP/3 standard is currently being built on QUIC, which evolved from gQUIC.
4. Information security optimization
Finally, let's discuss optimization from the perspective of network information security. It touches encoding, channels and transmission paths alike, yet forms an independent layer of its own, so I have saved it for last.
Information security on the Internet began with SSL 3.0 in 1995. By now, many large websites have upgraded to TLS 1.3, released in 2018.
What is wrong with TLS 1.2? Its biggest problem is that it still supports old key exchange algorithms that are no longer secure. The 2015 FREAK man-in-the-middle attack, for example, could use rented Amazon virtual machines to break servers that still supported the old export-grade algorithms in very little time.
TLS 1.3 removes the asymmetric key exchange algorithms that are no longer mathematically safe against today's computing power. The current OpenSSL implementation supports only five cipher suites: TLS_AES_128_GCM_SHA256, TLS_AES_256_GCM_SHA384, TLS_CHACHA20_POLY1305_SHA256, TLS_AES_128_CCM_SHA256 and TLS_AES_128_CCM_8_SHA256.
Another advantage of TLS 1.3 is handshake speed. TLS 1.2 needs two RTTs to negotiate keys, so session caches and session tickets were introduced as two tools to cut resumed handshakes down to one RTT; but neither mechanism can resist replay attacks.
TLS 1.3 merges TLS 1.2's two steps, cipher suite negotiation and the ECDHE public key exchange, into one, greatly speeding up the handshake.
If you are still on TLS 1.2, upgrade to 1.3 as soon as possible; besides security, there are real performance gains.
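In application code you can enforce that floor explicitly. For example, Python's ssl module (3.7+, with OpenSSL 1.1.1 or later) lets a client refuse anything below TLS 1.3:

```python
import ssl

# Build a client context that will not negotiate anything older than
# TLS 1.3 (the handshake fails against servers that cannot speak it).
ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_3

print(ctx.minimum_version.name)  # TLSv1_3
```

Servers can apply the same constraint, which is a simple way to make sure no legacy key exchange ever happens on your endpoints.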
There are many ways to optimize HTTP performance. From these four dimensions we can build a tree-shaped body of knowledge that covers most HTTP optimization points.
Optimizing encoding efficiency, for both HTTP headers and bodies, makes the transmitted data shorter and more compact, yielding lower latency and higher concurrency; a good coding algorithm also consumes less CPU for encoding and decoding.
Optimizing channel utilization, through multiplexing, error detection and recovery, and resource allocation, lets the fast lower-level channel effectively carry the slower application-level channels.
Optimizing the transmission path, through caching at every level, slow start tuning, push-style message transmission and so on, delivers messages to the browser sooner and improves the user experience.
Internet information security today rests mainly on the TLS protocol; TLS 1.3 improves both security and performance greatly, and we should upgrade promptly.
I hope this knowledge helps you optimize the HTTP protocol comprehensively and efficiently!
Tencent Cloud TVP
TVP (Tencent Cloud Valuable Professional) is an honorary certification that Tencent Cloud grants to technical experts, in thanks for their contribution to the development of cloud computing. These experts come from many technical fields and industries; they are keen to practice and to share, and have made outstanding contributions to building technical communities and promoting cloud computing.
To learn more about TVP, please follow “cloud plus community” and reply to “TVP”.