Hand-writing a multi-threaded resumable (breakpoint continuation) downloader: what I learned

Time:2021-3-8

This article is included in Github.com/niumoo/JavaNotes, which collects the core knowledge Java programmers need to master. Stars and suggestions are welcome.
You are also welcome to follow my official account. Articles are updated weekly.

Thank you for coming in. I had nothing to do over the weekend when my colleague, brother Qiang, asked me: "Have you ever implemented resumable downloads (breakpoint continuation)?" I thought about it. I have used resumable downloads many times, but I had never really thought about the details. So, after some thinking, here is this article. Thanks to brother Qiang for giving me something to write about. Below I use pure Java to implement a simple, dependency-free, multi-threaded resumable downloader.

What does this article cover? Here is a brief list, along with a few questions to think about.

  1. The principle of resumable downloads.
  2. How to ensure file consistency when resuming a download?
  3. How to download the same file with multiple threads?
  4. Network speed and bandwidth are fixed, so why can multi-threaded downloading be faster?

What knowledge does a multi-threaded resumable downloader require? The questions above are a good place to start. Most services are now used online and download scenarios are becoming rarer, but that does not stop us from exploring the principles.

The principle of resumable downloads

To understand how resumable downloads work, you have to understand the HTTP protocol. HTTP is one of the most widely used application protocols on the Internet, and it transfers data over TCP/IP. The secret of resumable downloads is hidden in HTTP.

We all know that an HTTP exchange has request headers and response headers, and both include headers related to ranges. Below, the download link of the Baidu Netdisk PC client is used for testing.

If you want to know more about curl, see my previous article: Come in and enjoy the unique skill of curl

$ curl -I http://wppkg.baidupcs.com/issue/netdisk/yunguanjia/BaiduYunGuanjia_7.0.1.1.exe
HTTP/1.1 200 OK
Server: JSP3/2.0.14
Date: Sat, 25 Jul 2020 13:41:55 GMT
Content-Type: application/x-msdownload
Content-Length: 65804256
Connection: keep-alive
ETag: dcd0bfef7d90dbb3de50a26b875143fc
Last-Modified: Tue, 07 Jul 2020 13:19:46 GMT
Expires: Sat, 25 Jul 2020 14:05:19 GMT
Age: 257796
Accept-Ranges: bytes
Cache-Control: max-age=259200
Content-Disposition: attachment;filename="BaiduYunGuanjia_7.0.1.1.exe"
x-bs-client-ip: MTgwLjc2LjIyLjU0
x-bs-file-size: 65804256
x-bs-request-id: MTAuMTM0LjM0LjU2Ojg2NDM6NDM4MTUzMTE4NTU3ODc5MTIxNzoyMDIwLTA3LTA3IDIyOjAxOjE1
x-bs-meta-crc32: 3545941535
Content-MD5: dcd0bfef7d90dbb3de50a26b875143fc
superfile: 2
Ohc-Response-Time: 1 0 0 0 0 0
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, PUT, POST, DELETE, OPTIONS, HEAD
Ohc-Cache-HIT: bj2pbs54 [2], bjbgpcache54 [4]

The download link of the Baidu Netdisk PC client returns a lot of response headers. We only need to focus on a few.

Content-Length: 65804256 // size of the requested file, in bytes
Accept-Ranges: bytes // range requests are allowed; "bytes" means the range unit is bytes, "none" means range requests are not supported
Last-Modified: Tue, 07 Jul 2020 13:19:46 GMT // last modified time of the file on the server, can be used to check whether the file has changed
x-bs-meta-crc32: 3545941535 // CRC32 checksum, can be used to check whether the file has changed
ETag: dcd0bfef7d90dbb3de50a26b875143fc // ETag, can be used to check whether the file has changed

As you can see, not every download supports resuming: it is only possible when the response headers contain Accept-Ranges: bytes. Given that header, how do we resume? The client only needs to specify the byte range it wants in the Range request header. A server that honors the request answers with status 206 Partial Content and describes the returned part in the Content-Range response header.
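As a minimal sketch of that check, the helper below looks for Accept-Ranges: bytes in a header map of the shape returned by HttpURLConnection.getHeaderFields(). The class and method names are hypothetical and not taken from the article's repository. Header names are compared case-insensitively because HTTP header names are not case-sensitive.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original downloader.
final class RangeSupport {
    // Returns true when the response headers advertise byte-range support.
    static boolean supportsByteRanges(Map<String, List<String>> headerFields) {
        for (Map.Entry<String, List<String>> entry : headerFields.entrySet()) {
            String name = entry.getKey(); // the status line may be stored under a null key
            if (name == null || !name.equalsIgnoreCase("Accept-Ranges")) {
                continue;
            }
            for (String value : entry.getValue()) {
                if ("bytes".equalsIgnoreCase(value.trim())) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> headers = new HashMap<>();
        headers.put("Accept-Ranges", Collections.singletonList("bytes"));
        System.out.println(supportsByteRanges(headers));
    }
}
```

In a real downloader the map would come from a HEAD or GET response. Here it is built by hand so the example runs without network access.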

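Range (in the request) and Content-Range (in the response) have several formats.

Range: <unit>=<range-start>-<range-end> // request the bytes from range-start to range-end, both inclusive
Range: <unit>=<range-start>- // request everything from range-start to the end of the file
Content-Range: <unit> <range-start>-<range-end>/<size> // size is the total file size
Content-Range: <unit> <range-start>-<range-end>/* // the total size is unknown, so * is used
Content-Range: <unit> */<size> // sent with a 416 response when the requested range cannot be satisfied

For example:

With bytes as the unit, download from byte offset 10 to the end of the file: Range: bytes=10-.

With bytes as the unit, download from byte offset 10 through byte offset 100: Range: bytes=10-100.

Byte offsets start at 0 and both ends are inclusive, so bytes=10-100 covers 91 bytes.

This is the principle behind resumable downloads. The start and end of a range are also what make it possible to download a file in segments.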
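In HTTP the client sends the wanted range in the Range request header, and a 206 response reports the returned range in Content-Range. The sketch below builds the first and parses the second. The class and method names are made up for illustration, and only the bytes unit is handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original downloader.
final class RangeHeaders {
    // Matches values such as "bytes 10-100/65804256" or "bytes 10-100/*".
    private static final Pattern CONTENT_RANGE =
            Pattern.compile("bytes (\\d+)-(\\d+)/(\\d+|\\*)");

    // Value for the Range request header; a negative end means "to the end of the file".
    static String rangeValue(long start, long end) {
        return end < 0 ? "bytes=" + start + "-" : "bytes=" + start + "-" + end;
    }

    // Returns {start, end, size}; size is -1 when the server sent "*".
    static long[] parseContentRange(String value) {
        Matcher m = CONTENT_RANGE.matcher(value.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported Content-Range: " + value);
        }
        long size = "*".equals(m.group(3)) ? -1 : Long.parseLong(m.group(3));
        return new long[] {Long.parseLong(m.group(1)), Long.parseLong(m.group(2)), size};
    }

    public static void main(String[] args) {
        System.out.println(rangeValue(10, 100));
        long[] r = parseContentRange("bytes 10-100/65804256");
        // Both ends are inclusive, so the length is end - start + 1.
        System.out.println((r[1] - r[0] + 1) + " bytes of " + r[2]);
    }
}
```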

How to ensure file consistency?

File integrity has two aspects: one in the download phase and one in the write phase.

Because the downloader supports resuming, how can we be sure the file on the server has not been updated since our last download? This can be judged from several values in the response headers.

Last-Modified: Tue, 07 Jul 2020 13:19:46 GMT // last modified time of the file on the server, can be used to check whether the file has changed
ETag: dcd0bfef7d90dbb3de50a26b875143fc // ETag, can be used to check whether the file has changed
x-bs-meta-crc32: 3545941535 // CRC32 checksum, can be used to check whether the file has changed

Last-Modified and ETag can both be used to check whether the file has been updated. According to the HTTP specification, a new ETag value is generated when the file changes, so it acts like a fingerprint of the file. Last-Modified, on the other hand, sometimes cannot prove that the content has changed (its resolution is only one second, for example).
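One way to apply this is to save the validators from the first download and compare them before resuming. The following is only a sketch under that assumption. The class name, the method name, and the idea of storing both values are hypothetical, and the article's repository may do this differently.

```java
// Hypothetical helper, not part of the original downloader.
final class ResumeCheck {
    // Decides whether a partial download may be resumed.
    // Prefers the ETag and falls back to Last-Modified when no ETag is available.
    static boolean canResume(String savedETag, String currentETag,
                             String savedLastModified, String currentLastModified) {
        if (savedETag != null && currentETag != null) {
            return savedETag.equals(currentETag);
        }
        return savedLastModified != null && savedLastModified.equals(currentLastModified);
    }

    public static void main(String[] args) {
        String etag = "dcd0bfef7d90dbb3de50a26b875143fc";
        System.out.println(canResume(etag, etag, null, null));
    }
}
```

If the check fails, the safe choice is to discard the partial data and download from the start. HTTP also defines an If-Range request header that lets the server make this comparison. The sketch above does the check on the client instead.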

That covers the consistency check in the download phase. What about the write phase? Whether single-threaded or multi-threaded, resuming means data must be written at a specified position in the file. Is there a good way to do that in Java?

The answer is the RandomAccessFile class. Unlike the stream classes, RandomAccessFile lets you specify a read-write mode and use the seek method to move the file pointer to an arbitrary position, which makes it well suited to writing in a resumable downloader.

For example, write the characters abc starting at position 0 of test.txt and the characters ddd starting at position 100:

try (RandomAccessFile rw = new RandomAccessFile("test.txt", "rw")) { // "rw" is read-write mode
    rw.seek(0); // move the file pointer to byte offset 0
    rw.writeChars("abc"); // writeChars writes each char as 2 bytes
    rw.seek(100);
    rw.writeChars("ddd");
}

Writing in a resumable downloader relies on it: when resuming, you only need to move the file pointer to the position where the download left off.

The seek method has other handy uses too. You can jump straight to a known position for fast lookups, and you can read and write different positions of the same file concurrently.
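To illustrate concurrent writes to different positions of one file, here is a small sketch in which each thread opens its own RandomAccessFile and seeks to its own offset. The class and method names are hypothetical, and the regions written by different threads are assumed not to overlap.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original downloader.
final class ConcurrentSeekWrite {
    // Writes data at the given byte offset through its own file handle,
    // so every caller has an independent file pointer.
    static void writeAt(Path file, long offset, byte[] data) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            raf.seek(offset);
            raf.write(data);
        }
    }

    // Starts one thread per part and waits for all of them; parts[i] goes to offsets[i].
    static void writeParts(Path file, long[] offsets, byte[][] parts) throws InterruptedException {
        Thread[] threads = new Thread[parts.length];
        for (int i = 0; i < parts.length; i++) {
            final int idx = i;
            threads[i] = new Thread(() -> {
                try {
                    writeAt(file, offsets[idx], parts[idx]);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("seek-demo", ".bin");
        byte[] first = "abc".getBytes(StandardCharsets.US_ASCII);
        byte[] second = "ddd".getBytes(StandardCharsets.US_ASCII);
        writeParts(file, new long[] {0, 3}, new byte[][] {first, second});
        System.out.println(new String(Files.readAllBytes(file), StandardCharsets.US_ASCII));
        Files.delete(file);
    }
}
```

Using a separate handle per thread avoids sharing one file pointer, which would otherwise need synchronization around every seek and write pair.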

How to implement multi-threaded downloading?

Multi-threaded downloading has each thread download one part of the file, and then the parts are assembled into a complete file. Not a single byte may be wrong in the process, or the assembled file will be unusable. So how do we download just part of a file? The Range request header introduced in the section on resuming does exactly that: just calculate the byte range of each part.

For example, with bytes as the unit, a part that starts at byte offset 10 and ends at byte offset 100: Range: bytes=10-100.
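The range calculation can be sketched as follows. This is one simple way to split a file evenly, with the last part absorbing the remainder. The class and method names are hypothetical, and the file size in the usage example is the Content-Length from the curl output earlier in the article.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original downloader.
final class RangeSplitter {
    // Splits the offsets 0..fileSize-1 into inclusive {start, end} ranges.
    // At most `parts` ranges are returned; the last one takes the remainder.
    static List<long[]> split(long fileSize, int parts) {
        if (fileSize <= 0 || parts <= 0) {
            throw new IllegalArgumentException("fileSize and parts must be positive");
        }
        int n = (int) Math.min(parts, fileSize); // never create empty ranges
        long chunk = fileSize / n;
        List<long[]> ranges = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            long start = i * chunk;
            long end = (i == n - 1) ? fileSize - 1 : start + chunk - 1;
            ranges.add(new long[] {start, end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] r : split(65804256L, 4)) {
            System.out.println("Range: bytes=" + r[0] + "-" + r[1]);
        }
    }
}
```

Each range can then be handed to one download thread, which requests exactly that range and writes the bytes at the matching offset.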

Network speed and bandwidth are fixed, so why can multi-threaded downloading be faster?

This is an interesting question. The maximum network speed is fixed: if your operator gives you 100 Mbps, then no matter how you use it, the maximum speed is 100 / 8 = 12.5 MB/s. Since that is the bottleneck, why can multi-threaded downloading be faster? In theory, a single-threaded download can reach the maximum speed. In practice, though, the network is often congested rather than smooth, and the ideal maximum is hard to reach. In other words, multi-threaded downloading only speeds things up when the network is not smooth. Otherwise a single thread is enough, and the upper limit is always the network bandwidth.
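As a worked example of that upper limit, take the 65804256-byte file from the curl output earlier in the article and assume 1 MB = 10^6 bytes with no protocol overhead. Both are simplifying assumptions, so the result is only a lower bound on the download time:

```latex
\[
\frac{100\ \text{Mbit/s}}{8\ \text{bit/byte}} = 12.5\ \text{MB/s},
\qquad
t_{\min} = \frac{65\,804\,256\ \text{bytes}}{12.5 \times 10^{6}\ \text{bytes/s}} \approx 5.26\ \text{s}
\]
```

No number of threads can bring the download below this time. Extra connections can only help the download get closer to it.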

So why does multi-threading help? HTTP transmits data over TCP, so to understand this we need to look at TCP's congestion control mechanism. Congestion control is TCP's algorithm for avoiding network congestion, and it is based on additive increase / multiplicative decrease (AIMD).

In short, when TCP starts transmitting data, the sender (here, the server) continuously probes for available bandwidth. After a segment is successfully acknowledged, it sends twice as much. If that succeeds too, it keeps doubling until packet loss occurs. This is called slow start. When the slow start threshold (ssthresh) is reached, slow start gives way to a linear growth phase that adds only one segment at a time, slowing the growth rate. In my view the doubling in slow start is not slow at all. It is just the name.

But when packet loss occurs, that is, when congestion is detected, the sender reduces its sending rate by a multiplicative factor. For example, on a timeout the slow start threshold drops to half the size of the congestion window before the timeout, the congestion window is reduced to 1 MSS, and the connection goes back to the slow start phase. This is where multi-threading pays off: with several connections the overall slowdown is less drastic, because while one connection is backing off, another may be near the end of its slow start ramp-up. So the overall download speed is better than with a single thread.
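The behavior described above can be illustrated with a toy simulation. It is a deliberately simplified model and not real TCP: the window is counted in whole MSS units, time advances one round trip per step, and every loss is treated as a timeout. The class name, method name, and parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of slow start and additive increase / multiplicative decrease.
final class CongestionWindowSketch {
    // Returns the congestion window (in MSS) at the start of each round trip.
    // lossRounds holds the round indexes in which a timeout is detected.
    static List<Integer> simulate(int rounds, int initialSsthresh, Set<Integer> lossRounds) {
        List<Integer> history = new ArrayList<>();
        int cwnd = 1;
        int ssthresh = initialSsthresh;
        for (int round = 0; round < rounds; round++) {
            history.add(cwnd);
            if (lossRounds.contains(round)) {
                ssthresh = Math.max(cwnd / 2, 2); // threshold drops to half the window
                cwnd = 1;                         // back to slow start
            } else if (cwnd < ssthresh) {
                cwnd = Math.min(cwnd * 2, ssthresh); // slow start: double per round trip
            } else {
                cwnd += 1;                           // linear growth: one segment per round trip
            }
        }
        return history;
    }

    public static void main(String[] args) {
        Set<Integer> losses = new HashSet<>(Arrays.asList(5));
        System.out.println(simulate(8, 8, losses));
    }
}
```

Running the simulation for two connections with losses in different rounds and adding the two windows per round shows the effect described above: the combined window dips less sharply than either connection alone.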

Implementing the multi-threaded resumable downloader

With the principles above, the implementation idea should be clear. Use multiple threads, each requesting one segment of the file with the Range header and saving it to a temporary file. When all downloads finish, use RandomAccessFile to merge the temporary files into one file. To resume, just read the current size of each temporary file and adjust the Range accordingly, then continue downloading.
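The resume step described above can be sketched like this: given the range a part is responsible for and the size of its temporary file, compute the Range value for the bytes that are still missing. The class and method names are hypothetical and may not match the article's repository.

```java
import java.io.File;

// Hypothetical helper, not part of the original downloader.
final class ResumeOffset {
    // Range value for the bytes still missing from the part partStart..partEnd (inclusive),
    // or null when the part is already complete.
    static String remainingRange(long partStart, long partEnd, long alreadyDownloaded) {
        long expected = partEnd - partStart + 1;
        if (alreadyDownloaded < 0 || alreadyDownloaded > expected) {
            throw new IllegalStateException("Temporary file does not match the part size");
        }
        if (alreadyDownloaded == expected) {
            return null;
        }
        return "bytes=" + (partStart + alreadyDownloaded) + "-" + partEnd;
    }

    // Same calculation, reading the progress from the part's temporary file.
    // File.length() returns 0 when the file does not exist yet.
    static String remainingRangeFor(long partStart, long partEnd, File tempFile) {
        return remainingRange(partStart, partEnd, tempFile.length());
    }

    public static void main(String[] args) {
        // The part covers offsets 100..199 and 40 bytes are already on disk.
        System.out.println(remainingRange(100, 199, 40));
    }
}
```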

There is not much code. Below is part of the core code. The complete code is in the GitHub repository linked at the end of the article.

  1. Use Range to request the specified interval of the file.
URL httpUrl = new URL(url);
HttpURLConnection httpConnection = (HttpURLConnection) httpUrl.openConnection();
httpConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36");
// Range takes only <start>-<end>; the "/<size>" suffix belongs to the Content-Range response header
httpConnection.setRequestProperty("Range", "bytes=" + start + "-" + end);
InputStream inputStream = httpConnection.getInputStream();
  2. Get the ETag of the file.
Map<String, List<String>> headerFields = httpConnection.getHeaderFields();
List<String> eTagList = headerFields.get("ETag");
System.out.println(eTagList.get(0));
  3. Use RandomAccessFile to continue writing to the file.
RandomAccessFile oSavedFile = new RandomAccessFile(httpFileName, "rw");
oSavedFile.seek(localFileContentLength); // move the write position to the end of what has already been downloaded
byte[] buffer = new byte[1024 * 10];
int len = -1;
while ((len = inputStream.read(buffer)) != -1) {
    oSavedFile.write(buffer, 0, len);
}

After downloading part of it, close the program and start it again.

The full code has been uploaded to github.com/niumoo/down-bit.

Last words

This article is included in Github.com/niumoo/JavaNotes. Stars and suggestions are welcome. The repository also contains many articles on interview topics at large companies and the core knowledge Java programmers need to master. You are welcome to star it and help improve it. I hope we can become excellent together.

If the article was helpful, you can click "like" or "share". It is all support, and I appreciate all of it!
Articles are updated every week. To follow my latest articles and other useful material, you can follow the "Unread code" official account or my blog.
