How to use the wget command on a Linux system

Time:2020-10-27

1、 Introduction to wget

wget is a command-line download tool for Linux. It is free software released under the GPL. wget supports the HTTP and FTP protocols, proxy servers, and resumable downloads. It can recurse through the directories of a remote host, find the files that match your criteria, and download them to the local disk. If required, wget will rewrite the hyperlinks in the downloaded pages so that the local copy can be browsed offline. Because it has no interactive interface, wget can run in the background; it intercepts and ignores the hangup signal, so it keeps running after the user logs out. wget is usually used to download files from web sites in batches or to mirror remote sites.
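As a quick illustration of that background behaviour (the URL and log file name are just placeholders), one might run:

$ wget -b -o download.log http://example.com/big-file.iso

-b sends wget to the background right away and -o collects its messages in download.log, so you can log out and check the progress later with tail -f download.log.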

2、 Examples

Download the 192.168.1.168 home page and show the download information:

wget -d http://192.168.1.168

Download the 192.168.1.168 home page without printing anything:

wget -q http://192.168.1.168

Download all of the linked files listed in filelist.txt:

wget -i filelist.txt

Download to a specified directory:

wget -P /tmp ftp://user:password@host/file

This downloads the file into the /tmp directory. wget is a command-line download tool that we Linux users reach for almost every day. Here are some tips that will help you use wget more efficiently and flexibly.

The code is as follows:

$ wget -r -np -nd http://example.com/packages/

This command downloads all of the files in the packages directory on http://example.com. The -np option keeps wget from traversing the parent directory, and -nd means the remote directory structure is not recreated locally.

The code is as follows:

$ wget -r -np -nd --accept=iso http://example.com/centos-5/i386/

Similar to the previous command, but with an extra --accept=iso option, which tells wget to download only the files in the i386 directory whose extension is iso. You can also specify several extensions, separated by commas.
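For example, a sketch accepting more than one extension (the host and path are placeholders):

$ wget -r -np -nd --accept=iso,img http://example.com/centos-5/i386/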

The code is as follows:

$ wget -i filename.txt

This command is handy for batch downloads: put the URLs of all the files you want into filename.txt and wget will download every one of them for you.
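A small sketch of that batch workflow, with placeholder URLs in the list file:

$ cat filename.txt
http://example.com/packages/foo-1.0.tar.gz
http://example.com/packages/bar-2.1.tar.gz
$ wget -c -i filename.txt

Adding -c means an interrupted batch can simply be rerun and will resume the unfinished files.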

The code is as follows:

$ wget -c http://example.com/really-big-file.iso

The -c option shown here resumes an interrupted download from where it left off instead of starting over.
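On an unreliable link, -c is often paired with a retry policy; a hedged sketch using the same placeholder file:

$ wget -c --tries=10 --waitretry=30 http://example.com/really-big-file.iso

This retries up to ten times, waiting up to 30 seconds between attempts, and -c lets a later invocation pick up where an aborted one left off.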

The code is as follows:

$ wget -m -k (-H) http://www.example.com/

This command mirrors a web site, and wget converts the links so the copy can be browsed locally. Add the -H option if the images used by the site are hosted on another host.
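A slightly fuller mirroring sketch, assuming a placeholder site, that also pulls in page requisites and throttles the transfer rate:

$ wget -m -k -p --limit-rate=200k http://www.example.com/

-p downloads the images and style sheets each page needs, so the local copy renders properly offline.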

3、 Parameters

The code is as follows:

$ wget --help
GNU Wget 1.9.1, a non-interactive network file download tool.
Usage: wget [OPTION]... [URL]...

Mandatory arguments to long options are also mandatory when using the short options.

Startup:

-V, --version            display the version of wget and exit.
-h, --help               print this help message.
-b, --background         go to the background after starting.
-e, --execute=COMMAND    execute a `.wgetrc'-style command.
Logging and input file:

-o, --output-file=FILE     write log messages to FILE.
-a, --append-output=FILE   append log messages to the end of FILE.
-d, --debug                print debug output.
-q, --quiet                quiet mode (no output).
-v, --verbose              verbose output mode (the default).
-nv, --non-verbose         turn off verbose output without being completely quiet.
-i, --input-file=FILE      download the URLs found in FILE.
-F, --force-html           treat the input file as HTML.
-B, --base=URL             prepend URL to relative links when using -F -i FILE.
Download:

-t, --tries=NUMBER          set the number of retries (0 means unlimited).
--retry-connrefused         retry even if the connection is refused.
-O, --output-document=FILE  write the downloaded data to FILE.
-nc, --no-clobber           do not overwrite existing files, and do not write new copies with a numeric .# suffix appended to the file name.
-c, --continue              resume getting a partially downloaded file.
--progress=TYPE             select how the download progress is displayed.
-N, --timestamping          do not re-retrieve a remote file unless it is newer than the local copy.
-S, --server-response       print the server response headers.
--spider                    do not download anything.
-T, --timeout=SECONDS       set the read timeout in seconds.
-w, --wait=SECONDS          wait the given number of seconds between retrievals.
--waitretry=SECONDS         wait between retries (from 1 second up to the given number of seconds).
--random-wait               wait a random time between retrievals (from 0 to 2 * wait seconds).
-Y, --proxy=on/off          turn the proxy server on or off.
-Q, --quota=SIZE            set the quota for downloaded data.
--bind-address=ADDRESS      connect using the given address (host name or IP) of the local machine.
--limit-rate=RATE           limit the download rate to RATE.
--dns-cache=off             disable caching of DNS lookups.
--restrict-file-names=OS    restrict the characters in file names to those allowed by the given OS (operating system).
Directories:

-nd, --no-directories          do not create directories.
-x, --force-directories        force the creation of directories.
-nH, --no-host-directories     do not create a directory named after the remote host.
-P, --directory-prefix=PREFIX  save files under the directory PREFIX.
--cut-dirs=NUMBER              ignore the given number of remote directory components.
HTTP options:

--http-user=USER          set the HTTP user name.
--http-passwd=PASSWORD    set the HTTP password.
-C, --cache=on/off        allow or forbid server-side cached data (allowed by default).
-E, --html-extension      add a .html extension to all files of MIME type text/html.
--ignore-length           ignore the "Content-Length" header field.
--header=STRING           add STRING to the request headers.
--proxy-user=USER         set the proxy user name.
--proxy-passwd=PASSWORD   set the proxy password.
--referer=URL             include a "Referer: URL" header in the HTTP request.
-s, --save-headers        save the HTTP headers to the file.
-U, --user-agent=AGENT    identify as AGENT instead of Wget/VERSION.
--no-http-keep-alive      disable HTTP keep-alive (persistent connections).
--cookies=off             disable cookies.
--load-cookies=FILE       load cookies from FILE before the session starts.
--save-cookies=FILE       save cookies to FILE after the session ends.
--post-data=STRING        send STRING using the POST method.
--post-file=FILE          send the contents of FILE using the POST method.
HTTPS (SSL) options:

--sslcertfile=FILE    optional client certificate.
--sslcertkey=KEYFILE  optional key file for this certificate.
--egd-file=FILE       file name of the EGD socket.
--sslcadir=DIR        directory where the CA hash list is kept.
--sslcafile=FILE      file containing the CA certificates.
--sslcerttype=0/1     client certificate type: 0 = PEM (default), 1 = ASN1 (DER).
--sslcheckcert=0/1    check the server certificate against the provided CAs.
--sslprotocol=0-3     choose the SSL protocol: 0 = automatic, 1 = SSLv2, 2 = SSLv3, 3 = TLSv1.
FTP options:

-nr, --dont-remove-listing  do not remove the ".listing" files.
-g, --glob=on/off           turn file name globbing (wildcards) on or off.
--passive-ftp               use the "passive" transfer mode.
--retr-symlinks             in recursive mode, download the files that symbolic links point to (but not directories).
Recursive download:

-r, --recursive         recursive download.
-l, --level=NUMBER      maximum recursion depth (inf or 0 means unlimited).
--delete-after          delete the files locally after downloading them.
-k, --convert-links     convert absolute links to relative links.
-K, --backup-converted  back up file X as X.orig before converting it.
-m, --mirror            shorthand for -r -N -l inf -nr.
-p, --page-requisites   download all the files, such as images, needed to display the page properly.
--strict-comments       turn on strict (SGML) handling of HTML comments.
Accept/reject options for recursive downloads:

-A, --accept=LIST               comma-separated list of accepted file patterns.
-R, --reject=LIST               comma-separated list of rejected file patterns.
-D, --domains=LIST              comma-separated list of accepted domains.
--exclude-domains=LIST          comma-separated list of rejected domains.
--follow-ftp                    follow FTP links found in HTML documents.
--follow-tags=LIST              comma-separated list of HTML tags to follow.
-G, --ignore-tags=LIST          comma-separated list of HTML tags to ignore.
-H, --span-hosts                allow recursion to visit other hosts.
-L, --relative                  follow relative links only.
-I, --include-directories=LIST  list of directories to download.
-X, --exclude-directories=LIST  list of directories to exclude.
-np, --no-parent                do not ascend to the parent directory.
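To show how several of these options combine in practice, here is a hedged sketch of a selective, rate-limited recursive download (the host and path are placeholders):

$ wget -r -l 2 -np -A pdf,ps -w 2 --random-wait --limit-rate=100k http://example.com/docs/

This fetches only PDF and PostScript files up to two levels deep, never ascends above /docs/, and paces the requests so the server is not hammered.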

4、 Example: using wget to batch-download files from a remote FTP server
I bought a VPS yesterday and migrated my virtual host to it. Data has to be transferred during such a migration, and the old shared-hosting way of doing it was very inefficient: pack everything up and download it from the old host, then upload and unpack it on the new one. With the very low bandwidth of a home connection and ADSL's unchanging 512 Kbps uplink, migrating a web site used to be pure manual labour.

Now, with a VPS and a shell, the process is extremely simple. Thanks to the large bandwidth of the data centre, watching files being transferred directly between machine rooms is almost a pleasure.

OK, let’s talk about the method:

1. On the old virtual host, pack a backup of the whole site as site.tar.gz.

2. In the VPS shell, use wget to download site.tar.gz over FTP.

The code is as follows:

wget --ftp-user=username --ftp-password=password -m -nH ftp://xxx.xxx.xxx.xxx/xxx/xxx/site.tar.gz
wget --ftp-user=username --ftp-password=password -r -m -nH ftp://xxx.xxx.xxx.xxx/xxx/xxx/*

Those are the commands; the FTP user name and password parameters need no explanation.

-r is optional and means recursive download; it is required if you want to download an entire directory.

-m stands for mirroring; no further explanation needed.

-nH keeps wget from creating a directory named after the remote host, so the files are saved starting from the current directory, which is a very convenient option.

Last comes the FTP address; the * after the slash means download every file in that directory. If you only want one file, just give its name.
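If the files sit several directories deep, -nH is often combined with --cut-dirs so that the intermediate path is not recreated locally either; a small sketch with a placeholder host and path:

wget --ftp-user=username --ftp-password=password -r -nH --cut-dirs=2 ftp://ftp.example.com/backups/2020/*

Here --cut-dirs=2 drops the leading backups/2020 components, so the downloaded files land directly in the current directory.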

5、 Q & A

A. How do I download a whole site? wget ships with every major Linux distribution. To grab a single page, run: bash$ wget http://place.your.url/here. wget can also retrieve the complete directory tree of a site over FTP; of course, if you are not careful, you may end up downloading the entire site plus all the sites it links to: bash$ wget -m http://target.web.site/subdirectory. Because the tool has such strong download ability, it can be used on a server as a mirroring tool; it honours robots.txt, and many parameters control how it mirrors, letting you restrict which link types to follow and which file types to download. For example, to follow only relative links and skip GIF images:

The code is as follows:

bash$ wget -m -L --reject=gif http://target.web.site/subdirectory

wget can also resume an interrupted download (the -c option); naturally, this needs support from the remote server:


The code is as follows:

bash$ wget -c http://the.url.of/incomplete/file

You can combine resuming with mirroring, so that even if the transfer has been interrupted many times you can keep mirroring a site that contains a large number of selected files. How to automate this is discussed later.
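A hedged sketch of such a resumable mirror run, reusing the placeholder site from above (depending on the wget version, you may need to adjust how -c interacts with the timestamping that -m implies):

bash$ wget -m -c http://target.web.site/subdirectory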

If you are worried that a download will disturb your work, you can limit the number of times wget retries:


The code is as follows:

bash$ wget -t 5 http://place.your.url/here

To never give up and retry forever, use the -t inf parameter.
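A sketch of that never-give-up form, reusing the placeholder URL from above:

bash$ wget -t inf -c http://place.your.url/here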

B. What about proxy servers? You can set the proxy environment variables (such as http_proxy) or configure the proxy in the .wgetrc file so that downloads go through the proxy. There is one problem, however: resuming through a proxy can fail several times. If a download is interrupted and you later run "wget -c" to fetch the rest of the file, the proxy checks its cache and wrongly decides that you already have the whole file, so it returns an error. You can add a specific request header to force the proxy to bypass its cache:

The code is as follows:

bash$ wget -c --header="Pragma: no-cache" http://place.your.url/here

The --header parameter can be repeated as many times as you like, and with it you can change almost any property of the request that the web server or proxy sees. Some sites refuse to serve files to external links and only deliver content to requests coming from other pages on the same site; in that case you can add a "Referer:" header: bash$ wget --header="Referer: http://coming.from.this/page" http://surfing.to.this/page. Some particularly awkward sites only support a certain browser; then you can fake the "User-Agent:" header:

The code is as follows:

bash$ wget --header="User-Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)" http://msie.only.url/here

C. How do I set the download time?
If you need to download large files at the office over a connection shared with colleagues, and you do not want them to suffer from the network slowing down, you should avoid the rush hours. Of course, you do not have to wait in the office until everyone has left, nor remember to start the download from home after dinner; with at you can schedule the job for exactly the time you want:

bash$ at 23:00
warning: commands will be executed using /bin/sh
at> wget http://place.your.url/here
at> (press Ctrl-D)

This schedules the download to start at 11 p.m. For the schedule to work, make sure the atd daemon is running in the background.

D. What if the download takes a very long time?
When you need to download a large amount of data and your bandwidth is limited, you will often find the working day starting again before your scheduled download has finished.
Being a considerate colleague, you then stop the jobs and restart them in the evening, which means running "wget -c" over and over by hand. That quickly becomes tedious, so it is better to automate it with crontab. Create a plain text file called crontab.txt containing lines such as:

0 23 * * 1-5 wget -c -N http://place.your.url/here
0 6 * * 1-5 killall wget

A crontab file like this specifies that certain tasks are performed periodically: the first five columns state when the command runs, and the rest of each line tells cron what to execute.

The first line starts the wget download at 11:00 p.m., and the second stops all wget downloads at 6:00 a.m. The * in the third and fourth columns means the job runs on every day of every month; the fifth column gives the days of the week on which to run, with "1-5" meaning Monday to Friday. So the download starts at 11 p.m. every weekday and any remaining wget job is stopped by 6 a.m. To install this schedule, run the following command.

The code is as follows:

bash$ crontab crontab.txt

The -N option makes wget check the timestamp of the target file; if it matches the remote one, the download stops, because that indicates the whole file has already been transferred. Use "crontab -r" to remove the schedule. I have used this approach many times to download a lot of ISO images over a shared dial-up connection, and it is quite practical.

E. How do I download dynamic web pages?
Some pages change several times a day on demand, so the target is technically no longer a single file and has no fixed length, which makes the -c option meaningless:


The code is as follows:

bash$ wget http://lwn.net/bigpage.php3

The network connection in my office is often very poor, which made downloading troublesome, so I wrote a simple script that keeps checking whether the dynamic page has been transferred completely:


The code is as follows:

#!/bin/bash
#create it if absent
touch bigpage.php3
#check if we got the whole thing
while ! grep -qi '</html>' bigpage.php3
do
rm -f bigpage.php3
#download LWN in one big page
wget http://lwn.net/bigpage.php3
done

This script keeps re-downloading the page until the closing </html> tag appears in it, which means the file has arrived completely.

F. What about SSL and cookies?
If you want to access content served over SSL, the address starts with "https://". In that case you need another download tool called curl, which is easy to obtain. Some sites also insist that their users accept cookies while browsing; then you must build a "Cookie:" header from a cookie you have already obtained on the site, so that the download request carries the right parameters:


The code is as follows:

bash$ cookie=$( grep nytimes ~/.lynx_cookies | awk '{printf("%s=%s;",$6,$7)}' )

This constructs a request cookie for downloading content from http://www.nytimes.com, assuming your cookies come from the Lynx browser. w3m uses a slightly different, smaller cookie file format:


The code is as follows:

bash$ cookie=$( grep nytimes ~/.w3m/cookie | awk '{printf("%s=%s;",$2,$3)}' )

Now you can download it in this way:


The code is as follows:

bash$ wget --header="Cookie: $cookie" http://www.nytimes.com/reuters/technology/tech-tech-supercomput.html

Or use the curl tool:


The code is as follows:

bash$ curl -v -b "$cookie" -o supercomp.html http://www.nytimes.com/reuters/technology/tech-tech-supercomput.htm
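For a plain https download without cookies, curl on its own is usually enough; a minimal sketch, assuming the URL is just a placeholder:

bash$ curl -O https://secure.example.com/files/report.pdf

The -O option saves the file under its remote name.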

G. How do I create a list of addresses to download?
So far we have downloaded either a single file or an entire site. Sometimes you need to download a large number of files linked from one page without mirroring the whole site, for example the latest 20 songs out of a list of 100. Here the --accept and --reject options will not help, because they only act on the files being downloaded; instead, use "lynx -dump":


The code is as follows:

bash$ lynx -dump ftp://ftp.ssc.com/pub/lg/ | grep 'gz$' | tail -10 | awk '{print $2}' > urllist.txt

The output of lynx can be filtered with the usual GNU text-processing tools. In the example above, we keep only the link addresses ending in "gz" and write the last 10 of them to urllist.txt. A simple bash script can then download each file listed there automatically:


The code is as follows:

bash$ for x in $(cat urllist.txt)
> do
> wget $x
> done

This fetches the latest 10 issues from the Linux Gazette site (ftp://ftp.ssc.com/pub/lg/).
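Since the addresses are already collected in a file, the loop above could, as a hedged alternative, be replaced by wget's own batch option from section 2:

bash$ wget -c -i urllist.txt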

H. How can I use more of the available bandwidth?
If the file you want sits on a server that limits the bandwidth per connection, your download will be slow. The following trick can shorten the transfer considerably, but it requires curl and a remote file that is mirrored on several servers. For example, suppose you want to download Mandrake 8.0 from the following three addresses:


The code is as follows:

url1=http://ftp.eecs.umich.edu/pub/linux/mandrake/iso/Mandrake80-inst.iso
url2=http://ftp.rpmfind.net/linux/Mandrake/iso/Mandrake80-inst.iso
url3=http://ftp.wayne.edu/linux/mandrake/iso/Mandrake80-inst.iso

The file is 677281792 bytes long, so use curl with its --range option to start three downloads at the same time:


The code is as follows:

bash$ curl -r 0-199999999 -o mdk-iso.part1 $url1 &
bash$ curl -r 200000000-399999999 -o mdk-iso.part2 $url2 &
bash$ curl -r 400000000- -o mdk-iso.part3 $url3 &

This creates three background processes, each transferring a different part of the ISO image from a different server; the -r option specifies the byte range of the target file to fetch.
When they finish, a simple cat command joins the three pieces together: cat mdk-iso.part? > mdk-80.iso (checking the MD5 sum before burning the image is strongly recommended).
You can also give each curl process the --verbose option and run it in its own window to watch the progress of each transfer.
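A hedged sketch of that final assembly step, using the part names from above (the reference checksum would come from the mirror site):

bash$ cat mdk-iso.part1 mdk-iso.part2 mdk-iso.part3 > mdk-80.iso
bash$ md5sum mdk-80.iso    # compare the result with the published checksum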