Python 3 Web Crawler Actual Warfare – 22. Using Urllib: Parsing Links

Time:2019-9-11

Last article: Python 3 Web Crawler Actual Warfare – 21. Using Urllib: Handling Exceptions
Next article: Python 3 Web Crawler Actual Warfare – 23. Using Urllib: Analyzing Robots Protocol

The urllib library also provides the parse module, which defines a standard interface for working with URLs: splitting a URL into its parts, combining parts back into a URL, and converting links. It supports URLs of the following schemes: file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, shttp, sip, sips, snews, svn, svn+ssh, telnet, wais. This section introduces the module's most commonly used methods to show how convenient it is.

1. urlparse()

The urlparse() method identifies the parts of a URL and splits it into them. Let's start with an example:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result, sep='\n')

Here we use urlparse() to parse a URL, then print the type of the result followed by the result itself, one per line.

Output:

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

As you can see, the result is a ParseResult object with six parts: scheme, netloc, path, params, query, and fragment.

Look again at the URL from the example:

http://www.baidu.com/index.html;user?id=5#comment

urlparse() splits it into six parts. A closer look shows that the split relies on specific delimiters: the part before :// is the scheme, which names the protocol; the part before the first / that follows is the netloc, that is, the domain name, and what comes after it is the path; the part after the semicolon is params, the parameters; the part after the question mark is the query; and the part after the # is the fragment.

So a standard link has the following format:

scheme://netloc/path;parameters?query#fragment

A standard URL follows this pattern, and urlparse() can split it into these parts.
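To make the mapping between this format and the six fields concrete, here is a minimal sketch that prints each field next to its value. It relies on ParseResult being a named tuple, so its _asdict() method is available; the variable names are only for illustration.

```python
from urllib.parse import urlparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'

# ParseResult is a named tuple, so _asdict() maps field names to values.
parts = urlparse(url)._asdict()

# Print every field name next to the piece of the URL it holds.
for name, value in parts.items():
    print(f'{name:>8}: {value}')
```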

Beyond this basic usage, does the urlparse() method take any other options? Let's look at its API:

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

You can see that it takes three parameters:

  • urlstring is required; it is the URL to parse.
  • scheme is the default protocol (such as http or https). If the link carries no protocol information, this value is used as its scheme.

Here is an example:

from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

Output:

ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')

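You can see that the URL we supplied has no scheme at the front, but because we specified a default through the scheme parameter, the result reports https. Note also that without a scheme or a leading //, the host is not recognized as netloc and ends up in path instead.

What if the URL already carries a scheme?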

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

The result is as follows:

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

So the scheme parameter only takes effect when the URL itself carries no scheme. If the URL has one, the scheme parsed from the URL is returned.

  • allow_fragments controls whether the fragment is recognized. If it is set to False, the fragment is not split out; it is parsed as part of the path, params, or query, and fragment is left empty.

Here is an example:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)

Output:

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')

What if the URL contains neither params nor a query?

Here is another example:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)

Output:

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')

When the URL contains neither params nor a query, the fragment is parsed as part of the path.
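The two optional parameters can also be checked side by side. The sketch below is a small experiment rather than one of the original examples: it shows that a URL starting with // (but lacking a scheme) lets urlparse() recognize the host as netloc while scheme supplies the default, and that with the default allow_fragments=True the fragment is split out as usual.

```python
from urllib.parse import urlparse

# A leading // marks the start of netloc, so the host is no longer
# swallowed by path when the scheme is missing.
with_slashes = urlparse('//www.baidu.com/index.html', scheme='https')
print(with_slashes.scheme, with_slashes.netloc, with_slashes.path)

# allow_fragments defaults to True: the fragment is split from the path.
default = urlparse('http://www.baidu.com/index.html#comment')
no_fragments = urlparse('http://www.baidu.com/index.html#comment',
                        allow_fragments=False)
print(default.path, default.fragment)
print(no_fragments.path, repr(no_fragments.fragment))
```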

The returned ParseResult is actually a tuple, so its fields can be read either by index or by attribute name. For example:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result.scheme, result[0], result.netloc, result[1], sep='\n')

Here we read scheme and netloc both by attribute name and by index. The results are as follows:

http
http
www.baidu.com
www.baidu.com

The two approaches give the same results, and both work.
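Because the result is a tuple, it can also be unpacked into six variables in one step. The sketch below additionally reads the hostname and port attributes that ParseResult provides; the URL with port 8080 is a made-up example.

```python
from urllib.parse import urlparse

# Hypothetical URL with an explicit port, for illustration only.
result = urlparse('http://www.baidu.com:8080/index.html?id=5#comment')

# Tuple unpacking: one variable per field, in field order.
scheme, netloc, path, params, query, fragment = result
print(scheme, netloc, path, query, fragment)

# hostname and port are derived from netloc.
print(result.hostname, result.port)
```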

2. urlunparse()

urlparse() has a counterpart that does the opposite: urlunparse().

It accepts an iterable whose length must be exactly 6; otherwise it raises an error about too few or too many values.

Let's start with an example:

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

Here data is a list, but you could also use another type, such as a tuple or a specific data structure.

The result is as follows:

http://www.baidu.com/index.html;user?a=6#comment

In this way, we have successfully constructed a URL.
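Since urlunparse() accepts any iterable of length 6, the ParseResult returned by urlparse() can be passed to it directly. The sketch below round-trips a URL and then rebuilds it with a changed query, using the _replace() method that named tuples provide; the new query value is arbitrary.

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'
result = urlparse(url)

# Parsing and then unparsing gives back the original URL here.
rebuilt = urlunparse(result)
print(rebuilt == url)

# _replace() returns a copy with one field changed (arbitrary new value).
modified = urlunparse(result._replace(query='id=6'))
print(modified)
```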

3. urlsplit()

This method is very similar to urlparse(), except that it does not split out the params part and returns only five results. The params from the example above are merged into the path. For example:

from urllib.parse import urlsplit

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result)

Output:

SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')

The result is a SplitResult, which is also a tuple, so its fields can be read either by attribute or by index. For example:

from urllib.parse import urlsplit

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result.scheme, result[0])

Output:

http http

4. urlunsplit()

Like urlunparse(), this method combines the parts of a link into a complete link. It also takes an iterable, such as a list or tuple; the only difference is that the length must be 5.

For example:

from urllib.parse import urlunsplit

data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))

Output:

http://www.baidu.com/index.html?a=6#comment

This likewise assembles the parts into a link.
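A side-by-side comparison makes the difference between the two pairs of methods clearer. In this sketch, which reuses the example URL from earlier, urlparse() separates params while urlsplit() leaves it inside path, and both results can be turned back into the same URL.

```python
from urllib.parse import urlparse, urlsplit, urlunparse, urlunsplit

url = 'http://www.baidu.com/index.html;user?id=5#comment'
parsed = urlparse(url)   # six fields, params split out
split = urlsplit(url)    # five fields, params kept in path

print(len(parsed), parsed.path, parsed.params)
print(len(split), split.path)

# Each result goes back through its matching "un" function.
same = urlunparse(parsed) == urlunsplit(split) == url
print(same)
```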

5. urljoin()

With urlunparse() and urlunsplit() we can assemble links, but only from an object of a specific length, with every part of the link clearly separated.

There is another way to generate links: the urljoin() method. We pass a base_url (the base link) as the first argument and the new link as the second. The method analyzes the scheme, netloc, and path of base_url, uses them to fill in whatever is missing from the new link, and returns the result.

Here are a few examples:

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))

Output:

http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html?question=2
https://cuiqingcai.com/index.php
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2

base_url supplies three items: scheme, netloc, and path. If any of the three is missing from the new link, it is filled in from base_url; if the new link has it, the new link's own part is used. In these examples, the params, query, and fragment of base_url have no effect on the result.

With the functions above, we can easily parse, assemble, and generate links.
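In a crawler, urljoin() is typically used to turn the relative links found on a page into absolute ones. The following is a minimal sketch under that assumption; the page URL and the href values are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical page URL and hrefs, as a crawler might encounter them.
page_url = 'http://www.baidu.com/a/b.html'
hrefs = ['c.html', '../c.html', '/c.html', '//cuiqingcai.com/FAQ.html']

# Resolve each href against the page it was found on.
absolute = [urljoin(page_url, href) for href in hrefs]
for link in absolute:
    print(link)
```

A relative path is resolved against the directory of the page, a path starting with / replaces the whole path, and a link starting with // keeps only the scheme of the page.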

6. urlencode()

Next comes another common method, urlencode(), which is very useful for constructing GET request parameters. For example:

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

We first declare a dictionary holding the parameters, then call urlencode() to serialize it into GET request parameters.

Output:

http://www.baidu.com?name=germey&age=22

As you can see, the parameters are converted from a dictionary into GET request parameters.

This method is used very often. To make parameters easier to build, we sometimes hold them in a dictionary first; converting them into URL parameters then only takes a call to this method.
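urlencode() also percent-encodes values that are not URL-safe, and with doseq=True it expands a list value into repeated parameters. A minimal sketch; the parameter names and values here are illustrative only.

```python
from urllib.parse import urlencode

# Non-ASCII text is encoded as UTF-8 and percent-escaped; a space becomes '+'.
encoded = urlencode({'wd': '壁纸', 'q': 'a b'})
print(encoded)

# doseq=True turns each list element into its own key=value pair.
repeated = urlencode({'tag': ['python', 'crawler']}, doseq=True)
print(repeated)
```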

7. parse_qs()

Conversely, if we have a string of GET request parameters, the parse_qs() method turns it back into a dictionary. For example:

from urllib.parse import parse_qs

query = 'name=germey&age=22'
print(parse_qs(query))

Output:

{'name': ['germey'], 'age': ['22']}

As you can see, the string is converted back into a dictionary, with each value wrapped in a list.
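The values come back as lists because a parameter name may occur more than once in a query string. A short sketch with a made-up query shows this, along with the keep_blank_values option for parameters that have no value.

```python
from urllib.parse import parse_qs

# Made-up query string: 'tag' appears twice, 'page' has no value.
query = 'tag=python&tag=crawler&name=germey&page='

# Repeated names are collected into one list; blank values are dropped
# by default.
default = parse_qs(query)
print(default)

# keep_blank_values=True keeps 'page' with an empty string.
with_blanks = parse_qs(query, keep_blank_values=True)
print(with_blanks)
```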

8. parse_qsl()

There is also a parse_qsl() method that converts parameters into a list of tuples, as shown below.

from urllib.parse import parse_qsl

query = 'name=germey&age=22'
print(parse_qsl(query))

Output:

[('name', 'germey'), ('age', '22')]

As you can see, the result is a list in which each element is a tuple: the first item of the tuple is the parameter name, and the second is the parameter value.
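Because urlencode() accepts a sequence of two-item tuples as well as a dictionary, the output of parse_qsl() can be fed straight back into it. A minimal round-trip sketch; the changed age value is arbitrary.

```python
from urllib.parse import parse_qsl, urlencode

query = 'name=germey&age=22'
pairs = parse_qsl(query)

# Serializing the unchanged pairs reproduces the original query string.
round_trip = urlencode(pairs)
print(round_trip == query)

# Change one parameter (arbitrary new value), then serialize again.
updated = urlencode([(k, '23' if k == 'age' else v) for k, v in pairs])
print(updated)
```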

9. quote()

The quote() method converts content into URL-encoded form. When a URL carries Chinese parameters, the text can end up garbled, so we use this method to convert the Chinese characters into URL encoding. For example:

from urllib.parse import quote

keyword = '壁纸'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)

Here we declare a Chinese search term and then URL-encode it with the quote() method. The final result is as follows:

https://www.baidu.com/s?wd=%E…

In this way, the text is converted into URL encoding.
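By default quote() leaves the / character unescaped, which is controlled by its safe parameter, and the related quote_plus() encodes spaces as +. A minimal sketch; the sample strings are arbitrary.

```python
from urllib.parse import quote, quote_plus

# '/' is kept by default and a space becomes %20.
default = quote('a b/c')
print(default)

# With safe='' the '/' is escaped as well.
strict = quote('a b/c', safe='')
print(strict)

# quote_plus() writes a space as '+' and escapes '/'.
plus = quote_plus('a b/c')
print(plus)
```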

10. unquote()

Where there is quote(), there is naturally unquote(), which decodes a URL-encoded string. For example:

from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8'
print(unquote(url))

This is the URL-encoded result obtained above. Here we use the unquote() method to restore it. The result is as follows:

https://www.baidu.com/s?wd=壁纸

As you can see, unquote() makes decoding easy.
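quote() and unquote() undo each other for text like this, and the related unquote_plus() additionally turns + back into a space. A short sketch; the second sample string is arbitrary.

```python
from urllib.parse import quote, unquote, unquote_plus

keyword = '壁纸'
encoded = quote(keyword)

# Decoding the encoded text restores the original keyword.
restored = unquote(encoded)
print(restored == keyword)

# unquote() leaves '+' alone; unquote_plus() treats it as a space.
plain = unquote('a+b%2Fc')
spaced = unquote_plus('a+b%2Fc')
print(plain)
print(spaced)
```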

11. Conclusion

This section introduced some commonly used URL-handling methods of the parse module. With them, we can easily parse and construct URLs. It is worth getting thoroughly familiar with them.

