The BS4 library can locate the elements we want quickly because it first parses the HTML file into a tree, and different parsers produce different results. The following sections introduce them one by one.
Choosing a BS4 parser
The ultimate goal of a web crawler is to filter and select information from the web, and arguably its most important component is the parser. The quality of the parser determines the speed and efficiency of the crawler. Besides the 'html.parser' parser from Python's standard library, the BS4 library also supports several third-party parsers, and they do not all behave the same way.
The BS4 documentation recommends the lxml parser because it is faster, so we will use lxml here as well.
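To make the difference between parsers concrete, here is a minimal sketch that feeds the same invalid fragment to several parsers. The fragment and the helper name parse_with are made up for illustration; parsers that are not installed are skipped rather than treated as errors.

```python
from bs4 import BeautifulSoup, FeatureNotFound

BROKEN = "<a></p>"  # deliberately invalid: a stray closing </p>


def parse_with(parser_name):
    """Return the repaired markup as a string, or None if the parser is missing.

    parse_with is a hypothetical helper for this example, not part of bs4.
    """
    try:
        return str(BeautifulSoup(BROKEN, parser_name))
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed
        return None


results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}

for name, repaired in results.items():
    print(f"{name:12} -> {repaired}")
```

On a typical installation, html.parser keeps the fragment small, while lxml wraps it in html and body tags. This is why it is worth naming the parser explicitly instead of relying on whichever one happens to be installed.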
Installing the lxml parser:
- As before, use the pip tool to install it:
$ pip install lxml
Note that I use a Unix-like system, where pip is very convenient. On Windows, however, the installation sometimes runs into problems. In that case, we recommend that Windows users go to the official lxml site, download the installation package, and install the lxml build that matches their system version.
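If the installation may or may not have succeeded on a given machine, one possible approach is to check for lxml at run time and fall back to the built-in parser. The function name pick_parser below is an assumption for this sketch, not something provided by BS4.

```python
import importlib.util


def pick_parser(preferred="lxml", fallback="html.parser"):
    """Return `preferred` if that package can be found, else `fallback`.

    pick_parser is a hypothetical helper; bs4 does not ship it.
    """
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


parser = pick_parser()
print("Using parser:", parser)
```

The returned name can then be passed as the second argument to BeautifulSoup, so the rest of the crawler does not need to care which parser is available.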
Using the lxml parser to parse web pages
We again use the Alice document from the previous article as an example:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
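import bs4

# First, make a pot of soup from the HTML file, using the lxml parser
soup = bs4.BeautifulSoup(open('beautiful soup crawler/demo.html'), 'lxml')

# Output the result, which is a very clear tree structure
print(soup.prettify())
'''
OUT:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a href="http://example.com/elsie" class="sister" id="link1">
    Elsie
   </a>
   ,
   <a href="http://example.com/lacie" class="sister" id="link2">
    Lacie
   </a>
   and
   <a href="http://example.com/tillie" class="sister" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
'''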
How to use it?
The BS4 library first converts the incoming string or file handle to Unicode, so when we scrape Chinese text we do not run into troublesome encoding problems. Some less common encodings, such as "Big5", may not be detected correctly and need to be set manually:
soup = BeautifulSoup(markup, from_encoding="big5")  # pass the name of the document's encoding
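As a small illustration of setting the encoding by hand, the sketch below encodes a made-up fragment as Big5 bytes and then tells BS4 how to decode it. It uses the built-in html.parser so that it runs without lxml; the sample text is an assumption for this example.

```python
from bs4 import BeautifulSoup

# Simulate a page downloaded as raw bytes in the Big5 encoding
raw = "<p>中文</p>".encode("big5")

# from_encoding only applies to bytes input; it is ignored for str input
soup = BeautifulSoup(raw, "html.parser", from_encoding="big5")

text = soup.p.string
print(text)                     # the decoded Unicode text
print(soup.original_encoding)   # the encoding BS4 ended up using
```

Once parsed, every string in the tree is ordinary Unicode, so the rest of the crawler does not need to know what encoding the page was served in.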
Types of objects:
The BS4 library transforms a complex HTML document into a tree structure in which each node is a Python object. All objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.
Let's explain them one by one:
Tag: essentially the same as a tag in HTML; it can be used directly.
NavigableString: the string wrapped inside a tag.
BeautifulSoup: represents the whole content of a document. Most of the time it can be treated as a Tag object, and it supports the methods for traversing and searching the document tree.
Comment: a special kind of NavigableString. When it appears in an HTML document, it is output in a special format, namely as a comment.
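The four object types can be seen side by side in a short sketch. The markup here is invented for the example and parsed with the built-in html.parser.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

markup = "<b class='boldest'>Hello<!-- a comment --></b>"
soup = BeautifulSoup(markup, "html.parser")

tag = soup.b               # a Tag object
text = tag.contents[0]     # a NavigableString: the text inside the tag
comment = tag.contents[1]  # a Comment: a special NavigableString

print(type(soup).__name__)     # BeautifulSoup
print(type(tag).__name__)      # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment

# A Tag exposes its name and attributes; class is multi-valued, so it is a list
print(tag.name, tag["class"])
```

Because Comment is a subclass of NavigableString, code that collects all strings from a page may want to check isinstance(node, Comment) to leave comments out.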
The easiest way to search the document tree is to search for the name of the tag you want to get:
soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>
To go deeper and get a nested tag, chain the names. For example, to find the part under body that is wrapped in a b tag:
soup.body.b
# <b>The Dormouse's story</b>
But this method only finds the first matching tag in document order.
How do we get all the matching tags?
This is where the find_all() method comes in; it returns a list:
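tag = soup.find_all('a')
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
#  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>,
#  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>]
# Suppose we want the second element among the a tags:
need = tag[1]
# Easy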
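The difference between attribute-style access and find_all() can be sketched as follows. The snippet uses a shortened copy of the Alice markup and the built-in html.parser so that it runs on its own; the variable names are chosen for this example.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.a               # attribute access: only the first <a>
links = soup.find_all("a")   # every <a>, in document order
second = links[1]            # list indexing starts at 0

# Attributes of a tag are read like dictionary keys
hrefs = [a["href"] for a in links]

print(first["id"])
print(second.string)
print(hrefs)
```

Since the result behaves like a list, the usual tools (indexing, slicing, len(), list comprehensions) all work on it.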
The .contents attribute of a tag returns the tag's child nodes as a list:
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
head_tag.contents
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
print(title_tag)
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]
- In addition, the .children generator of a tag lets you loop over the tag's child nodes:
for child in title_tag.children:
    print(child)
# The Dormouse's story
- This way, only direct child nodes are traversed. How do we traverse descendant nodes?
Descendant nodes: for example, the only child node in head.contents is <title>The Dormouse's story</title>. The title itself also has a child node, the string "The Dormouse's story". That string is therefore a descendant node of head, and .descendants visits both of them:
for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
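To see the difference between children and descendants in numbers, here is a small self-contained sketch. It parses only the head part of the Alice document, with the built-in html.parser.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")
head_tag = soup.head

# .children yields direct children only; .descendants walks the whole subtree
children = list(head_tag.children)
descendants = list(head_tag.descendants)

print(len(children))     # direct children of <head>
print(len(descendants))  # <title> plus the string inside it

for node in descendants:
    # Tags have a name; strings report None for .name
    print(repr(node.name), "->", node)
```

Both attributes are generators, so wrapping them in list() is only needed when the nodes must be counted or indexed.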
How do we find all the text content under a tag?
1. If the tag has only one child node (of type NavigableString), tag.string returns it.
2. If a tag has many children and descendants, each containing a string, we can iterate over all of them with .strings:
for string in soup.strings:
    print(repr(string))
# "The Dormouse's story"
# '\n\n'
# "The Dormouse's story"
# '\n\n'
# 'Once upon a time there were three little sisters; and their names were\n'
# 'Elsie'
# ',\n'
# 'Lacie'
# ' and\n'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n\n'
# '...'
OK, that covers the basic use of the BS4 library. The remaining parts (parent nodes, sibling nodes, moving backward and forward) work much like finding elements through child nodes above.
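For reference, the two ways of extracting text described above can be compared in one short sketch, together with .stripped_strings, which skips whitespace-only strings and trims the rest. The fragment is a made-up, shortened variant of the story paragraph, parsed with the built-in html.parser.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="story">Once upon a time; '
    '<a id="link1">Elsie</a> and\n<a id="link2">Lacie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Case 1: a tag with a single string child, so .string returns it
single = soup.a.string

# Case 2: a tag with several children, so .string is None
multi = soup.p.string

# Iterate over every string below the tag instead
all_strings = list(soup.p.strings)
clean = list(soup.p.stripped_strings)

print(repr(single))
print(repr(multi))
print(all_strings)
print(clean)
```

When only the visible text matters, .stripped_strings is usually the more convenient choice because the newline and spacing noise is already removed.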
The text and pictures of this article are from the Internet and my own ideas. They are for study and communication only. They do not have any commercial use. The copyright belongs to the original author. If you have any questions, please contact us in time for handling.