How to use the BS4 parser correctly in a Python crawler

Time:2020-10-24

 

The reason the BS4 library can quickly locate the elements we want is that it first parses the HTML document into a tree, and different parsers behave differently. The options are introduced one by one below.

Choosing a BS4 parser

The ultimate goal of a web crawler is to filter and select information from the network, and arguably its most important component is the parser: the parser's quality determines the speed and efficiency of the crawler. Besides the built-in 'html.parser', the BS4 library supports many third-party parsers. Let's compare them.

The BS4 documentation officially recommends the lxml parser because it is more efficient, so we will use lxml here as well.

Installing the lxml parser:

  • As usual, pip does the job:

$ pip install lxml

Note: on Unix-like systems pip installs lxml smoothly, but on Windows the build step can run into problems. Windows users are advised to download a prebuilt installation package matching their system version from the official lxml site.
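To see why the parser choice matters, here is a minimal sketch comparing the built-in 'html.parser' with lxml on malformed HTML; the broken snippet is made up for illustration, and the try/except fallback is just a convenience for machines where lxml is not installed:

```python
from bs4 import BeautifulSoup

# A deliberately broken snippet: two unclosed <p> tags
snippet = "<p>One<p>Two"

# Python's built-in parser needs no extra installation
soup = BeautifulSoup(snippet, "html.parser")
print(len(soup.find_all("p")))  # 2 -- both tags are recovered

# lxml is usually faster and more lenient, but it is a third-party
# dependency, so fall back gracefully if it is not installed
try:
    print(BeautifulSoup(snippet, "lxml").prettify())
except Exception:
    print("lxml is not available; run: pip install lxml")
```

Different parsers may repair the broken markup into slightly different trees (lxml, for instance, wraps fragments in <html><body>), which is why the BS4 documentation advises picking one parser and sticking with it.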

Parsing a web page with the lxml parser

We will again use the Alice document from the previous article as an example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Try it:

import bs4

# First, brew a pot of soup: parse the HTML file with the lxml parser
# (demo.html holds a local copy of the Alice document above)
soup = bs4.BeautifulSoup(open('demo.html'), 'lxml')

# Print the result: a very clear tree structure
print(soup.prettify())
    
'''
OUT:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
'''

How to use it?

The BS4 library first converts the incoming string or file handle to Unicode, so that when we scrape Chinese text we do not run into troublesome encoding problems. Some obscure encodings, such as Big5, do need to be set by hand:

soup = BeautifulSoup(markup, from_encoding="big5")
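As a minimal sketch of this manual override, here is a hypothetical Big5-encoded snippet (the markup itself is made up for illustration) passed through the from_encoding parameter:

```python
from bs4 import BeautifulSoup

# A hypothetical snippet stored as Big5-encoded bytes
markup = "<p>你好</p>".encode("big5")

# Without help, the encoding sniffer may guess wrong; from_encoding
# tells BS4 exactly how to decode the bytes before parsing
soup = BeautifulSoup(markup, "html.parser", from_encoding="big5")
print(soup.p.string)  # 你好
```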

Kinds of objects:

The BS4 library transforms a complex HTML document into a tree of Python objects. Every node is one of four types: Tag, NavigableString, BeautifulSoup, or Comment.
Let's explain them one by one:

Tag: corresponds to a tag in the HTML document and is straightforward to work with.

NavigableString: the string wrapped inside a tag.

BeautifulSoup: represents the document as a whole. Most of the time it can be treated as a Tag object, and it supports the same methods for traversing and searching the document tree.

Comment: a special kind of NavigableString. When it appears in an HTML document, it is output in a special format, as the comment type.
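A short sketch can make the four types concrete; the sample markup below is made up for illustration, and Tag, NavigableString, and Comment are imported from bs4.element:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up markup containing all four kinds of nodes
soup = BeautifulSoup('<p class="title"><b>Hello</b><!--an HTML comment--></p>',
                     'html.parser')

print(type(soup).__name__)     # BeautifulSoup -- the whole document

tag = soup.p                   # Tag: behaves like the HTML tag itself
print(tag.name, tag['class'])  # p ['title']

text = soup.b.string           # NavigableString: the text inside <b>
print(isinstance(text, NavigableString))  # True

# Comments are NavigableStrings of a special subtype
comment = soup.find(string=lambda s: isinstance(s, Comment))
print(isinstance(comment, Comment), comment)  # True an HTML comment
```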

The easiest way to search the document tree is to ask for a tag by name:

soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>

To reach smaller tags deeper in the tree, chain the names. For example, to find the part of the body wrapped in a <b> tag:

soup.body.b
# <b>The Dormouse's story</b>

But this approach only finds the first matching tag in document order.

How about getting all of them?

That is what the find_all() method is for; it returns a list:

tag = soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Suppose we want the second element among the <a> tags:
need = tag[1]
# Easy
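Beyond plain tag names, find_all() also accepts attribute filters. A sketch using a stripped-down copy of the Alice document (only the second paragraph is reproduced, for brevity):

```python
from bs4 import BeautifulSoup

# Only the second paragraph of the Alice document
html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

tags = soup.find_all("a")
print(tags[1]["href"])  # http://example.com/lacie

# find_all() also filters by attributes; note class_ with a trailing
# underscore, because "class" is a Python keyword
print(soup.find_all(id="link3")[0].string)       # Tillie
print(len(soup.find_all("a", class_="sister")))  # 3
```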

A tag's .contents attribute returns its child nodes as a list:

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
print(title_tag)
# <title>The Dormouse's story</title>

title_tag.contents
# [u'The Dormouse's story']
  • In addition, a tag's .children generator lets you loop over its child nodes:

for child in title_tag.children:
    print(child)
    # The Dormouse's story
  • That only iterates over direct children, though. How do we traverse descendant nodes as well?

Descendant nodes: head.contents contains a single child, the <title> tag, but <title> itself also has a child node, the string "The Dormouse's story". That string is therefore also called a descendant of head, and the .descendants generator reaches it:

for child in head_tag.descendants:
    print(child)
    # <title>The Dormouse's story</title>
    # The Dormouse's story
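The difference is easy to verify by counting nodes; the snippet below rebuilds just the <head> of the Alice document for illustration:

```python
from bs4 import BeautifulSoup

# Just the <head> of the Alice document
soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

# .children yields direct children only: just the <title> tag
print(len(list(head_tag.children)))     # 1

# .descendants also yields grandchildren: the <title> tag AND its string
print(len(list(head_tag.descendants)))  # 2
```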

How do we find all the text content under a tag?

1. If the tag has only one child node (a NavigableString), tag.string gives it to you directly.

2. If the tag has many children and grandchildren, each with its own string, we can fetch them all by iteration:

for string in soup.strings:
    print(repr(string))
    # u"The Dormouse's story"
    # u'\n\n'
    # u"The Dormouse's story"
    # u'\n\n'
    # u'Once upon a time there were three little sisters; and their names were\n'
    # u'Elsie'
    # u',\n'
    # u'Lacie'
    # u' and\n'
    # u'Tillie'
    # u';\nand they lived at the bottom of a well.'
    # u'\n\n'
    # u'...'

That covers the basic use of the BS4 library. The remaining parts (parent nodes, sibling nodes, and moving backward and forward through the tree) work much like the child-node traversal shown above.
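Note that .strings also yields whitespace-only strings. Beautiful Soup additionally provides .stripped_strings, which strips each string and skips the empty ones; a minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup with whitespace inside and between tags
soup = BeautifulSoup("<p>  One  </p>\n\n<p>Two</p>", "html.parser")

# .strings yields every string, including whitespace-only ones
print([str(s) for s in soup.strings])  # ['  One  ', '\n\n', 'Two']

# .stripped_strings strips whitespace and skips empty results
print(list(soup.stripped_strings))     # ['One', 'Two']
```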


The text and images in this article come from the Internet and the author's own notes. They are for study and communication only, with no commercial use; the copyright belongs to the original authors. If you have any questions, please get in touch so they can be handled promptly.
