XPath: an HTML-parsing power tool (Python crawler series)

Time: 2020-09-25

In the previous article, we learned how to fetch a web page and download files. However, the page we fetched was unprocessed: it carried too much redundant information to be analyzed or used directly.

In this section, we will learn how to filter out the information we need from a web page.

When it comes to filtering information, regular expressions immediately come to mind, but we won't use them today. For crawlers, regular expressions are too complex and not beginner-friendly. They also tolerate change poorly: if the web page changes even slightly, the matching expressions have to be rewritten. On top of that, their readability is close to zero.

Of course, this is not to say that regular expressions are bad, just that they are not well suited to crawlers or to novices. In fact, regular expressions are very powerful; we will use them later for data cleaning.

Since regular expressions won't do, what should we use? Don't worry, Python provides many libraries for parsing HTML pages. Common ones include:

  • BeautifulSoup (from the bs4 package)
  • etree from lxml (an XPath parsing library)

BeautifulSoup works like jQuery's selectors: it finds elements by id, CSS selector, and tag name. XPath finds elements mainly through the nesting relationships of HTML nodes, a bit like a file path. For example:

# Get all tr tags under the table tag whose id is "tab"
path = '//table[@id="tab"]//tr'
# Compare with a file path
path = r'D:\Github\hexo\source\_posts'
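As a minimal sketch of the table path above (the table markup here is made up for illustration):

```python
from lxml import etree

# A small, made-up page with two tables
html = '''
<table id="tab">
    <tr><td>row 1</td></tr>
    <tr><td>row 2</td></tr>
</table>
<table id="other">
    <tr><td>skipped</td></tr>
</table>
'''
dom = etree.HTML(html)

# All tr tags under the table whose id is "tab";
# rows from the other table are not selected
rows = dom.xpath('//table[@id="tab"]//tr')
print(len(rows))  # 2
```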

BeautifulSoup and XPath are not inherently better or worse than each other. I cover XPath because I find it more convenient to use; I will cover BeautifulSoup later if time permits.

Now, let’s start with XPath!

2. The installation and use of XPath

  1. Install the lxml library

    pip install lxml
  2. Simple use

    Before using XPath, import the etree class and use it to process the raw HTML text; this yields an _Element object.

    We can then use XPath through this _Element object.

    # Import the etree class
    from lxml import etree


    # Example HTML text (the tags were stripped during publishing; this
    # is a reconstruction: five nested divs wrapping a single a tag)
    html = '''
            <div>
                <div>
                    <div>
                        <div>
                            <div>
                                <a href="#123">Click me</a>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
            '''

    # Process the HTML text to get an _Element object
    dom = etree.HTML(html)

    # Get the text under the a tag
    a_text = dom.xpath('//div/div/div/div/div/a/text()')

    print(a_text)

    Print results:

    ['Click me']

    Readers familiar with HTML know that everything in an HTML document is a node. The document itself is a document node, and it contains a tree of nodes, also known as the DOM tree.

    Nodes in the node tree have hierarchical relationships with each other.

    Terms such as parent, child, and sibling describe these relationships. A parent node has children; children at the same level are called siblings.

    • In the node tree, the top node is called the root
    • Every node has a parent, except for the root (it has no parent)
    • A node can have any number of children
    • Siblings are nodes that have the same parent

    From w3school: http://www.w3school.com.cn/htmldom/dom_nodes.asp
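These relationships can be inspected directly with lxml. A small sketch (the fragment is made up for illustration):

```python
from lxml import etree

# A div with two p children; etree.HTML adds the html/body wrapper
dom = etree.HTML('<div><p>one</p><p>two</p></div>')

body = dom.find('body')
div = body.find('div')
p1, p2 = div.findall('p')

# parent/child relationship
print(p1.getparent().tag)   # div
# siblings share the same parent
print(p1.getnext().tag)     # p
# the root has no parent
print(dom.getparent())      # None
```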

    In addition, the child node closest to a given node is called its direct child. In the figure below, body and head are direct children of html.

     
    [Figure: DOM tree, from w3school]

    After understanding the HTML structure, let’s look at the use of XPath.

    First, we used etree.HTML() to generate an _Element object. etree.HTML() treats the incoming text as an HTML document node, so we always get back an _Element object representing a document node.
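A quick sketch of that behavior: even a bare fragment comes back wrapped in a document rooted at html.

```python
from lxml import etree

# etree.HTML() treats any fragment as a full HTML document,
# adding the missing html/body wrapper automatically
dom = etree.HTML('<a href="#123">Click me</a>')

print(dom.tag)                  # html
print(dom.xpath('//a/text()'))  # ['Click me']
```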

  3. XPath syntax
    • A/B: '/' expresses a hierarchical relationship in XPath. A on the left is the parent node and B on the right is the child node; B must be a direct child of A

    • A//B: the double slash selects all B nodes under node A, whether or not they are direct children. In the example above, selecting the a tag can be written either way:

      a_text = dom.xpath('//div/div/div/div/div/a/text()')
      # Using //
      a_text = dom.xpath('//div//a/text()')
      # If there are two a tags under the div, both will be selected
      # (note that the two a tags are not necessarily siblings).
      # In the following example, both a tags are selected because
      # both are descendants of the div:
      # (markup reconstructed; the hrefs are assumed for illustration)
      '''
      <div>
          <div>
              <a href="#1233">Click me</a>
              <div>
                  <div>
                      <a href="#12333">Click me</a>
                  </div>
              </div>
          </div>
      </div>
      '''
    • [@]: select nodes that carry a given attribute

      • //div[@class], //a[@x]: select div nodes that have a class attribute, and a nodes that have an x attribute
      • //div[@class="container"]: select div nodes whose class attribute equals "container"
    • //a[contains(text(), "Click")]: select a tags whose text contains "Click", such as the two a tags in the example above

    • //a[contains(@id, "abc")]: select a tags whose id attribute contains abc. For example:

      # Both of these XPath rules select the two a tags in the example above
      path = '//a[contains(@href, "#123")]'
      path = '//a[contains(@href, "#1233")]'
    • //a[contains(@y, "x")]: select a tags that have a y attribute whose value contains x
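To tie the syntax together, here is a runnable sketch exercising /, //, and the predicate forms above (the HTML fragment, including its class, id, and href values, is made up for illustration):

```python
from lxml import etree

# Made-up fragment exercising the rules above
html = '''
<div class="container">
    <a id="abc-1" href="#123">Click me</a>
    <p><a href="#999">Other link</a></p>
</div>
'''
dom = etree.HTML(html)

# '/': direct children only
direct = dom.xpath('//div/a/text()')        # ['Click me']
# '//': all descendants, at any depth
descendants = dom.xpath('//div//a/text()')  # ['Click me', 'Other link']

# [@class="container"]: exact attribute match
containers = dom.xpath('//div[@class="container"]')
# contains(@id, "abc"): substring match on an attribute value
by_id = dom.xpath('//a[contains(@id, "abc")]/text()')
# contains(text(), "Click"): substring match on text content
by_text = dom.xpath('//a[contains(text(), "Click")]/text()')

print(direct, descendants)
print(len(containers), by_id, by_text)
```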

Summary

  1. An HTML document must be processed with etree before XPath can be used
  2. Every object in the HTML DOM tree is a node, including text, so text() actually selects the text nodes under a tag
  3. XPath is used through the xpath method of the _Element object
  4. Note that _Element.xpath(path) always returns a list
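Point 4 is worth verifying directly: xpath() always returns a list, whether it matches one node or none.

```python
from lxml import etree

dom = etree.HTML('<p>hello</p>')

# Even a single match comes back wrapped in a list
texts = dom.xpath('//p/text()')
print(texts)    # ['hello']

# A non-matching path returns an empty list, not None
missing = dom.xpath('//span/text()')
print(missing)  # []
```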

If you have any questions, please feel free to leave a comment.

The text and images in this article come from the Internet and my own notes. They are for study and exchange only, with no commercial use; copyright belongs to the original authors. If there are any issues, please get in touch promptly so they can be addressed.

 
 
