In the previous article, we learned how to fetch a web page and download files. However, the page we obtained was raw HTML: it contains far too much redundant information to be analyzed or used directly.
In this article, we will learn how to extract the information we need from a web page.
When it comes to extracting information, regular expressions immediately come to mind, but we won't use them today. For crawlers, regular expressions are too complex and unfriendly to beginners. They also tolerate change poorly: if the web page changes even slightly, the matching expressions often have to be rewritten. On top of that, their readability is close to zero.
Of course, this is not to say that regular expressions are bad, only that they are ill-suited to crawlers and beginners. Regular expressions are in fact very powerful, and we will use them later for data cleaning.
If regular expressions are out, what should we use? Don't worry: Python provides many libraries for parsing HTML pages. Common ones include:
- BeautifulSoup, from the bs4 package
- etree, from the lxml package (an XPath parsing library)
BeautifulSoup works like jQuery's selectors: it finds elements by ID, CSS selector, or tag name. XPath instead finds elements through the nesting relationships of HTML nodes, which is somewhat like a file path. For example:
```python
# Get all tr tags under the table tag whose id is "tab"
path = '//table[@id="tab"]//tr'

# Compare with a file path
path = r'D:\Github\hexo\source\_posts'
```
BeautifulSoup and XPath are neither better nor worse than each other. I cover XPath here because I find it easier to use; I will cover BeautifulSoup later if time permits.
Now, let’s start with XPath!
2. Installing and using XPath
Install the lxml library:

```shell
pip install lxml
```
Before using XPath, import the etree class and use it to process the raw HTML text. This yields an _Element object, and we use XPath through that _Element object:

```python
# Import the etree class
from lxml import etree

# HTML text as an example (the markup here is reconstructed:
# five nested divs around an a tag, matching the path below)
html = '''
<div>
    <div>
        <div>
            <div>
                <div>
                    <a href="#">Click me</a>
                </div>
            </div>
        </div>
    </div>
</div>
'''

# Process the HTML text to get an _Element object
dom = etree.HTML(html)

# Get the text under the a tag
a_text = dom.xpath('//div/div/div/div/div/a/text()')
print(a_text)  # ['Click me']
```
Those familiar with HTML know that every tag in HTML is a node. An HTML document is a document node, and a document node contains a node tree, also known as the DOM tree.

Nodes in the node tree have hierarchical relationships with each other, described with terms such as parent, child, and sibling: a parent node has child nodes, and child nodes that share the same parent are siblings.
- In the node tree, the top node is called the root
- Every node has a parent, except for the root (it has no parent)
- A node can have any number of children
- Siblings are nodes that have the same parent
In addition, the child nodes nearest to a node are called its direct children; for example, body and head are direct children of the html node. (The original article illustrated this with the "HTML DOM tree" figure from w3school.)
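These parent/child/sibling relationships can be inspected directly from lxml. A minimal sketch, using a tiny hand-written document of my own:

```python
from lxml import etree

# A minimal document: html is the root; head and body are its
# direct children and therefore siblings of each other.
html = '<html><head><title>Demo</title></head><body><p>Hi</p></body></html>'
dom = etree.HTML(html)

head = dom.find('head')
body = dom.find('body')

# body's parent is the root html node
print(body.getparent().tag)  # html

# head's next sibling is body
print(head.getnext().tag)    # body
```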
After understanding the HTML structure, let’s look at the use of XPath.
First, we pass the text to etree.HTML(), which processes it as an HTML document node and generates an _Element object. This way, we always get an _Element object representing a document node.
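A quick sketch of this behavior: even a bare fragment gets wrapped into a full document, so the returned _Element is the document's root html node.

```python
from lxml import etree

# Pass a fragment, not a full page
dom = etree.HTML('<p>hello</p>')

# We still get an _Element back, rooted at a generated html node
print(type(dom).__name__)  # _Element
print(dom.tag)             # html
```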
A/B: the single slash '/' expresses one level of hierarchy in XPath. A on the left is the parent node, B on the right is the child node, and B must be a direct child of A.
A//B: the double slash selects all B nodes under node A, whether or not they are direct children. In the example above, the a tag can therefore be selected in either of these ways:
```python
a_text = dom.xpath('//div/div/div/div/div/a/text()')

# Using //
a_text = dom.xpath('//div//a/text()')

# If there are two a tags under a div, both will be selected
# (note that the two a tags are not necessarily sibling nodes).
# In the following example, both a tags are selected because
# both are descendants of a div:
html = '''
<div>
    <div>
        <a href="#123">Click me</a>
    </div>
    <a href="#1233">Click me</a>
</div>
'''
```
[@]: select nodes that have a given attribute

- //div[@class], //a[@x]: select div nodes that have a class attribute, and a nodes that have an x attribute
- //div[@class="container"]: select div nodes whose class attribute value is container
- //a[contains(text(), "Click")]: select a tags whose text contains "Click", such as the two a tags in the example above
- //a[contains(@id, "abc")]: select a tags whose id attribute contains "abc". For example:
```python
# With hrefs "#123" and "#1233" as in the example above:
path = '//a[contains(@href, "#123")]'   # matches both a tags
path = '//a[contains(@href, "#1233")]'  # matches only the second a tag
```
- //a[contains(@y, "x")]: select a tags that have a y attribute whose value contains x
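To make these predicates concrete, here is a small sketch; the HTML snippet and its id/href values are my own hypothetical example:

```python
from lxml import etree

# Hypothetical snippet: one div with a class attribute, two links
# whose id and href values differ
html = '''
<div class="container">
    <a id="abc-1" href="#123">Click me</a>
    <a id="xyz-2" href="#1233">Click me too</a>
</div>
'''
dom = etree.HTML(html)

# div nodes that have a class attribute / whose class is "container"
print(len(dom.xpath('//div[@class]')))                # 1
print(len(dom.xpath('//div[@class="container"]')))    # 1

# href containing "#123" matches both links ("#1233" contains "#123")
print(len(dom.xpath('//a[contains(@href, "#123")]')))   # 2

# href containing "#1233" matches only the second link
print(len(dom.xpath('//a[contains(@href, "#1233")]')))  # 1

# id containing "abc" matches only the first link
print(dom.xpath('//a[contains(@id, "abc")]/text()'))    # ['Click me']

# text containing "Click" matches both links
print(len(dom.xpath('//a[contains(text(), "Click")]'))) # 2
```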
- HTML text must be processed with etree.HTML() before XPath can be used on it
- Everything in the HTML DOM tree is a node, including text, so text() actually selects the text nodes under a tag
- XPath is used through the xpath() method of an _Element object
- Note that _Element.xpath(path) always returns a list
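The last point is worth a quick demonstration; a minimal sketch with a made-up one-line document:

```python
from lxml import etree

html = '<div><a href="#">One</a></div>'
dom = etree.HTML(html)

# Even a single match comes back wrapped in a list
result = dom.xpath('//a/text()')
print(result)     # ['One']
print(result[0])  # One

# No match gives an empty list, not None, so check before indexing
missing = dom.xpath('//span/text()')
print(missing)    # []
```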
If you have any questions, please feel free to leave a comment.
The text and images in this article come from the Internet and from my own notes. They are for study and exchange only and have no commercial use; copyright belongs to the original authors. If you have any concerns, please contact us promptly.