09 XPath language Python crawler

Time:2021-1-14

XPath language

XPath (XML Path Language) is a language for locating parts of an XML document.

Learning purpose

After converting the HTML into an XML document, use XPath to find HTML nodes or elements.

For example, ‘/’ separates one level of the path from the next. The first ‘/’ represents the root node of the document (note that this is the document itself, not the outermost tag of the document).

For an HTML file, the outermost element is therefore selected with “/html”.

XPath development tool

  • XMLQuire, an open source tool for editing XPath expressions
  • XPath Helper, a Chrome plug-in
  • The browser console: enter $x("XPath expression") directly
  • XPath Checker, a Firefox plug-in

XPath syntax reference document:

http://www.w3school.com.cn/xp…

XPath syntax

XPath is a language for finding information in XML documents.

XPath can be used to traverse elements and attributes in an XML document.

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>

Select node

XPath uses path expressions to select nodes in an XML document. Nodes are selected by following a path or steps.

The most useful path expressions are listed below:

Expression    Description
/             Select from the root node.
nodename      Select all child nodes of the named node.
//            Select matching nodes anywhere below the current node, regardless of their position.
.             Select the current node.
..            Select the parent of the current node.
@             Select attributes.

Example

In the following table, we have listed some path expressions and their results:

Path expression    Result
/bookstore         Select the root element bookstore. Note: if the path starts with a forward slash (/), it always represents the absolute path to an element!
bookstore          Select all nodes named bookstore relative to the current node (the document root by default).
bookstore/book     Select all book elements that are children of bookstore.
//book             Select all book elements regardless of their location in the document.
//book/./title     Select the title children of all book elements; the "." step stays at the current node.
//price/..         Select the parent of every price element (here, the book elements).
bookstore//book    Select all book elements that are descendants of the bookstore element, regardless of where they are under bookstore.
//@lang            Select all attributes named lang.
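
The path expressions above can be tried from Python. The following is a minimal sketch that assumes the standard library's xml.etree.ElementTree as the parser (the article does not name one at this point). ElementTree implements only a subset of XPath and runs queries relative to an element, so absolute paths are rewritten to start at the bookstore root element.

```python
import xml.etree.ElementTree as ET

# The bookstore document from above, without the XML declaration.
XML = """<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""

root = ET.fromstring(XML)  # root is the <bookstore> element

# bookstore/book: book children of bookstore (root already is bookstore)
books = root.findall("book")

# //book/title: ElementTree needs a leading "." for descendant searches
titles = [t.text for t in root.findall(".//book/title")]

# //price/..: the parent element of every price element
parents = root.findall(".//price/..")

# //@lang: ElementTree cannot return attribute nodes, so read the
# attribute from every element that carries it
langs = [e.get("lang") for e in root.findall(".//*[@lang]")]

print(len(books), titles, [p.tag for p in parents], langs)
```

A full XPath 1.0 engine would accept the expressions from the table unchanged; the rewrites here exist only because of ElementTree's limited dialect.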


Predicate conditions

  • A predicate is used to find a specific node, or a node that contains a specified value.
  • A “predicate condition” is an additional condition attached to a path expression.
  • Predicates are written in square brackets “[]” and further filter the selected nodes.

Example

In the following table, we list some path expressions with predicates and the results of the expressions:

Path expression                       Result
/bookstore/book[1]                    Select the first book element that is a child of bookstore.
/bookstore/book[last()]               Select the last book element that is a child of bookstore.
/bookstore/book[last()-1]             Select the penultimate book element that is a child of bookstore.
/bookstore/book[position()<3]         Select the first two book elements that are children of bookstore.
//title[@lang]                        Select all title elements that have an attribute named lang.
//title[@lang='eng']                  Select all title elements that have a lang attribute with the value eng.
//book[price]                         Select all book elements that have a price child element.
/bookstore/book[price>35.00]          Select all book elements under bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title    Select the title elements of all book elements under bookstore whose price element has a value greater than 35.00.
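
A short sketch of these predicates in Python follows, again assuming the standard library's xml.etree.ElementTree. It supports positional and attribute predicates but not numeric comparisons such as price>35.00, so that filter is done in Python here rather than in the path.

```python
import xml.etree.ElementTree as ET

XML = """<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""

root = ET.fromstring(XML)  # the <bookstore> element

# /bookstore/book[1] and /bookstore/book[last()]
first_title = root.find("book[1]/title").text
last_title = root.find("book[last()]/title").text

# //title[@lang='eng']
eng_titles = root.findall(".//title[@lang='eng']")

# //book[price]
books_with_price = root.findall(".//book[price]")

# /bookstore/book[price>35.00]/title, with the comparison done in Python
# because ElementTree's XPath subset has no numeric operators
expensive_titles = [
    book.findtext("title")
    for book in root.findall("book")
    if float(book.findtext("price")) > 35.00
]

print(first_title, last_title, expensive_titles)
```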


Select unknown node

XPath wildcards can be used to select unknown XML elements.

Wildcard    Description
*           Match any element node.
@*          Match any attribute node.

Example

In the following table, we list some path expressions and their results:

Path expression    Result
/bookstore/*       Select all the child elements of the bookstore element.
//*                Select all elements in the document.
//title[@*]        Select all title elements that have at least one attribute.
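
The wildcard expressions can be approximated in Python as sketched below. This assumes the standard library's xml.etree.ElementTree, which understands * but not @*, so the attribute wildcard is replaced by a check on each element's attribute dictionary.

```python
import xml.etree.ElementTree as ET

XML = """<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""

root = ET.fromstring(XML)  # the <bookstore> element

# /bookstore/*: every child element of bookstore
children = root.findall("*")

# //*: root.iter() walks the root element and all of its descendants
all_elements = list(root.iter())

# //title[@*]: title elements with at least one attribute
titles_with_attrs = [t for t in root.iter("title") if t.attrib]

print([c.tag for c in children], len(all_elements), len(titles_with_attrs))
```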

Select several paths

You can select several paths by using the “|” operator in the path expression.

Example

In the following table, we list some path expressions and their results:

Path expression                      Result
//book/title | //book/price          Select all the title and price elements of the book elements.
//title | //price                    Select all the title and price elements in the document.
/bookstore/book/title | //price      Select all the title elements of the book elements under bookstore, and all the price elements in the document.
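
The standard library's xml.etree.ElementTree does not implement the “|” operator, so the sketch below emulates //title | //price by filtering on tag names during a single document-order walk. This is a substitute technique chosen for the example, not the union operator itself.

```python
import xml.etree.ElementTree as ET

XML = """<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>"""

root = ET.fromstring(XML)

# Emulates //title | //price: one pass in document order, so each
# matching element appears once and in the order it occurs
wanted = {"title", "price"}
union = [e for e in root.iter() if e.tag in wanted]

print([(e.tag, e.text) for e in union])
```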

Advanced usage of XPath

  • Fuzzy query with contains()

Many web frameworks now generate element IDs dynamically, so the ID changes each time the same page is loaded, which complicates automated testing.

<div class="eleWrapper"> <input type="text" class="textfield" name="ID9sLJQnkQyLGLhYShhlJ6gPzHLgvhpKpLzp2Tyh4hyb1b4pnvzxFR!-166749344!1357374592067" id="nt1357374592068" /> </div>

The solution is to match on the stable part of the ID with XPath's contains() function: //input[contains(@id,'nt')]

  • XML used for testing

<Root>
    <Person ID="1001">
        <Name lang="zh-cn">Zhang Chengbin</Name>
        <Email xmlns="www.quicklearn.cn">[email protected]</Email>
        <Blog>http://cbcye.cnblogs.com</Blog>
    </Person>
    <Person ID="1002">
        <Name lang="en">Gary Zhang</Name>
        <Email xmlns="www.quicklearn.cn">[email protected]</Email>
        <Blog>http://www.quicklearn.cn</Blog>
    </Person>
</Root>

  • Query all Person nodes whose Blog value contains the string 'cn'

XPath expression:

/Root//Person[contains(Blog,'cn')]

  • Query all Person nodes whose Blog value contains 'cn' and whose ID attribute contains '01'

XPath expression:

/Root//Person[contains(Blog,'cn') and contains(@ID,'01')]
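
To show what the two contains() queries select, here is a minimal Python sketch. It assumes the standard library's xml.etree.ElementTree, which has no contains() function, so the substring tests are written as Python "in" checks. The helper name persons_matching is hypothetical, and the test XML is a trimmed copy without the Email elements.

```python
import xml.etree.ElementTree as ET

XML = """<Root>
  <Person ID="1001">
    <Name lang="zh-cn">Zhang Chengbin</Name>
    <Blog>http://cbcye.cnblogs.com</Blog>
  </Person>
  <Person ID="1002">
    <Name lang="en">Gary Zhang</Name>
    <Blog>http://www.quicklearn.cn</Blog>
  </Person>
</Root>"""

root = ET.fromstring(XML)


def persons_matching(blog_part, id_part=""):
    """Emulate /Root//Person[contains(Blog, ...) and contains(@ID, ...)].

    Returns the ID attribute of every matching Person. An empty id_part
    matches everything, as contains(x, '') does in XPath.
    """
    return [
        person.get("ID")
        for person in root.iter("Person")
        if blog_part in person.findtext("Blog", default="")
        and id_part in person.get("ID", "")
    ]


by_blog = persons_matching("cn")
by_blog_and_id = persons_matching("cn", "01")
print(by_blog, by_blog_and_id)
```

Both Blog values contain 'cn', so the first query matches both Person nodes; only the ID 1001 also contains '01'.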

Study notes

1. Locate by the node's own attributes or text

//td[text()='data import']
//div[contains(@class,'cux-rightArrowIcon-on')]
//a[text()='register now']

//input[@type='radio' and @value='1']
Multiple conditions

//span[@name='bruce'][text()='bruce1'][1]
Multiple conditions

//span[@id='bruce1' or text()='bruce2']
Find multiple

//span[text()='bruce1' and text()='bruce2']

2. Rely on the parent node to locate

//div[@class='x-grid-col-name x-grid-cell-inner']/div
//div[@id='dynamicGridTestInstanceformclearuxformdiv']/div
//div[@id='test']/input

3. Rely on child nodes to locate

//div[div[@id='navigation']]
//div[div[@name='listType']]
//div[p[@name='testname']]

4. Mixed type

//div[div[@name='listType']]//img
//td[a/font[contains(text(),'seleium2 video from zero')]]//input[@type='checkbox']

5. Advanced part

//input[@id='123']/following-sibling::input
Find the following sibling input nodes

//input[@id='123']/preceding-sibling::span
Find the preceding sibling span nodes

//input[starts-with(@id,'123')]
Find input nodes whose id starts with '123'

//span[not(contains(text(),'xpath'))]
Find span nodes whose text does not contain 'xpath'

6. Index

//div/input[2]
//div[@id='position']/span[3]
//div[@id='position']/span[position()=3]
//div[@id='position']/span[position()>3]
//div[@id='position']/span[position()<3]
//div[@id='position']/span[last()]
//div[@id='position']/span[last()-1]
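
The index expressions can be checked with a small Python sketch. It assumes the standard library's xml.etree.ElementTree and a made-up div with five span children. ElementTree handles [3], [last()] and [last()-1], but not position() comparisons, which are replaced by list slicing here.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for illustration: five spans under div#position
root = ET.fromstring(
    '<body><div id="position">'
    "<span>a</span><span>b</span><span>c</span><span>d</span><span>e</span>"
    "</div></body>"
)
base = ".//div[@id='position']"

third = root.find(base + "/span[3]").text              # span[3]
last = root.find(base + "/span[last()]").text          # span[last()]
second_last = root.find(base + "/span[last()-1]").text  # span[last()-1]

spans = root.findall(base + "/span")
after_third = [s.text for s in spans[3:]]  # span[position()>3]
first_two = [s.text for s in spans[:2]]    # span[position()<3]

print(third, last, second_last, after_third, first_two)
```

XPath positions start at 1 while Python list indexes start at 0, which is why position()>3 becomes the slice [3:].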

7. Substring matching

<div data-for="result" id="swfEveryCookieWrap"></div>

//*[substring(@id,4,5)='Every']/@id
Takes 5 characters of the attribute starting at position 4 (XPath positions start at 1)

//*[substring(@id,4)='EveryCookieWrap']
Takes the attribute from position 4 to the last character

//*[substring-before(@id,'C')='swfEvery']/@id
Matches the part of the attribute before 'C'

//*[substring-after(@id,'C')='ookieWrap']/@id
Matches the part of the attribute after 'C'
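
The 1-based positions used by substring() are easy to get wrong, so the sketch below mirrors the four functions with plain Python string operations. The helper xpath_substring is hypothetical and covers only positive integer arguments, not the full rounding rules of XPath.

```python
def xpath_substring(s, start, length=None):
    """Mimic XPath substring() for positive integers (start is 1-based)."""
    begin = start - 1  # convert the XPath position to a Python index
    return s[begin:] if length is None else s[begin:begin + length]


element_id = "swfEveryCookieWrap"

middle = xpath_substring(element_id, 4, 5)  # substring(@id,4,5)
tail = xpath_substring(element_id, 4)       # substring(@id,4)

# substring-before / substring-after split on the first 'C';
# this shortcut is only equivalent when 'C' actually occurs in the string
before, _, after = element_id.partition("C")

print(middle, tail, before, after)
```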

8. Wildcard *

//span[@*='bruce']
//*[@name='bruce']

9. Axes

//div[span[text()='+++current node']]/parent::div
Find the parent node

//div[span[text()='+++current node']]/ancestor::div
Find the ancestor nodes

10. Descendant nodes

//div[span[text()='current note']]/descendant::div/span[text()='123']
//div[span[text()='current note']]//div/span[text()='123']
The two expressions have the same meaning

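11. following and preceding

//span[@class="fk fk_cur"]/../following::a
All a elements that come after it in the document

//span[@class="fk fk_cur"]/../preceding::a[1]
The nearest a element that comes before it (drop [1] to get all preceding a elements)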

XPath extracts text under multiple tags

When writing crawlers, you often use XPath to extract data. Take the following code:

<div id="test1">Hello, everyone!</div>

Extracting this with XPath is very convenient. Suppose the source code of the web page is held in selector:

data = selector.xpath('//div[@id="test1"]/text()').extract()[0]

This extracts “Hello, everyone!” into the data variable.

But what if you run into the following code?

<div id="test2">beauty, <font color=red>how much is your wechat?</font></div>

If you use:

data = selector.xpath('//div[@id="test2"]/text()').extract()[0]

only “beauty,” is extracted.

If you use:

data = selector.xpath('//div[@id="test2"]/font/text()').extract()[0]

only “how much is your wechat?” is extracted.

The goal, however, is to extract the whole sentence “beauty, how much is your wechat?”.

<div id="test3">
    I'm Zuo Qinglong,
    <span id="tiger">right white tiger,
    <ul>on the rosefinch,
        <li>Go down to Xuanwu.</li>
    </ul>Lao Niu is in the middle</span>
    The tap is in the chest.
</div>

Moreover, the internal tags are not fixed. Given 100 pieces of similar HTML code, how can XPath extract the text from all of them quickly and conveniently?

Use XPath's string(.)

Take the third snippet as an example:

data = selector.xpath('//div[@id="test3"]')
info = data.xpath('string(.)').extract()[0]

In this way, all the text inside the div, from “I'm Zuo Qinglong,” through “The tap is in the chest.”, is extracted as one string and assigned to the info variable.
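
The same idea can be tried without a crawler framework. The sketch below assumes the standard library's xml.etree.ElementTree instead of the selector object used above, and uses itertext(), which gathers the text of an element and all its descendants much as string(.) does. The helper name full_text is hypothetical, and the markup is written as well-formed XML so that ElementTree can parse it.

```python
import xml.etree.ElementTree as ET

HTML = """<body>
<div id="test2">beauty, <font color="red">how much is your wechat?</font></div>
<div id="test3">
    I'm Zuo Qinglong,
    <span id="tiger">right white tiger,
    <ul>on the rosefinch,
        <li>Go down to Xuanwu.</li>
    </ul>Lao Niu is in the middle</span>
    The tap is in the chest.
</div>
</body>"""

root = ET.fromstring(HTML)


def full_text(element_id):
    """Join all text under the div with this id and collapse whitespace."""
    div = root.find(f".//div[@id='{element_id}']")
    return " ".join("".join(div.itertext()).split())


# .text only holds the text before the first child tag, like /text()[0]
direct_text = root.find(".//div[@id='test2']").text

test2 = full_text("test2")
test3 = full_text("test3")
print(repr(direct_text))
print(test2)
print(test3)
```

Collapsing whitespace is a choice made for this example; string(.) itself returns the text with its original line breaks and indentation.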
