XPath language
XPath (XML Path Language) is a language for locating parts of an XML document.
Learning purpose
After transforming the HTML into an XML document, use XPath to find HTML nodes or elements.
For example, "/" separates one level of the hierarchy from the next. A leading "/" represents the root node of the document (note that this is the document itself, not the outermost tag of the document).
For an HTML file, for example, the outermost element is "/html".
XPath development tools
- XMLQuire, an open source tool for editing XPath expressions
- XPath Helper, a Chrome plug-in
- The browser console: enter $x("XPath expression") directly
- XPath Checker, a Firefox plug-in
XPath syntax reference document:
http://www.w3school.com.cn/xp…
XPath syntax
XPath is a language for finding information in XML documents.
XPath can be used to traverse elements and attributes in an XML document.
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>
Select node
XPath uses path expressions to select nodes in an XML document. Nodes are selected by following a path or a sequence of steps.
The most useful path expressions are listed below:
Expression | Description |
---|---|
/ | Selects from the root node. |
nodename | Selects all child elements of the current node that have this name. |
// | Selects matching nodes anywhere below the current node, regardless of their position. |
. | Selects the current node. |
.. | Selects the parent of the current node. |
@ | Selects attributes. |
Example
In the following table, we list some path expressions and their results:
Path expression | Result |
---|---|
/bookstore | Selects the root element bookstore. Note: a path that starts with a forward slash (/) is always an absolute path to an element. |
bookstore | Selects all child elements named bookstore of the current node. When the current node is the document root, this is the root element. |
bookstore/book | Selects all book elements that are children of bookstore. |
//book | Selects all book elements, regardless of their location in the document. |
//book/./title | Selects the title children of every book element; the "." step stays on the current book node. |
//price/.. | Selects the parent of every price element (here, the book elements). |
bookstore//book | Selects all book elements that are descendants of the bookstore element, regardless of where they are beneath it. |
//@lang | Selects all attributes named lang. |
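The path expressions above can be tried out with a short script. This is only a sketch: it uses Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath (relative paths only, no attribute-node steps), so each expression is written relative to the bookstore root element. The inline XML is a copy of the sample document, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Copy of the bookstore sample document (XML declaration omitted).
XML = """<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

root = ET.fromstring(XML)  # root is the <bookstore> element

# "book": child elements named book of the current node.
books = root.findall("book")

# ".//title": title elements at any depth below the current node.
titles = [t.text for t in root.findall(".//title")]

# ".//price/..": the parent of every price element.
parents = [p.tag for p in root.findall(".//price/..")]

# ElementTree cannot return attribute nodes, so select the elements
# that carry a lang attribute and read the value from each one.
langs = [t.get("lang") for t in root.findall(".//title[@lang]")]

print(len(books), titles, parents, langs)
```

A full XPath 1.0 engine would accept the absolute forms such as /bookstore/book and //@lang directly; the relative forms here are a concession to ElementTree.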
Predicate conditions
- A predicate is used to find a specific node or a node that contains a specified value.
- A "predicate condition" is an additional condition attached to a path expression.
- Predicates are written inside square brackets "[]" and further filter the selected nodes.
Example
In the following table, we list some path expressions with predicates and the results of the expressions:
Path expression | Result |
---|---|
/bookstore/book[1] | Selects the first book element that is a child of bookstore. |
/bookstore/book[last()] | Selects the last book element that is a child of bookstore. |
/bookstore/book[last()-1] | Selects the second-to-last book element that is a child of bookstore. |
/bookstore/book[position()<3] | Selects the first two book elements that are children of bookstore. |
//title[@lang] | Selects all title elements that have an attribute named lang. |
//title[@lang='eng'] | Selects all title elements whose lang attribute has the value eng. |
//book[price] | Selects all book elements that have a price child element. |
/bookstore/book[price>35.00] | Selects all book children of bookstore whose price element has a value greater than 35.00. |
/bookstore/book[price>35.00]/title | Selects the title elements of all book children of bookstore whose price element has a value greater than 35.00. |
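As a rough illustration, several of these predicates can be run with Python's standard xml.etree.ElementTree. It supports positional predicates, last(), attribute tests and child-element tests, but not comparisons such as price>35.00, so that condition is applied in Python here. The inline XML copies the bookstore sample, and all variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Copy of the bookstore sample document.
XML = """<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

root = ET.fromstring(XML)  # the <bookstore> element

# book[1] and book[last()]: first and last book child.
first_title = root.find("book[1]/title").text
last_title = root.find("book[last()]/title").text

# book[last()-1]: second-to-last book (the first one, with two books).
penultimate_title = root.find("book[last()-1]/title").text

# title[@lang='eng']: title elements whose lang attribute equals eng.
eng_titles = [t.text for t in root.findall(".//title[@lang='eng']")]

# book[price]: books that have a price child. The numeric test
# price>35.00 is not supported by ElementTree, so filter in Python.
expensive = [
    b.find("title").text
    for b in root.findall("book[price]")
    if float(b.find("price").text) > 35.00
]

print(first_title, last_title, penultimate_title, eng_titles, expensive)
```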
Select unknown node
XPath wildcards can be used to select unknown XML elements.
Wildcard | Description |
---|---|
* | Matches any element node. |
@* | Matches any attribute node. |
Example
In the following table, we list some path expressions and their results:
Path expression | Result |
---|---|
/bookstore/* | Selects all child elements of the bookstore element. |
//* | Selects all elements in the document. |
//title[@*] | Selects all title elements that have at least one attribute. |
Select several paths
You can select several paths by using the “|” operator in the path expression.
Example
In the following table, we list some path expressions and their results:
Path expression | Result |
---|---|
//book/title \| //book/price | Selects all title and price elements of all book elements. |
//title \| //price | Selects all title and price elements in the document. |
/bookstore/book/title \| //price | Selects all title elements of the book children of bookstore, plus all price elements in the document. |
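As an illustrative sketch only: Python's standard xml.etree.ElementTree does not support the "|" operator, so the union //title | //price is approximated below by walking the tree in document order and keeping the wanted tags. The inline XML copies the bookstore sample, and the names wanted and union are hypothetical.

```python
import xml.etree.ElementTree as ET

# Copy of the bookstore sample document.
XML = """<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

root = ET.fromstring(XML)

# An XPath union returns nodes in document order. root.iter() also walks
# the tree in document order, so filtering by tag gives the same sequence.
wanted = {"title", "price"}
union = [el for el in root.iter() if el.tag in wanted]

tags = [el.tag for el in union]
texts = [el.text for el in union]
print(tags, texts)
```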
Advanced usage of XPath
- Fuzzy matching with contains()
Many web frameworks now generate element IDs dynamically, so an element's ID changes each time the same page is loaded, which makes automated tests harder to write.
<div class="eleWrapper"> <input type="text" class="textfield" name="ID9sLJQnkQyLGLhYShhlJ6gPzHLgvhpKpLzp2Tyh4hyb1b4pnvzxFR!-166749344!1357374592067" id="nt1357374592068" /> </div>
One solution is XPath's contains() function, which matches on a stable part of the attribute value: //input[contains(@id,'nt')]
- XML used for testing
<Root>
<Person ID="1001" >
<Name lang="zh-cn" >Zhang Chengbin</Name>
<Email xmlns="www.quicklearn.cn" > [email protected] </Email>
<Blog>http://cbcye.cnblogs.com</Blog>
</Person>
<Person ID="1002" >
<Name lang="en" >Gary Zhang</Name>
<Email xmlns="www.quicklearn.cn" > [email protected]</Email>
<Blog>http://www.quicklearn.cn</Blog>
</Person>
</Root>
- Find all Person nodes whose Blog value contains the string 'cn'
XPath expression:
/Root//Person[contains(Blog,'cn')]
- Find all Person nodes whose Blog value contains the string 'cn' and whose ID attribute contains '01'
XPath expression:
/Root//Person[contains(Blog,'cn') and contains(@ID,'01')]
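To see what the two contains() queries select, here is a minimal sketch in Python. The standard xml.etree.ElementTree module has no contains() function, so the same condition is expressed with Python's "in" operator. The XML below is a trimmed copy of the test document (the Email elements are left out), and blog_contains is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the test XML: Email elements omitted.
XML = """<Root>
  <Person ID="1001">
    <Name lang="zh-cn">Zhang Chengbin</Name>
    <Blog>http://cbcye.cnblogs.com</Blog>
  </Person>
  <Person ID="1002">
    <Name lang="en">Gary Zhang</Name>
    <Blog>http://www.quicklearn.cn</Blog>
  </Person>
</Root>"""

root = ET.fromstring(XML)


def blog_contains(person, needle):
    """Mimic contains(Blog, needle) for one Person element."""
    return needle in (person.findtext("Blog") or "")


# /Root//Person[contains(Blog,'cn')]
cn_ids = [p.get("ID") for p in root.iter("Person") if blog_contains(p, "cn")]

# /Root//Person[contains(Blog,'cn') and contains(@ID,'01')]
cn_and_01_ids = [
    p.get("ID")
    for p in root.iter("Person")
    if blog_contains(p, "cn") and "01" in p.get("ID", "")
]

print(cn_ids, cn_and_01_ids)
```

Both Blog values contain 'cn', so the first query matches both Person nodes; only the ID 1001 contains '01', so the second query matches one.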
Study notes
1. Locate by the node's own attributes or text
//td[text()='Data Import']
//div[contains(@class,'cux-rightArrowIcon-on')]
//a[text()='register now']
//input[@type='radio' and @value='1']
Multiple conditions (joined with and)
//span[@name='bruce'][text()='bruce1'][1]
Multiple conditions (chained predicates)
//span[@id='bruce1' or text()='bruce2']
Find multiple (either condition may match)
//span[text()='bruce1' and text()='bruce2']
2. Rely on the parent node to locate
//div[@class='x-grid-col-name x-grid-cell-inner']/div//div[@id='dynamicGridTestInstanceformclearuxformdiv']/div//div[@id='test']/input
3. Rely on sub nodes to locate
//div[div[@id='navigation']]
//div[div[@name='listType']]
//div[p[@name='testname']]
4. Mixed type
//div[div[@name='listType']]
//img
//td[a/font[contains(text(),'seleium2 video from zero')]]
//input[@type='checkbox']
5. Advanced part
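//input[@id='123']/following-sibling::input
Find the input siblings that follow the node (append [1] for only the next one)
//input[@id='123']/preceding-sibling::span
Find the span siblings that precede the node (append [1] for only the previous one)
//input[starts-with(@id,'123')]
Match input elements whose id starts with '123'
//span[not(contains(text(),'xpath'))]
Match span elements whose text does not contain 'xpath'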
6. Index
//div/input[2]
//div[@id='position']/span[3]
//div[@id='position']/span[position()=3]
//div[@id='position']/span[position()>3]
//div[@id='position']/span[position()<3]
//div[@id='position']/span[last()]
//div[@id='position']/span[last()-1]
7. Substring interception judgment
<div data-for="result" id="swfEveryCookieWrap"></div>
//*[substring(@id,4,5)='Every']/@id
Takes 5 characters of the attribute, starting at position 4 (XPath positions are 1-based)
//*[substring(@id,4)='EveryCookieWrap']
Takes the attribute from position 4 to the last character
//*[substring-before(@id,'C')='swfEvery']/@id
Matches the part of the attribute before 'C'
//*[substring-after(@id,'C')='ookieWrap']/@id
Matches the part of the attribute after 'C'
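The 1-based positions are easy to get wrong, so the following sketch re-creates the three string functions in plain Python and applies them to the id from the example. These helpers are hypothetical and simplified: they assume an integer start of at least 1 and a non-empty separator, and they do not reproduce XPath's rounding rules for fractional arguments.

```python
def xpath_substring(s, start, length=None):
    """Simplified XPath substring(): start is 1-based and must be >= 1."""
    begin = start - 1  # convert to Python's 0-based index
    if length is None:
        return s[begin:]
    return s[begin:begin + length]


def substring_before(s, sep):
    """Text before the first occurrence of sep, or '' if sep is absent."""
    head, found, _ = s.partition(sep)
    return head if found else ""


def substring_after(s, sep):
    """Text after the first occurrence of sep, or '' if sep is absent."""
    _, _, tail = s.partition(sep)
    return tail


element_id = "swfEveryCookieWrap"

print(xpath_substring(element_id, 4, 5))  # characters 4 through 8
print(xpath_substring(element_id, 4))     # character 4 to the end
print(substring_before(element_id, "C"))
print(substring_after(element_id, "C"))
```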
8. Wildcard*
//span[@*='bruce']
//*[@name='bruce']
9. Axes
//div[span[text()='+++current node']]/parent::div
Find parent node
//div[span[text()='+++current node']]/ancestor::div
Find ancestor node
10. Descendant nodes
//div[span[text()='current note']]/descendant::div/span[text()='123']
//div[span[text()='current note']]//div/span[text()='123']
The two expressions have the same meaning
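11. following and preceding
//span[@class="fk fk_cur"]/../following::a
All a elements that come after the parent node in the document
//span[@class="fk fk_cur"]/../preceding::a[1]
The nearest a element before the parent node (drop [1] to select all preceding a elements)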
XPath extracts text under multiple tags
When writing crawlers, you often use XPath to extract data. For the following code:
<div id="test1">Hello, everyone!</div>
Using XPath extraction is very convenient. Suppose the source code of the web page is in the selector:
data = selector.xpath('//div[@id="test1"]/text()').extract()[0]
This extracts "Hello, everyone!" into the data variable.
But what if you run into the following code?
<div id="test2">beauty, <font color=red>how much is your wechat?</font></div>
If used:
data = selector.xpath('//div[@id="test2"]/text()').extract()[0]
Only "beauty," is extracted.
If used:
data = selector.xpath('//div[@id="test2"]/font/text()').extract()[0]
Only "how much is your wechat?" is extracted.
But the goal is to extract the whole sentence: "beauty, how much is your wechat?"
<div id="test3">
I'm Zuo Qinglong,
<span id="tiger">right white tiger,
<ul>on the rosefinch,
<li>Go down to Xuanwu.</li>
</ul>Lao Niu is in the middle</span>
The tap is in the chest.
</div>
Moreover, the internal tags are not fixed. Given 100 similar pieces of HTML, what is the fastest and most convenient way to extract the text with XPath?
Use XPath's string(.)
Take the third code as an example
data = selector.xpath('//div[@id="test3"]')
info = data.xpath('string(.)').extract()[0]
This extracts the full text of the div, including the text nested inside the span, ul, and li tags, and assigns it to the info variable.
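For comparison, here is a sketch of the same idea using only Python's standard library. It assumes the fragment is well-formed XML (real pages usually need an HTML parser), and it uses Element.itertext(), which walks every text node under an element much as string(.) does. The whitespace cleanup at the end is an optional extra and not part of string(.) itself.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the third HTML fragment.
HTML = """<div id="test3">
I'm Zuo Qinglong,
<span id="tiger">right white tiger,
<ul>on the rosefinch,
<li>Go down to Xuanwu.</li>
</ul>Lao Niu is in the middle</span>
The tap is in the chest.
</div>"""

div = ET.fromstring(HTML)

# div.text holds only the text before the first child tag; this is why a
# text() query with [0] returned just the first fragment earlier.
direct = div.text.strip()

# itertext() yields every text node under the element in document order,
# which is the concatenation that string(.) computes.
full = "".join(div.itertext())

# Optional cleanup: collapse runs of whitespace and newlines.
cleaned = " ".join(full.split())

print(direct)
print(cleaned)
```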