Parsing HTML with Beautiful Soup in Python

Time: 2020-10-1

Abstract

Beautiful Soup is a Python library that can extract data from HTML or XML files. It parses HTML or XML into Python objects so the data can be processed with Python code.

Document environment

  • Centos7.5
  • Python2.7
  • BeautifulSoup4

How to use Beautiful Soup

The basic function of Beautiful Soup is to find and edit HTML tags.

Basic concepts – object types

Beautiful Soup transforms a complex HTML document into a tree structure in which each node is a Python object. Beautiful Soup defines four types of objects: Tag, NavigableString, BeautifulSoup and Comment.

Object type      Description
BeautifulSoup    The entire content of the document
Tag              An HTML tag
NavigableString  The text contained in a tag
Comment          A special type of NavigableString that holds comment content
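
As a quick illustration (a minimal sketch, not from the original article, assuming the lxml parser from the installation section below), the four object types as they appear when a small document is parsed:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title"><!-- a comment -->Some text</p>', 'lxml')

print(type(soup))                  #<class 'bs4.BeautifulSoup'>
print(type(soup.p))                #<class 'bs4.element.Tag'>
print(type(soup.p.contents[0]))    #<class 'bs4.element.Comment'>
print(type(soup.p.contents[1]))    #<class 'bs4.element.NavigableString'>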

Installation and reference

# Beautiful Soup
pip install beautifulsoup4

#Parser
pip install lxml
pip install html5lib
#Initialization
from bs4 import BeautifulSoup

#Method 1: open a file directly
soup = BeautifulSoup(open("index.html"))

#Method 2: specify data
resp = "<html>data</html>"
soup = BeautifulSoup(resp, 'lxml')

#soup is a BeautifulSoup object
print(type(soup))

Tag search and filtering

Basic method

Tag search has two basic methods: find_all() and find(). The find_all() method returns a list of all tags matching the keyword, while find() returns only the first match.

import re

soup = BeautifulSoup(resp, 'lxml')

#Return the first tag named "a"
soup.find("a")

#Return a list of all "a" tags
soup.find_all("a")

## The find_all() method can be abbreviated
soup("a")

#Find all tags whose names start with "b"
for tag in soup.find_all(re.compile("^b")):
  print(tag.name)

#Find all the tags in the list
soup.find_all(["a", "p"])

#Find tags named "p" with CSS class "title"
soup.find_all("p", "title")

#Find tags whose id attribute is "link2"
soup.find_all(id="link2")

#Find all tags that have an id attribute
soup.find_all(id=True)

#Find by several attributes at once
soup.find_all(href=re.compile("elsie"), id='link1')

#Find by attributes that cannot be used as keyword arguments
soup.find_all(attrs={"data-foo": "value"})

#Find a string whose text contains "sisters"
soup.find(string=re.compile("sisters"))

#Gets the specified number of results
soup.find_all("a", limit=2)

#Custom matching method
def has_class_but_no_id(tag):
  return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)

#Use a custom matching function for a single attribute
def not_lacie(href):
    return href and not re.compile("lacie").search(href)
soup.find_all(href=not_lacie)

When find_all() is called on a tag, Beautiful Soup searches all descendants of the current tag. To search only the tag's direct children, pass the parameter recursive=False.

soup.find_all("title", recursive=False)

Extension methods

find_parents()             All parent nodes
find_parent()              The nearest parent node
find_next_siblings()       All following sibling nodes
find_next_sibling()        The first following sibling node
find_previous_siblings()   All preceding sibling nodes
find_previous_sibling()    The first preceding sibling node
find_all_next()            All following elements
find_next()                The first following element
find_all_previous()        All preceding elements
find_previous()            The first preceding element
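
A short sketch (an invented mini-document, not from the original article) showing a few of these methods in use:

from bs4 import BeautifulSoup

html = '<p>Three sisters: <a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a>.</p>'
soup = BeautifulSoup(html, 'lxml')
first_a = soup.find("a", id="link1")

print(first_a.find_parent("p").name)                        #'p' - the nearest enclosing <p>
print([a["id"] for a in first_a.find_next_siblings("a")])   #['link2', 'link3']
print(first_a.find_next("a")["id"])                         #'link2' - the next <a> in document order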

CSS selector

Beautiful Soup supports most CSS selectors (http://www.w3.org/TR/CSS2/selector.html). Pass a string into the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax.

html_doc = """
<html>
<head>
 <title>The Dormouse's story</title>
</head>
<body>
 <p><b>The Dormouse's story</b></p>

 <p>
  Once upon a time there were three little sisters; and their names were
  <a href="http://example.com/elsie" rel="external nofollow">Elsie</a>,
  <a href="http://example.com/lacie" rel="external nofollow">Lacie</a>
  and
  <a href="http://example.com/tillie" rel="external nofollow">Tillie</a>;
  and they lived at the bottom of a well.
 </p>

 <p>...</p>
"""

soup = BeautifulSoup(html_doc)

#All a Tags
soup.select("a")

#Layer by layer search
soup.select("body a")
soup.select("html head title")

#Direct child tags of a tag
soup.select("head > title")
soup.select("p > #link1")

#All sibling tags after the matching tag
soup.select("#link1 ~ .sister")

#First sibling tag after matching tag
soup.select("#link1 + .sister")

#Search by CSS class name
soup.select(".sister")
soup.select("[class~=sister]")

#Search by ID
soup.select("#link1")
soup.select("a#link1")

#Find by multiple IDs
soup.select("#link1,#link2")

#Find by property
soup.select('a[href]')

#Find by property value
soup.select('a[href^="http://example.com/"]')
soup.select('a[href$="tillie"]')
soup.select('a[href*=".com/el"]')

#Limit to one match (returns a list)
soup.select(".sister", limit=1)

#Return only the first match
soup.select_one(".sister")

Tag object methods

Tag attributes

soup = BeautifulSoup('<p class="bold">Extremely bold</p><p class="bold">Extremely bold2</p>')
#Get all p tag objects
tags = soup.find_all("p")
#Get the first p tag object
tag = soup.p
#Type of the tag object
type(tag)
#Tag name
tag.name
#Tag attributes
tag.attrs
#Value of the class attribute
tag['class']
#The text contained in the tag, as a NavigableString object
tag.string

#Return all strings contained in the tag
for string in tag.strings:
  print(repr(string))

#Return all strings contained in the tag, with blank lines removed
for string in tag.stripped_strings:
  print(repr(string))

#Get all the NavigableString content of the tag and its descendants, as a Unicode string
tag.get_text()
##Separated by "|"
tag.get_text("|")
##Separated by "|", with surrounding whitespace stripped
tag.get_text("|", strip=True)

Get child nodes

tag.contents  #Returns a list of the direct child nodes
tag.children  #Returns an iterator over the direct child nodes
for child in tag.children:
  print(child)

tag.descendants  #Recursively returns all descendant nodes
for child in tag.descendants:
  print(child)

Get parent node

tag.parent  #Returns the direct parent tag
tag.parents  #Recursively get all ancestors of the element

for parent in tag.parents:
  if parent is None:
    print(parent)
  else:
    print(parent.name)

Get sibling node

#Next sibling element
tag.next_sibling 

#All sibling elements after the current tag
tag.next_siblings
for sibling in tag.next_siblings:
  print(repr(sibling))

#Previous sibling element
tag.previous_sibling

#All sibling elements before the current tag
tag.previous_siblings
for sibling in tag.previous_siblings:
  print(repr(sibling))

Traversal of elements

Each tag and string in Beautiful Soup is an "element", and the elements are arranged in the order they appear in the HTML, from top to bottom. They can be visited one by one with the traversal properties below.

#The next element of the current tag
tag.next_element

#All elements after the current label
for element in tag.next_elements:
  print(repr(element))

#The previous element of the current tag
tag.previous_element
#All elements before the current label
for element in tag.previous_elements:
  print(repr(element))

Modify tag attributes


soup = BeautifulSoup('<b>Extremely bold</b>')
tag = soup.b

tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1

tag.string = "New link text."
print(tag)

Modify tag content (NavigableString)


soup = BeautifulSoup('<b>Extremely bold</b>')
tag = soup.b
tag.string = "New link text."

Add tag content (NavigableString)

soup = BeautifulSoup("<a>Foo</a>")
tag = soup.a
tag.append("Bar")
tag.contents

#Or, equivalently, append a NavigableString object
from bs4 import NavigableString

new_string = NavigableString("Bar")
tag.append(new_string)
print(tag)

Add comment

A comment is a special type of NavigableString object, so it can also be added with the append() method.


from bs4 import Comment

soup = BeautifulSoup("<a>Foo</a>")
tag = soup.a
new_comment = soup.new_string("Nice to see you.", Comment)
tag.append(new_comment)
print(tag)

Add tag

There are two ways to add tags: one is to add inside a specified tag (the append() method), and the other is to add at a specified position (the insert(), insert_before() and insert_after() methods).

Append method


soup = BeautifulSoup("<b></b>")
tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com")
new_tag.string = "Link text."
tag.append(new_tag)
print(tag)

The insert() method inserts an object (a Tag or NavigableString) at the specified position in the current tag's list of child nodes.


html = '<b><a href="http://example.com/">I linked to <i>example.com</i></a></b>'
soup = BeautifulSoup(html)
tag = soup.a
tag.contents
tag.insert(1, "but did not endorse ")
tag.contents

The insert_before() and insert_after() methods add an element as a sibling node before or after the current tag.


html = '<b><a href="http://example.com/">I linked to <i>example.com</i></a></b>'
soup = BeautifulSoup(html)
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.insert_before(tag)
soup.b

wrap() and unwrap() wrap or unwrap the specified tag element and return the result.

#Wrap
soup = BeautifulSoup("<p>I wish I was bold.</p>")
soup.p.string.wrap(soup.new_tag("b"))
#Output: <b>I wish I was bold.</b>

soup.p.wrap(soup.new_tag("div"))
#Output: <div><p><b>I wish I was bold.</b></p></div>

#Unwrap
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a

a_tag.i.unwrap()
a_tag
#Output: <a href="http://example.com/">I linked to example.com</a>

Delete a tag

html = '<b><a href="http://example.com/">I linked to <i>example.com</i></a></b>'
soup = BeautifulSoup(html)
#Clear all child nodes of the current tag
soup.b.clear()

#Remove the current tag and all its children from the soup and return the removed tag
soup = BeautifulSoup(html)
b_tag = soup.b.extract()
b_tag
soup

#Remove the current tag and all its children from the soup; nothing is returned
soup = BeautifulSoup(html)
soup.b.decompose()

#Replace the current tag with the specified element
soup = BeautifulSoup(html)
tag = soup.i
new_tag = soup.new_tag("p")
new_tag.string = "Don't"
tag.replace_with(new_tag)

Other methods

Output

#Format output
tag.prettify()
tag.prettify("latin-1")
  • After parsing, Beautiful Soup converts the whole document, including HTML special characters (entities), to Unicode. If the document is then converted to a string, the Unicode characters are encoded as UTF-8, so the original HTML entities are not reproduced.
  • Beautiful Soup (through its encoding-detection sub-library) can also convert Microsoft "smart quotes" into HTML or XML special characters.
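
For example, a small sketch (not from the original article) showing that HTML entities become Unicode characters after parsing, and that passing formatter="html" re-creates the entities on output:

from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>&ldquo;Dammit!&rdquo; he said.</p>", "lxml")

#The entities were converted to Unicode curly quotes during parsing
print(soup.p)
#formatter="html" converts Unicode characters back to HTML entities on output
print(soup.p.prettify(formatter="html"))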

Document encoding

After parsing with Beautiful Soup, the whole document is converted to Unicode. Beautiful Soup uses a sub-library for automatic encoding detection ("Unicode, Dammit") to identify the document's original encoding and convert it to Unicode.

soup = BeautifulSoup(html)
soup.original_encoding

#You can also specify the document's encoding manually
soup = BeautifulSoup(html, from_encoding="iso-8859-8")
soup.original_encoding

#To improve the efficiency of automatic encoding detection, some encodings can be excluded in advance
soup = BeautifulSoup(markup, exclude_encodings=["ISO-8859-7"])
When a document is output through Beautiful Soup, the output encoding is UTF-8 by default, regardless of the input document's encoding.
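
As a brief sketch (reusing the html string parsed above), the document can be retrieved as Unicode text or as encoded bytes:

soup = BeautifulSoup(html)
soup.prettify()         #Returns a Unicode string
soup.encode()           #Returns bytes encoded as UTF-8 by default
soup.encode("latin-1")  #Or any encoding you specify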

Document parsers

Beautiful Soup currently supports "lxml", "html5lib" and "html.parser".

soup=BeautifulSoup("<a><b /></a>")
soup
#Output: < HTML > < body > < a > < b ></b></a></body></html>
soup=BeautifulSoup("<a></p>", "lxml")
soup
#Output: < HTML > < body > < a ></a></body></html>
soup=BeautifulSoup("<a></p>", "html5lib")
soup
#Output: < HTML > < head > < head > < body > < a > < p > output</p></a></body></html>
soup=BeautifulSoup("<a></p>", "html.parser")
soup
#Output: < a > 0</a>

Reference documents
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh

The above is the whole content of this article. I hope it helps you in your study, and I hope you will continue to support developeppaer.