A tutorial on using the Nokogiri gem to manipulate XML data in Ruby

Time:2021-12-29

Installation

On Ubuntu, first install the libxml2 and libxslt development packages:

$ sudo apt-get install libxml2-dev libxslt1-dev

Then install the gem:

$ gem install nokogiri

Options
Nokogiri provides several options for parsing, including:

  • NOBLANKS: remove blank nodes
  • NOENT: substitute entities
  • NOERROR: suppress error reports
  • STRICT: strict parsing; an error is raised when the document is malformed
  • NONET: prevent any network connections during parsing

Example of setting options (via a block):

doc = Nokogiri::XML(File.open("blossom.xml")) do |config|
  config.strict.nonet
end

Or, by setting the option flags directly:

doc = Nokogiri::XML(File.open("blossom.xml")) do |config|
  config.options = Nokogiri::XML::ParseOptions::STRICT | Nokogiri::XML::ParseOptions::NONET
end

Parsing

Nokogiri can parse documents from files, strings, URLs, and so on. Parsing relies on two methods, Nokogiri::HTML and Nokogiri::XML:

Read string:

html_doc = Nokogiri::HTML("<html><body><h1>Mr. Belvedere Fan Club</h1></body></html>")
xml_doc = Nokogiri::XML("<root><aliens><alien><name>Alf</name></alien></aliens></root>")

Read file:

f = File.open("blossom.xml")
doc = Nokogiri::XML(f)
f.close

Read URL:

require 'open-uri'
# Kernel#open no longer accepts URLs as of Ruby 3.0; use URI.open instead
doc = Nokogiri::HTML(URI.open("http://www.threescompany.com/"))

Finding nodes

You can search with XPath or CSS selectors. For example, given the following XML (parsed into @doc):

<books>
  <book>
    <title>Stars</title>
  </book>
  <book>
    <title>Moon</title>
  </book>
</books>

xpath:

@doc.xpath("//title")

css:

@doc.css("book title")

Modify node content

title = @doc.css("book title").first
title.content = 'new title'
puts @doc.to_html
 
# =>
...
 <title>new title</title>
...

Modify the structure of nodes

first_title = @doc.at_css('title')
second_book = @doc.css('book').last
 
# Move the first title into the second book
first_title.parent = second_book
 
# Or place it anywhere else, e.g. right after the second book
second_book.add_next_sibling(first_title)
 
# You can also rename the node and set its attributes, such as class
first_title.name = 'h2'
first_title['class'] = 'red_color'
puts @doc.to_html
# => <h2 class="red_color">...</h2>
 
# You can also create a new node
third_book = Nokogiri::XML::Node.new('book', @doc)
third_book.content = 'I am the third book'
second_book.add_next_sibling(third_book)
puts @doc.to_html
# =>
...
<books>
 ...
 <book>I am the third book</book>
</books>