How to use the REXML XML processing library in Ruby

Time:2021-12-30

Using REXML as a tree
REXML aims to be "just enough", and for the most part it does the job well. REXML actually supports two different styles of XML processing: "tree" and "stream". The first style is a simpler version of what DOM tries to do; the second is a simpler version of what SAX tries to do. Let's look at the tree style first. Suppose we want to extract data from the same address book document used in the previous example. The following examples come from a modified eval.rb that I created. The standard eval.rb (linked from the Ruby tutorial) can display very long results when an expression evaluates to a complex object, whereas my eval.rb prints nothing when there are no errors:
How to use rexml to reference nested data

ruby> require "rexml/document"
ruby> include REXML
ruby> addrbook = (Document.new File.new "address.xml").root
ruby> persons = addrbook.elements.to_a("//person")
ruby> puts persons[1].elements["address"].attributes["city"]
New York

This kind of expression is very common. The .to_a() method creates an array of all the <person> elements in the document, which can be useful to keep under a name for other purposes. An element is a bit like a DOM node, but it is actually closer to the XML itself (and easier to use). The argument to .to_a() is an XPath, which in this case matches all <person> elements anywhere in the document. If we only want the elements at the first level, we can use:
Create an array of matching elements

ruby> persons = addrbook.elements.to_a("/addressbook/person")
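To see the difference between the two XPaths without needing address.xml, here is a minimal sketch that parses an inline document. The sample XML, the <group> wrapper, the name attribute, and the helper person_names are my own assumptions for illustration and are not part of the original address book. It also assumes the rexml gem is installed, as it is with recent Ruby releases.

```ruby
require "rexml/document"

# Hypothetical sample data: one <person> at the top level, one nested deeper.
NESTED_XML = <<~XML
  <addressbook>
    <person name="Ann"/>
    <group>
      <person name="Bob"/>
    </group>
  </addressbook>
XML

# Returns the name attribute of every <person> matched by the given XPath.
def person_names(root, xpath)
  root.elements.to_a(xpath).map { |person| person.attributes["name"] }
end

root = REXML::Document.new(NESTED_XML).root
p person_names(root, "//person")             # matches <person> at any depth
p person_names(root, "/addressbook/person")  # matches first-level children only
```

The first call should list both people, while the second should list only the person that is a direct child of <addressbook>.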

We can even use XPath more directly, as an overloaded index on the .elements property. For example:
Another way to reference nested data using rexml

ruby> puts addrbook.elements["//person[2]/address"].attributes["city"]
New York
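
Both lookups shown so far reach the same person: persons[1] counts from 0, as Ruby arrays do, while //person[2] counts from 1, as XPath does. The following is a minimal sketch to check this side by side. The inline XML is a hypothetical, trimmed-down stand-in for address.xml, and the two helper methods are names I made up for the comparison.

```ruby
require "rexml/document"

# Hypothetical stand-in for address.xml with two people.
INDEX_XML = <<~XML
  <addressbook>
    <person><address city="Sacramento" state="CA"/></person>
    <person><address city="New York" state="NY"/></person>
  </addressbook>
XML

# Second person via a Ruby array: index 1, because arrays are 0-based.
def second_city_via_array(root)
  persons = root.elements.to_a("//person")
  persons[1].elements["address"].attributes["city"]
end

# Second person via XPath: position 2, because XPath is 1-based.
def second_city_via_xpath(root)
  root.elements["//person[2]/address"].attributes["city"]
end

root = REXML::Document.new(INDEX_XML).root
puts second_city_via_array(root)
puts second_city_via_xpath(root)
```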

Note that XPath uses a 1-based index, unlike Ruby and Python arrays, which use a 0-based index. In other words, we are still checking the city of the same person. Printing a REXML element displays its XML source:
XML source code for displaying elements with rexml

ruby> puts addrbook.elements["//person[2]/address"]
<address city='New York' street='118 St.' number='344' state='NY'/>
ruby> puts addrbook.elements["//person[2]/contact-info"]
<contact-info>
 <email address='[email protected]'/>
 <home-phone number='03-3987873'/>
</contact-info>

In addition, XPath does not have to match only one element. We have seen this when defining the persons array, but another example emphasizes this:
Match multiple elements to XPath

ruby> puts addrbook.elements.to_a("//person/address[@state='CA']")
<address city='Sacramento' street='Spruce Rd.' number='99' state='CA'/>
<address city='Los Angeles' street='Pine Rd.' number='1234' state='CA'/>

In contrast, indexing the .elements property yields only the first matching element:
When XPath matches only the first occurrence

ruby> puts addrbook.elements["//person/address[@state='CA']"]
<address city='Sacramento' street='Spruce Rd.' number='99' state='CA'/>

You can also evaluate XPath expressions through the XPath class in REXML, which provides methods such as .first(), .each(), and .match().
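
As a minimal sketch of the XPath class, the three methods can be called directly with a document (or element) and a path. The inline XML is a hypothetical, shortened address book, and CA_PATH, first_ca_city, and all_ca_cities are names invented for this example.

```ruby
require "rexml/document"

# Hypothetical stand-in for address.xml.
XPATH_XML = <<~XML
  <addressbook>
    <person><address city="Sacramento" state="CA"/></person>
    <person><address city="New York" state="NY"/></person>
    <person><address city="Los Angeles" state="CA"/></person>
  </addressbook>
XML

CA_PATH = "//person/address[@state='CA']"

# XPath.first returns only the first matching node.
def first_ca_city(doc)
  REXML::XPath.first(doc, CA_PATH).attributes["city"]
end

# XPath.match returns an array of every matching node.
def all_ca_cities(doc)
  REXML::XPath.match(doc, CA_PATH).map { |addr| addr.attributes["city"] }
end

doc = REXML::Document.new(XPATH_XML)
puts first_ca_city(doc)
p all_ca_cities(doc)

# XPath.each yields each matching node to the block.
REXML::XPath.each(doc, CA_PATH) { |addr| puts addr.attributes["city"] }
```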
A distinctive idiom for REXML elements is the .each iterator. Although Ruby has loop constructs for working through collections, Ruby programmers usually prefer iterator methods that pass control to a code block. The following two constructs are equivalent, but the second has a more natural Ruby feel:
Iterate by matching XPath in rexml

ruby> for addr in addrbook.elements.to_a("//address[@state='CA']")
  |  puts addr.attributes["city"]
  | end
Sacramento
Los Angeles
ruby> addrbook.elements.each("//address[@state='CA']") {
  |  |addr| puts addr.attributes["city"]
  | }
Sacramento
Los Angeles

Using REXML as a stream
For "just enough" purposes, the tree approach of REXML is probably the simplest way to handle XML in Ruby. But REXML also provides a stream approach, which is like a lighter variant of SAX. As with SAX, REXML does not give application programmers a default data structure built from the XML document. Instead, a "listener" or "handler" class is responsible for providing a set of methods that respond to the various events in the document stream. Common events include start tags, end tags, element text, and so on.
Although streaming is far from as easy as working with a tree, it is usually much faster. The REXML tutorial claims that streaming is 1500 times faster. Although I haven't tried to benchmark it, I suspect this is a limiting case (my small examples also finish instantly in tree mode). In short, if speed matters, the difference is likely to be significant.
Let’s look at a very simple example that does the same thing as the “list California cities” example above. It is relatively simple to extend it for complex document processing:
Stream processing of XML documents in rexml

ruby> require "rexml/document"
ruby> require "rexml/streamlistener"
ruby> include REXML
ruby> class Handler
  |  include StreamListener
  |  def tag_start name, attrs
  |    if name=="address" and attrs.assoc("state")[1]=="CA"
  |     puts attrs.assoc("city")[1]
  |    end
  |  end
  | end
ruby> Document.parse_stream((File.new "address.xml"), Handler.new)
Sacramento
Los Angeles

One thing to note in the stream processing example is that tag attributes are passed as an array of arrays, which takes a little more work than a hash (but is probably faster for the library to create).
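
Here is a minimal sketch of a listener that collects its results instead of printing them, which makes it easier to reuse. It calls to_h on the attributes so that it works whether the parser passes a hash or an array of name/value pairs. The inline XML is a hypothetical stand-in for address.xml, and CityCollector and cities_in are names invented for this example.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical stand-in for address.xml.
STREAM_XML = <<~XML
  <addressbook>
    <person><address city="Sacramento" state="CA"/></person>
    <person><address city="New York" state="NY"/></person>
    <person><address city="Los Angeles" state="CA"/></person>
  </addressbook>
XML

# Collects the city of every <address> start tag in the given state.
class CityCollector
  include REXML::StreamListener
  attr_reader :cities

  def initialize(state)
    @state = state
    @cities = []
  end

  # Called by the stream parser for each start tag.
  def tag_start(name, attrs)
    attrs = attrs.to_h  # accepts a hash or an array of [name, value] pairs
    @cities << attrs["city"] if name == "address" && attrs["state"] == @state
  end
end

# Streams the XML through a collector and returns the matching cities.
def cities_in(xml, state)
  listener = CityCollector.new(state)
  REXML::Document.parse_stream(xml, listener)
  listener.cities
end

puts cities_in(STREAM_XML, "CA")
```

Events the listener does not define, such as end tags and text, fall through to the empty default methods in StreamListener.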

Encoding issues
All text nodes in REXML are encoded in UTF-8, and all calling code should be aware of this. Any string your program passes to REXML must be encoded in UTF-8.

REXML cannot always guess the encoding of your text correctly, so it always assumes UTF-8. REXML also does not warn you if you try to add text in another encoding; whoever adds the text must make sure it is UTF-8. Standard 7-bit ASCII is fine as it is. ISO-8859-1 text must be converted to UTF-8 before it is added, which you can do with text.unpack("C*").pack("U*"). Changing the encoding for output is supported only by Document.write() and Document.to_s(). If you need to output a single node in a specific encoding, you must wrap the output object with Output.

e = Element.new "a"                       # Element.new takes a tag name, not markup
e.text = "f\u00fcr"                       # 'für', supplied as UTF-8
o = ''
e.write( Output.new( o, "ISO-8859-1" ) )  # o receives the ISO-8859-1 bytes

You can pass any supported encoding to Output.
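
The following is a minimal sketch of the full round trip: ISO-8859-1 bytes are converted to UTF-8 before they go into a node, and the node is then written back out as ISO-8859-1. The helper names latin1_to_utf8 and serialize are my own. The sketch writes through REXML::Formatters::Default rather than Element#write, because Element#write is marked as deprecated in current REXML releases.

```ruby
require "rexml/document"

# Converts a string of ISO-8859-1 bytes to UTF-8. This works because each
# ISO-8859-1 byte value equals the Unicode code point of its character.
def latin1_to_utf8(bytes)
  bytes.unpack("C*").pack("U*")
end

# Serializes an element into a new string in the requested encoding.
def serialize(element, encoding)
  out = +""
  REXML::Formatters::Default.new.write(element, REXML::Output.new(out, encoding))
  out
end

e = REXML::Element.new("a")
e.text = latin1_to_utf8("f\xFCr".b)  # the text node now holds UTF-8
p serialize(e, "ISO-8859-1").bytes   # byte values of the ISO-8859-1 output
p serialize(e, "UTF-8").bytes        # byte values of the UTF-8 output
```

In the ISO-8859-1 output the ü should occupy a single byte (0xFC), whereas the UTF-8 output should use two bytes for it.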