A Summary of Methods for Parsing XML in Shell Scripts

Time: 2019-9-11

Preface

A few days ago at work I needed to parse and process some XML files. Because the logic looked complex, I slowly wrote it in Java. The requirement changes often, however. After every change I have to track down the source code for the jar, modify it, and then replace the original jar. This is inconvenient in three ways: the code is awkward to modify, awkward to keep in one place, and it is hard to see what the jar actually does.

For a function that needs this much flexibility, the most convenient and efficient choice is a scripting language such as Python or Ruby, which is quick to develop in and can still handle fairly complex logic. For various reasons, though, some machines at work have no interpreter for these languages installed, so I had to look into ways of parsing XML with shell scripts.

Ultimately, the shell is not well suited to complex logic, but for simple search-and-replace requirements it is very convenient.

Here I mainly use the following three tools:

  • xmllint
  • xpath
  • xml2

The following is a summary of the use of these three tools for future reference.

xmllint

Sketch

xmllint is a small tool built on the C library libxml2, so it is efficient, well supported across systems, and feature-complete. It is usually shipped in the libxml2-utils package, so a command like sudo apt install libxml2-utils will install it.

Functions

xmllint supports at least the following common functions:

  • XPath queries
  • An interactive, shell-like query mode
  • XML well-formedness checking
  • Validation against DTD and XSD
  • Encoding conversion
  • XML formatting
  • Removal of blank (whitespace-only) nodes
  • Timing statistics

In practice, the functions we use most are four: XPath queries, blank removal, formatting, and validation.

For example, suppose we have the following sample.xml:


<books>
  <book id="1">
    <name>book1</name>
    <price>100</price>
  </book>
  <book id="2">
    <name>book2</name>
    <price>200</price>
  </book>
  <book id="3"><name>book3</name><price>300</price>
  </book>
</books>

Run an XPath query:


$ xmllint --xpath "//book[@id=2]/name/text()" sample.xml
book2
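
In a script, the query result usually ends up in a variable. The following is a minimal sketch, assuming xmllint is on the PATH. book_name is a hypothetical helper name, and the demo file only mirrors the structure of sample.xml. The expression is wrapped in the XPath string() function so that exactly one value is printed.

```shell
#!/usr/bin/env bash
# book_name is a hypothetical helper, not part of xmllint.
book_name() {
  local id=$1 file=$2
  # string(...) yields the text of the first matching node ("" if none).
  xmllint --xpath "string(//book[@id='${id}']/name)" "$file"
}

# Throwaway demo file with the same structure as sample.xml.
demo=$(mktemp)
cat > "$demo" <<'EOF'
<books>
  <book id="1"><name>book1</name><price>100</price></book>
  <book id="2"><name>book2</name><price>200</price></book>
</books>
EOF

name=""
if command -v xmllint >/dev/null 2>&1; then
  name=$(book_name 2 "$demo")
  echo "name of book 2: $name"
fi
rm -f "$demo"
```

Because command substitution strips the trailing newline, the variable holds just the value and can be compared or reused directly.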

Remove blanks:


$ xmllint --noblanks sample.xml
<?xml version="1.0"?>
<books><book id="1"><name>book1</name><price>100</price></book><book id="2"><name>book2</name><price>200</price></book><book id="3"><name>book3</name><price>300</price></book></books>

Format:


$ xmllint --format sample.xml
<?xml version="1.0"?>
<books>
  <book id="1">
    <name>book1</name>
    <price>100</price>
  </book>
  <book id="2">
    <name>book2</name>
    <price>200</price>
  </book>
  <book id="3">
    <name>book3</name>
    <price>300</price>
  </book>
</books>

XSD validation:


$ cat sample.xsd
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns="" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:msdata="urn:schemas-microsoft-com:xml-msdata">
  <xs:element name="books" msdata:IsDataSet="true" msdata:Locale="en-US">
    <xs:complexType>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element name="book">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string" minOccurs="0" msdata:Ordinal="0" />
              <xs:element name="price" type="xs:string" minOccurs="0" msdata:Ordinal="1" />
            </xs:sequence>
            <xs:attribute name="id" type="xs:string" />
          </xs:complexType>
        </xs:element>
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>

$ xmllint --noout --schema sample.xsd sample.xml
sample.xml validates

Note: the validation result is written to stderr. By default, the tool also echoes the parsed document to stdout; adding the --noout option suppresses that echo.
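
In a script it is usually the exit status, not the message, that matters. The following is a minimal sketch, assuming xmllint is on the PATH and returns 0 only when the document validates. validate_xml is a hypothetical helper name, and the tiny schema and documents are made up for the demo.

```shell
#!/usr/bin/env bash
# validate_xml is a hypothetical helper name.
validate_xml() {
  local schema=$1 file=$2
  # --noout hides the echoed document; the verdict on stderr is discarded
  # because only the exit status (0 = valid) is used here.
  xmllint --noout --schema "$schema" "$file" 2>/dev/null
}

# Tiny throwaway schema and documents, made up for this demo only.
work=$(mktemp -d)
cat > "$work/demo.xsd" <<'EOF'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="books" type="xs:string"/>
</xs:schema>
EOF
echo '<books>ok</books>' > "$work/good.xml"
echo '<shelf/>' > "$work/bad.xml"

good=skipped bad=skipped
if command -v xmllint >/dev/null 2>&1; then
  if validate_xml "$work/demo.xsd" "$work/good.xml"; then good=valid; else good=invalid; fi
  if validate_xml "$work/demo.xsd" "$work/bad.xml"; then bad=valid; else bad=invalid; fi
  echo "good.xml: $good, bad.xml: $bad"
fi
rm -rf "$work"
```

To keep the error details for a log, redirect stderr to a file instead of /dev/null.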

Reading from a pipe:

By default, xmllint takes a file name as its argument. To feed it data through a pipe instead, pass - as the file name:


$ cat sample.xml |xmllint --format -
<?xml version="1.0"?>
<books>
  <book id="1">
    <name>book1</name>
    <price>100</price>
  </book>
  <book id="2">
    <name>book2</name>
    <price>200</price>
  </book>
  <book id="3">
    <name>book3</name>
    <price>300</price>
  </book>
</books>
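
The same - placeholder also works when the XML comes from a variable or from another command rather than a file. A small sketch, assuming xmllint is on the PATH; the XML string is made up for the demo.

```shell
#!/usr/bin/env bash
# XML held in a variable; the content is made up for this demo.
xml='<books><book id="1"><name>book1</name></book></books>'

pretty=""
if command -v xmllint >/dev/null 2>&1; then
  # "-" tells xmllint to read the document from stdin.
  pretty=$(printf '%s\n' "$xml" | xmllint --format -)
  printf '%s\n' "$pretty"
fi
```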

xpath

Sketch

The xpath tool is actually a packaged Perl script of only about 200 lines. Its purpose is narrower: it provides XPath queries and nothing else. It is usually shipped in the libxml-xpath-perl package, so a command like sudo apt install libxml-xpath-perl will install it. Systems such as SUSE ship it by default.

Functions

The versions installed on different systems may differ, but the basic functions are similar:


$ xpath -e '//book/name/text()' sample.xml
Found 3 nodes in sample.xml:
-- NODE --
book1
-- NODE --
book2
-- NODE --
book3

By default, the query results are written to stdout and the descriptive messages to stderr. To collect only the results, redirect stderr to /dev/null or add the -q option:


$ xpath -e '//book/name/text()' sample.xml 2>/dev/null
book1
book2
book3
$ xpath -q -e '//book/name/text()' sample.xml
book1
book2
book3
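
Since each result sits on its own line, the output can be read straight into a bash array. A sketch under these assumptions: bash 4 or later (for mapfile), an xpath build that accepts -q and -e as in the commands above, and a sample.xml in the current directory unless another file is given as the first argument.

```shell
#!/usr/bin/env bash
# Assumes bash 4+ (for mapfile) and an xpath build that accepts -q and -e.
file=${1:-sample.xml}

names=()
if command -v xpath >/dev/null 2>&1 && [ -r "$file" ]; then
  # -q drops the descriptive messages, leaving one text node per line.
  mapfile -t names < <(xpath -q -e '//book/name/text()' "$file" 2>/dev/null)
fi

echo "found ${#names[@]} name(s)"
for n in "${names[@]}"; do
  echo "name: $n"
done
```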

The xpath tool behaves slightly differently from the XPath feature of xmllint. When an expression matches multiple results, xpath prints them on separate lines, while xmllint joins them into a single line:


$ xmllint --xpath "//book/name/text()" sample.xml
book1book2book3
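
When one value per line is needed from xmllint itself, one option is to ask for the number of matches first and then fetch each match by position. A minimal sketch, assuming xmllint is on the PATH; each_name is a hypothetical helper and the demo file mirrors sample.xml. Newer libxml2 releases may already print one node per line, in which case the loop is unnecessary.

```shell
#!/usr/bin/env bash
# each_name is a hypothetical helper: prints every //book/name on its own line.
each_name() {
  local file=$1 count i
  # count(...) is printed as a plain number, e.g. 3.
  count=$(xmllint --xpath 'count(//book/name)' "$file")
  for ((i = 1; i <= count; i++)); do
    # (expr)[i] selects the i-th match in document order.
    xmllint --xpath "string((//book/name)[$i])" "$file"
  done
}

demo=$(mktemp)
cat > "$demo" <<'EOF'
<books>
  <book id="1"><name>book1</name></book>
  <book id="2"><name>book2</name></book>
  <book id="3"><name>book3</name></book>
</books>
EOF

names=""
if command -v xmllint >/dev/null 2>&1; then
  names=$(each_name "$demo")
  printf '%s\n' "$names"
fi
rm -f "$demo"
```

This starts one xmllint process per match, so it only suits small result sets.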

xml2

Sketch

xml2 is a tool that few people seem to know about, but in some scenarios it combines wonderfully with other commands. The developer’s blog for this tool appears to be dead, but by the look of it, it is a small tool written in C on top of the libxml2 library. It usually lives in the xml2 package, so a command like sudo apt install xml2 will install it.

Functions

The package contains six commands: xml2, 2xml, html2, 2html, csv2, and 2csv. They convert XML, HTML, and CSV to and from what the author calls “flat format”. For instance:


$ cat sample.xml |xml2
/books/book/@id=1
/books/book/name=book1
/books/book/price=100
/books/book
/books/book/@id=2
/books/book/name=book2
/books/book/price=200
/books/book
/books/book/@id=3
/books/book/name=book3
/books/book/price=300
$ cat sample.xml |xml2|2xml
<books><book id="1"><name>book1</name><price>100</price></book><book id="2"><name>book2</name><price>200</price></book><book id="3"><name>book3</name><price>300</price></book></books>

This custom format is simple and ingenious. Some lines open a new node (/books/book), some assign a value to a node (/books/book/name=book1), and some assign a value to a node's attribute (/books/book/@id=1). The notation is very similar to XPath, but not exactly the same. The two commands in each pair are inverses of each other, so piping xml2 into 2xml gives back an equivalent document.
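
Because every line of the flat format is self-contained, ordinary text tools can query it. A minimal sketch: flat_values is a hypothetical helper that prints the value of every line assigning to a given path. The flat-format text is inlined (a shortened copy of the xml2 output above) so the demo runs without xml2 installed.

```shell
#!/usr/bin/env bash
# flat_values is a hypothetical helper: reads flat format on stdin and
# prints the value of every line that assigns to the given path.
flat_values() {
  local path=$1
  awk -v p="${path}=" 'index($0, p) == 1 { print substr($0, length(p) + 1) }'
}

# Flat-format text for two books, inlined so the demo does not need xml2.
flat='/books/book/@id=1
/books/book/name=book1
/books/book/price=100
/books/book
/books/book/@id=2
/books/book/name=book2
/books/book/price=200'

names=$(printf '%s\n' "$flat" | flat_values /books/book/name)
total=$(printf '%s\n' "$flat" | flat_values /books/book/price |
  awk '{ sum += $1 } END { print sum }')

printf 'names:\n%s\n' "$names"
echo "total price: $total"
```

With a real file, the same helper would sit at the end of a pipe, for example xml2 < sample.xml | flat_values /books/book/name.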

So what is this conversion good for? We often need to create XML files, but generating them dynamically in XML syntax is awkward. Using the flat format as an intermediate step makes this very convenient:


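#!/usr/bin/env bash
# Build the document in flat format first, then convert it to XML.
tempFile=$(mktemp tmp.XXXX)

# Append one book entry, in flat format, to the temporary file.
function addBook(){
 local id=$1
 local name=$2
 local price=$3
 echo "/books/book">>"$tempFile"
 echo "/books/book/@id=$id">>"$tempFile"
 echo "/books/book/name=$name">>"$tempFile"
 echo "/books/book/price=$price">>"$tempFile"
}
function main(){
 addBook 1 book1 100
 addBook 2 book2 200
 addBook 3 book3 300
 # Convert flat format to XML, pretty-print it, and write new_sample.xml.
 cat "$tempFile"|2xml|xmllint --format --output new_sample.xml -
 rm "$tempFile"
}
main "$@"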

The script above generates a new_sample.xml with the same content as sample.xml.
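
The same idea extends to data-driven generation, where the book list comes from records instead of hard-coded calls. A sketch, assuming a comma-separated input with id, name, and price columns (a made-up format for illustration). emit_book is a hypothetical helper, and the conversion step only runs if 2xml is installed.

```shell
#!/usr/bin/env bash
# emit_book is a hypothetical helper: prints one book in flat format.
emit_book() {
  printf '/books/book\n/books/book/@id=%s\n/books/book/name=%s\n/books/book/price=%s\n' \
    "$1" "$2" "$3"
}

# Read "id,name,price" records (made-up demo data) and collect flat format.
flat=$(
  while IFS=, read -r id name price; do
    emit_book "$id" "$name" "$price"
  done <<'EOF'
1,book1,100
2,book2,200
3,book3,300
EOF
)

printf '%s\n' "$flat"

# Optional conversion step; skipped when 2xml is not installed.
if command -v 2xml >/dev/null 2>&1; then
  printf '%s\n' "$flat" | 2xml
fi
```

Values that contain commas or newlines would break this simple record format, so it only suits plain fields like the ones shown.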

Summary

That is all for this article. I hope it is of some use for your study or work. If you have any questions, feel free to leave a comment. Thank you for your support of developpaer.
