Python 3 standard library: xml.etree.ElementTree XML manipulation API

Time:2020-5-30

1.  xml.etree.ElementTree XML manipulation API

The ElementTree library provides tools for parsing XML using event-based and document-based APIs, searching parsed documents with XPath expressions, and creating new documents or modifying existing ones.

1.1 parsing XML documents

Parsed XML documents are represented in memory by ElementTree and Element objects, connected in a tree structure that mirrors the way nodes are nested in the XML document.

When a complete document is parsed with parse(), an ElementTree instance is returned. The tree knows about all of the data in the input document, and its nodes can be searched or manipulated in place. This flexibility makes the parsed document more convenient to work with, but it usually takes more memory than an event-based approach because the whole document must be loaded at once.

For simple, small documents (such as the podcast list below, represented as an OPML outline), the memory footprint is small.

podcasts.opml:

My Podcasts
    Sat, 06 Aug 2016 15:53:26 GMT
    Sat, 06 Aug 2016 15:53:26 GMT

To parse this document, you need to pass an open file handle to parse().

from xml.etree import ElementTree

with open('podcasts.opml', 'rt') as f:
    tree = ElementTree.parse(f)

print(tree)

This reads the data, parses the XML, and returns an ElementTree object.
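As a minimal sketch of what the returned object offers, the snippet below parses a small made-up OPML string (hypothetical data, not the article's podcasts.opml). io.StringIO stands in for the open file handle, and getroot() returns the top-level Element.

```python
import io
from xml.etree import ElementTree

# Hypothetical sample data, used only for illustration.
SAMPLE_OPML = """<opml version="1.0">
<head><title>Sample</title></head>
<body>
  <outline text="Group A">
    <outline text="Feed 1" xmlUrl="http://example.com/feed1.xml"/>
  </outline>
</body>
</opml>"""

# StringIO behaves like a text-mode file handle.
tree = ElementTree.parse(io.StringIO(SAMPLE_OPML))
root = tree.getroot()

print(root.tag)                        # tag name of the top-level element
print(root.attrib['version'])          # attributes behave like a dictionary
print([child.tag for child in root])   # direct children, in document order
```

Iterating over an Element directly visits only its immediate children, which is often enough for a quick look at the structure of a document.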

1.2 traversing the parse tree

To visit all of the child nodes in order, use iter() to create a generator that iterates over the ElementTree instance.

from xml.etree import ElementTree

with open('podcasts.opml', 'rt') as f:
    tree = ElementTree.parse(f)

for node in tree.iter():
    print(node.tag)

This example prints the entire tree, one tag at a time.

To print only the podcast group names and feed URLs, skip the data in the header section by iterating over just the outline nodes, and print the text and xmlUrl attributes by looking up their values in the attrib dictionary.

from xml.etree import ElementTree

with open('podcasts.opml', 'rt') as f:
    tree = ElementTree.parse(f)

for node in tree.iter('outline'):
    name = node.attrib.get('text')
    url = node.attrib.get('xmlUrl')
    if name and url:
        print('  %s' % name)
        print('    %s' % url)
    else:
        print(name)

The 'outline' argument to iter() limits processing to nodes with the tag 'outline'.
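The same filtered iteration can also collect the feeds into a data structure instead of printing them. The sketch below assumes a made-up OPML string (hypothetical data) in which group nodes have no xmlUrl attribute, and relies on iter() visiting nodes in document order so that each feed follows its group.

```python
from xml.etree import ElementTree

# Hypothetical sample data, used only for illustration.
SAMPLE_OPML = """<opml version="1.0">
<body>
  <outline text="Group A">
    <outline text="Feed 1" xmlUrl="http://example.com/feed1.xml"/>
    <outline text="Feed 2" xmlUrl="http://example.com/feed2.xml"/>
  </outline>
  <outline text="Group B">
    <outline text="Feed 3" xmlUrl="http://example.com/feed3.xml"/>
  </outline>
</body>
</opml>"""

root = ElementTree.fromstring(SAMPLE_OPML)

feeds_by_group = {}
current_group = None
for node in root.iter('outline'):
    url = node.attrib.get('xmlUrl')
    if url is None:
        # No feed URL: treat the node as a group heading.
        current_group = node.attrib.get('text')
        feeds_by_group.setdefault(current_group, [])
    else:
        # A feed belongs to the most recently seen group.
        feeds_by_group.setdefault(current_group, []).append(url)

for group, urls in feeds_by_group.items():
    print(group, urls)
```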

1.3 finding nodes in a document

Walking the entire tree to search for relevant nodes can be error-prone. The previous example had to look at each outline node to decide whether it was a group (a node with only a text attribute) or a podcast (a node with both text and xmlUrl). To produce a simple list of podcast feed URLs, without names or groups, the logic can be simplified by using findall() to look for nodes with more descriptive search characteristics.

As a first pass at converting the first version above, an XPath argument can be used to look for all of the outline nodes.

from xml.etree import ElementTree

with open('podcasts.opml', 'rt') as f:
    tree = ElementTree.parse(f)

for node in tree.findall('.//outline'):
    url = node.attrib.get('xmlUrl')
    if url:
        print(url)

The logic in this version is not substantially different from the version using iter(). It still has to check for the presence of the URL, except that it does not print the group name when the URL is missing.

The outline nodes are only nested two levels deep. Changing the search path to .//outline/outline takes advantage of this, so the loop processes only the second level of outline nodes.

from xml.etree import ElementTree

with open('podcasts.opml', 'rt') as f:
    tree = ElementTree.parse(f)

for node in tree.findall('.//outline/outline'):
    url = node.attrib.get('xmlUrl')
    print(url)

All outline nodes nested two levels deep in the input are expected to have an xmlUrl attribute pointing to the podcast feed, so the loop can skip the check before using the attribute.

However, this version is tied to the current structure, so it will stop working if the outline nodes are ever reorganized into a deeper tree.
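One way to avoid depending on the nesting depth is an XPath attribute predicate, which ElementTree's limited XPath support includes. The sketch below uses a made-up OPML string (hypothetical data) with feeds at different depths, and assumes that only podcast nodes carry an xmlUrl attribute.

```python
from xml.etree import ElementTree

# Hypothetical sample data with feeds at two different depths.
SAMPLE_OPML = """<opml version="1.0">
<body>
  <outline text="Group A">
    <outline text="Feed 1" xmlUrl="http://example.com/feed1.xml"/>
  </outline>
  <outline text="Group B">
    <outline text="Subgroup">
      <outline text="Feed 2" xmlUrl="http://example.com/feed2.xml"/>
    </outline>
  </outline>
</body>
</opml>"""

root = ElementTree.fromstring(SAMPLE_OPML)

# [@xmlUrl] keeps only outline elements that have that attribute,
# at any depth below the root.
urls = [
    node.attrib['xmlUrl']
    for node in root.findall('.//outline[@xmlUrl]')
]
print(urls)
```

Because the predicate guarantees the attribute is present, the loop can index attrib directly without a separate check.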

1.4 parsed node attributes

The items returned by findall() and iter() are Element objects, each representing a node in the XML parse tree. Each Element has attributes for pulling data out of the XML. A somewhat contrived example input file, data.xml, illustrates this behavior.

Regular text.
    Regular text."Tail" text.
    
    
        That & This

The XML attributes of a node are available through its attrib property, which acts like a dictionary.

from xml.etree import ElementTree

with open('data.xml', 'rt') as f:
    tree = ElementTree.parse(f)

node = tree.find('./with_attributes')
print(node.tag)
for name,value in sorted(node.attrib.items()):
    print(name,value)

The node on line 5 of the input file has two attributes, name and foo.

You can also get the text content of the node and the tail text after the end tag.

from xml.etree import ElementTree

with open('data.xml', 'rt') as f:
    tree = ElementTree.parse(f)

for path in ['./child','./child_with_tail']:
    node = tree.find(path)
    print(node.tag)
    print('child node text:',node.text)
    print('and tail text:',node.tail)

The child node on line 3 contains embedded text, and the node on line 4 has both text and tail text (including whitespace).

XML entity references embedded in the document are converted to the appropriate characters before values are returned.

from xml.etree import ElementTree

with open('data.xml', 'rt') as f:
    tree = ElementTree.parse(f)

node = tree.find('entity_expansion')
print(node.tag)
print('in attribute:',node.attrib['attribute'])
print('in text:',node.text.strip())

This automatic conversion means that the implementation detail of how certain characters are represented in an XML document can be ignored.
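The text, tail, and entity handling described above can be checked on a tiny in-memory document. The sketch below uses a made-up XML string (hypothetical data, not the article's data.xml) parsed with fromstring().

```python
from xml.etree import ElementTree

# Hypothetical sample data, used only for illustration.
SAMPLE = (
    '<top>'
    '<item note="a &amp; b">x &lt; y</item> trailing text'
    '</top>'
)

root = ElementTree.fromstring(SAMPLE)
item = root.find('item')

# Entity references are already expanded in attributes and text.
print(repr(item.attrib['note']))
print(repr(item.text))

# The tail is the text between this node's end tag and the next tag.
print(repr(item.tail))
```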

1.5 monitoring events during parsing

The other API for working with XML documents is event-based. The parser generates start events for opening tags and end events for closing tags. Data can be extracted from the document during the parsing phase by iterating over the event stream, which is convenient when the whole document does not need to be manipulated afterwards or held in memory.

There are the following event types:

start: A new tag has been encountered. The closing angle bracket of the tag has been processed, but its contents have not.

end: The closing angle bracket of a closing tag has been processed. All of the node's children have already been processed.

start-ns: A namespace declaration starts.

end-ns: A namespace declaration ends.

iterparse() returns an iterable that produces tuples containing the name of the event and the node that triggered it.

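from xml.etree.ElementTree import iterparse

depth = 0
prefix_width = 8
prefix_dots = '.' * prefix_width
line_template = ''.join([
    '{prefix:<0.{prefix_len}}',
    '{event:<8}',
    '{suffix:<{suffix_len}} ',
    '{node.tag}',
])
# The listing is cut off above; the rest is a reconstruction (an assumption).
for (event, node) in iterparse('podcasts.opml', ['start', 'end']):
    depth -= (event == 'end')  # step back out before printing an end event
    print(line_template.format(
        prefix=prefix_dots, prefix_len=depth * 2, event=event,
        suffix='', suffix_len=prefix_width - depth * 2, node=node))
    depth += (event == 'start')  # children are printed one level deeper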

By default, only end events are generated. To see other events, pass the list of desired event names to iterparse().

Event-style processing is more natural for some operations, such as converting XML input to another format. This technique can be used to convert the podcast list (from the earlier examples) from an XML file to a CSV file, so it can be loaded into a spreadsheet or database application.
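The namespace events carry a different payload from the element events. The sketch below requests all four event types for a made-up namespaced document (hypothetical data): for start-ns the second tuple item is a (prefix, uri) pair, and for end-ns it is None.

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical sample data, used only for illustration.
SAMPLE = b'<root xmlns:ex="http://example.com/ns"><ex:item/></root>'

seen = []
events = ['start', 'end', 'start-ns', 'end-ns']
for event, payload in iterparse(io.BytesIO(SAMPLE), events=events):
    if event in ('start', 'end'):
        # Element events carry the Element itself.
        seen.append((event, payload.tag))
    else:
        # start-ns carries a (prefix, uri) tuple; end-ns carries None.
        seen.append((event, payload))

for entry in seen:
    print(entry)
```

Namespaced tags are reported in the expanded {uri}name form, so code that compares tag names needs to account for the prefix being replaced by the full URI.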

import csv
import sys
from xml.etree.ElementTree import iterparse

writer = csv.writer(sys.stdout,quoting=csv.QUOTE_NONNUMERIC)
group_name = ''

parsing = iterparse('podcasts.opml',events=['start'])

for (event,node) in parsing:
    if node.tag != 'outline':
        # Ignore anything not part of the outline.
        continue
    if not node.attrib.get('xmlUrl'):
        #Remember the current group.
        group_name = node.attrib['text']
    else:
        #Output a podcast entry.
        writer.writerow(
            (group_name,node.attrib['text'],
             node.attrib['xmlUrl'],
             node.attrib.get('htmlUrl',''))
        )

This conversion program does not need the entire parsed input file to be in memory, and processing each node as it is encountered in the input is more efficient.
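Note that iterparse() still builds the tree as it goes, so for very large inputs a common companion technique is to call clear() on each node once it has been handled. The sketch below shows that pattern on a made-up OPML document (hypothetical data). It listens for end events, because a node's contents are only guaranteed to be complete at that point.

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical sample data, used only for illustration.
SAMPLE = b"""<opml version="1.0">
<body>
  <outline text="Group A">
    <outline text="Feed 1" xmlUrl="http://example.com/feed1.xml"/>
    <outline text="Feed 2" xmlUrl="http://example.com/feed2.xml"/>
  </outline>
</body>
</opml>"""

urls = []
for event, node in iterparse(io.BytesIO(SAMPLE), events=['end']):
    if node.tag != 'outline':
        continue
    url = node.attrib.get('xmlUrl')
    if url:
        urls.append(url)
    # Drop the node's children, attributes, and text now that it has
    # been handled. The emptied node itself stays attached to its parent.
    node.clear()

print(urls)
```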

1.6 create a custom tree builder

A potentially more efficient way to handle parse events is to replace the standard tree builder behavior with a custom implementation. XMLParser uses a TreeBuilder to process the XML and call methods on a target class to save the results. The usual output is an ElementTree instance created by the default TreeBuilder class. Replacing TreeBuilder with another class lets it receive the events before the Element nodes are instantiated, saving that portion of the overhead.

The XML-to-CSV converter can be reimplemented as a tree builder.

import io
import csv
import sys
from xml.etree.ElementTree import XMLParser

class PodcastListToCSV(object):
    def __init__(self,outputFile):
        self.writer = csv.writer(
            outputFile,
            quoting = csv.QUOTE_NONNUMERIC,
        )
        self.group_name = ''

    def start(self,tag,attrib):
        if tag != 'outline':
            # Ignore anything not part of the outline.
            return
        if not attrib.get('xmlUrl'):
            #Remember the current group.
            self.group_name = attrib['text']
        else:
            # Output a podcast entry.
            self.writer.writerow(
                (self.group_name,
                attrib['text'],
                attrib['xmlUrl'],
                attrib.get('htmlUrl',''))
            )
    def end(self,tag):
        "Ignore closing tags"
    def data(self,data):
        "Ignore data inside nodes"
    def close(self):
        "Nothing special to do here"

target = PodcastListToCSV(sys.stdout)
parser = XMLParser(target=target)
with open('podcasts.opml','rt') as f:
    for line in f:
        parser.feed(line)
parser.close()

PodcastListToCSV implements the TreeBuilder protocol. Each time a new XML tag is encountered, start() is called with the tag name and attributes. When a closing tag is seen, end() is called with that tag name. In between, data() is called when a node has content (the tree builder is expected to keep track of the "current" node). When all of the input has been processed, close() is called. It may return a value, which is passed back to the caller of the parser's close() method.
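The return value of close() is easiest to see with a target that accumulates a result. The sketch below uses a hypothetical TagCounter class (not part of the standard library or the article) that counts start tags and returns the totals from close(). The input is a made-up string.

```python
from collections import Counter
from xml.etree.ElementTree import XMLParser


class TagCounter:
    """Hypothetical parser target that counts start tags."""

    def __init__(self):
        self.counts = Counter()

    def start(self, tag, attrib):
        self.counts[tag] += 1

    def end(self, tag):
        pass  # closing tags are not needed for counting

    def data(self, data):
        pass  # text content is ignored

    def close(self):
        # Whatever is returned here comes back from parser.close().
        return dict(self.counts)


parser = XMLParser(target=TagCounter())
parser.feed('<opml><body><outline><outline/></outline><outline/></body></opml>')
result = parser.close()
print(result)
```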

1.7 construct documents with element nodes

In addition to its parsing capabilities, xml.etree.ElementTree also supports creating well-formed XML documents from Element objects constructed in an application. The Element class used when a document is parsed also knows how to generate a serialized form of its contents, which can then be written to a file or other data stream.

Three helper functions are useful for creating a hierarchy of Element nodes. Element() creates a standard node, SubElement() attaches a new node to a parent, and Comment() creates a node that serializes using XML's comment syntax.

from xml.etree.ElementTree import Element,SubElement,Comment,tostring

top = Element('top')

comment = Comment('Generated for PyMOTW')
top.append(comment)

child = SubElement(top,'child')
child.text = 'This child contains text.'

child_with_tail = SubElement(top,'child_with_tail')
child_with_tail.text = 'This child has text.'
child_with_tail.tail = 'And "tail" text.'

child_with_entity_ref = SubElement(top,'child_with_entity_ref')
child_with_entity_ref.text = 'This & that'

print(tostring(top))

The output contains only the XML nodes in the tree, not the XML declaration with version and encoding.
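When the declaration is wanted, one option is to wrap the root node in an ElementTree and use its write() method with xml_declaration=True. The sketch below writes to an in-memory buffer rather than a real file. The buffer and the tiny document are illustrative assumptions.

```python
import io
from xml.etree.ElementTree import Element, ElementTree, SubElement

top = Element('top')
child = SubElement(top, 'child')
child.text = 'This child contains text.'

# write() accepts a file name or a file-like object; BytesIO keeps
# this example self-contained.
buffer = io.BytesIO()
ElementTree(top).write(buffer, encoding='utf-8', xml_declaration=True)

document = buffer.getvalue().decode('utf-8')
print(document)
```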

1.8 pretty-printing XML

ElementTree makes no effort to format the output of tostring() to improve readability, because adding extra whitespace changes the contents of the document. To make the output easier to read, the following example re-parses the XML with xml.dom.minidom and then uses its toprettyxml() method.

from xml.etree import ElementTree
from xml.dom import minidom
from xml.etree.ElementTree import Element,SubElement,Comment,tostring

def prettify(elem):
    """
    Return a pretty-printed XML string for the Element.
    """
    rough_string = ElementTree.tostring(elem,'utf-8')
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent="  ")

top = Element('top')

comment = Comment('Generated for PyMOTW')
top.append(comment)

child = SubElement(top,'child')
child.text = 'This child contains text.'

child_with_tail = SubElement(top,'child_with_tail')
child_with_tail.text = 'This child has text.'
child_with_tail.tail = 'And "tail" text.'

child_with_entity_ref = SubElement(top,'child_with_entity_ref')
child_with_entity_ref.text = 'This & that'

print(prettify(top))

The output becomes more readable.

In addition to the extra whitespace for formatting, the xml.dom.minidom pretty-printer also adds an XML declaration to the output.
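On Python 3.9 and later, xml.etree.ElementTree also provides an indent() function that adds the whitespace in place, without a round trip through minidom. The sketch below assumes such a Python version. Like any pretty-printing, it changes the text and tail values of the nodes.

```python
from xml.etree.ElementTree import Element, SubElement, indent, tostring

top = Element('top')
first = SubElement(top, 'child')
first.text = 'This child contains text.'
second = SubElement(top, 'other_child')
second.text = 'So does this one.'

# indent() requires Python 3.9+. It modifies the tree in place by
# inserting newlines and spaces into text and tail values.
indent(top, space='  ')

pretty = tostring(top, encoding='unicode')
print(pretty)
```

Unlike toprettyxml(), this approach does not add an XML declaration on its own.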