Python XML parsing



What is XML?

XML refers to Extensible Markup Language (eXtensible Markup Language). You can learn XML tutorials through this site

XML is designed to transmit and store data.

XML is a set of rules that define semantic tags that divide a document into components and identify these components.

It is also a meta-markup language, that is, it defines a syntactic language for defining other semantic and structured markup languages ​​related to specific fields.


Python's parsing of XML

Common XML programming interfaces include DOM and SAX. These two interfaces process XML files in different ways, and of course the usage scenarios are also different.

Python has three methods to parse XML, SAX, DOM, and ElementTree:

1.SAX (simple API for XML)

python standard library contains SAX parser, SAX Using the event-driven model, XML files are processed by triggering events one by one and calling user-defined callback functions during the process of parsing XML.

2.DOM (Document Object Model)

Parses XML data into a tree in memory, and operates XML by operating on the tree.

3.ElementTree (Element Tree)

ElementTree is like a lightweight DOM with a convenient and friendly API. The code has good usability, is fast, and consumes less memory.

Note: Because DOM needs to map XML data to a tree in memory, firstly, it is relatively slow, and secondly, it consumes more memory, while SAX streaming reading of XML files is faster. It takes up less memory, but requires the user to implement a callback function (handler).

The content of the XML example file movies.xml used in this chapter is as follows:

<collection shelf="New Arrivals">
<movie title="Enemy Behind">
   <type>War, Thriller</type>
   <format>DVD</format>
   <year>2003</year>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
   <type>Anime, Science Fiction</type>
   <format>DVD</format>
   <year>1989</year>
   <rating>R</rating>
   <stars>8</stars>
   <description>A schientific fiction</description>
</movie>
   <movie title="Trigun">
   <type>Anime, Action</type>
   <format>DVD</format>
   <episodes>4</episodes>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Vash the Stampede!</description>
</movie>
<movie title="Ishtar">
   <type>Comedy</type>
   <format>VHS</format>
   <rating>PG</rating>
   <stars>2</stars>
   <description>Viewable boredom</description>
</movie>
</collection>

python uses SAX to parse xml

SAX is an event-driven API .

Using SAX to parse XML documents involves two parts: the parser and the event handler.

The parser is responsible for reading the XML document and sending events to the event processor, such as element start and element end events;

The event processor is responsible for responding to the event and passing the XML data is processed.

<psax is suitable for handling the following problems:< p="">
  • 1. Processing of large files;

  • 2. Only part of the content of the file is required, or only specific information is required from the file.

  • 3. When you want to build your own object model.

To use sax to process xml in python, you must first introduce the parse function in xml.sax and the ContentHandler in xml.sax.handler.

ContentHandler class method introduction

characters(content) method

Calling timing:

Start from the line, before encountering the label , there are characters, and the value of content is these strings.

From one label, there are characters before encountering the next label, and the value of content is these strings.

From a label, there are characters before encountering the line terminator, and the value of content is these strings. The

tag can be a start tag or an end tag.

startDocument() method

Called when the document is started.

endDocument() method

Called when the parser reaches the end of the document.

startElement(name, attrs) method

Called when an XML start tag is encountered, name is the name of the tag, and attrs is the attribute value dictionary of the tag.

endElement(name) method

Called when an XML end tag is encountered.


make_parser method

The following method creates a new parser object and returns it.

xml.sax.make_parser( [parser_list] )

Parameter description:

  • parser_list - Optional parameters, parser list


parser method

The following method creates a SAX parser and parses the xml document:

xml.sax.parse( xmlfile, contenthandler[, errorhandler])

Parameters Description:

  • xmlfile - xml file name

  • contenthandler - Must be a ContentHandler object

  • ##errorhandler - If this parameter is specified, errorhandler must be a SAX ErrorHandler object


parseString method

The parseString method creates an XML parser and parses the xml string:

xml.sax.parseString(xmlstring, contenthandler[, errorhandler])

Parameter description:

  • ##xmlstring

    - xml string

  • contenthandler

    - Must be a ContentHandler object

  • ##errorhandler
  • - If this parameter is specified, errorhandler must be a SAX ErrorHandler object

  • Python parsing XML instance
#!/usr/bin/python
# -*- coding: UTF-8 -*-

import xml.sax

class MovieHandler( xml.sax.ContentHandler ):
   def __init__(self):
      self.CurrentData = ""
      self.type = ""
      self.format = ""
      self.year = ""
      self.rating = ""
      self.stars = ""
      self.description = ""

   # 元素开始事件处理
   def startElement(self, tag, attributes):
      self.CurrentData = tag
      if tag == "movie":
         print "*****Movie*****"
         title = attributes["title"]
         print "Title:", title

   # 元素结束事件处理
   def endElement(self, tag):
      if self.CurrentData == "type":
         print "Type:", self.type
      elif self.CurrentData == "format":
         print "Format:", self.format
      elif self.CurrentData == "year":
         print "Year:", self.year
      elif self.CurrentData == "rating":
         print "Rating:", self.rating
      elif self.CurrentData == "stars":
         print "Stars:", self.stars
      elif self.CurrentData == "description":
         print "Description:", self.description
      self.CurrentData = ""

   # 内容事件处理
   def characters(self, content):
      if self.CurrentData == "type":
         self.type = content
      elif self.CurrentData == "format":
         self.format = content
      elif self.CurrentData == "year":
         self.year = content
      elif self.CurrentData == "rating":
         self.rating = content
      elif self.CurrentData == "stars":
         self.stars = content
      elif self.CurrentData == "description":
         self.description = content
  
if ( __name__ == "__main__"):
   
   # 创建一个 XMLReader
   parser = xml.sax.make_parser()
   # turn off namepsaces
   parser.setFeature(xml.sax.handler.feature_namespaces, 0)

   # 重写 ContextHandler
   Handler = MovieHandler()
   parser.setContentHandler( Handler )
   
   parser.parse("movies.xml")

The execution result of the above code is as follows:

*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Year: 2003
Rating: PG
Stars: 10
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Year: 1989
Rating: R
Stars: 8
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Stars: 10
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Stars: 2
Description: Viewable boredom

For complete SAX API documentation, please refer to Python SAX APIs

Use xml.dom to parse xml


Document Object Model (DOM), which is the processing recommended by the W3C organization A standard programming interface for extensible markup languages.

When a DOM parser parses an XML document, it reads the entire document at once and saves all the elements in the document in a tree structure in memory. After that, you can use the different functions provided by the DOM. To read or modify the content and structure of the document, you can also write the modified content into an xml file.

xml.dom.minidom is used in python to parse xml files. The example is as follows:

#!/usr/bin/python
# -*- coding: UTF-8 -*-

from xml.dom.minidom import parse
import xml.dom.minidom

# 使用minidom解析器打开 XML 文档
DOMTree = xml.dom.minidom.parse("movies.xml")
collection = DOMTree.documentElement
if collection.hasAttribute("shelf"):
   print "Root element : %s" % collection.getAttribute("shelf")

# 在集合中获取所有电影
movies = collection.getElementsByTagName("movie")

# 打印每部电影的详细信息
for movie in movies:
   print "*****Movie*****"
   if movie.hasAttribute("title"):
      print "Title: %s" % movie.getAttribute("title")

   type = movie.getElementsByTagName('type')[0]
   print "Type: %s" % type.childNodes[0].data
   format = movie.getElementsByTagName('format')[0]
   print "Format: %s" % format.childNodes[0].data
   rating = movie.getElementsByTagName('rating')[0]
   print "Rating: %s" % rating.childNodes[0].data
   description = movie.getElementsByTagName('description')[0]
   print "Description: %s" % description.childNodes[0].data

The execution results of the above program are as follows:

Root element : New Arrivals
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Rating: PG
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Rating: R
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Description: Viewable boredom

For complete DOM API documentation, please refer to Python DOM APIs.