Python XML parsing
What is XML?
XML refers to Extensible Markup Language (eXtensible Markup Language). You can learn XML tutorials through this site
XML is designed to transmit and store data.
XML is a set of rules that define semantic tags that divide a document into components and identify these components.
It is also a meta-markup language, that is, it defines a syntactic language for defining other semantic and structured markup languages related to specific fields.
Python's parsing of XML
Common XML programming interfaces include DOM and SAX. These two interfaces process XML files in different ways, and of course the usage scenarios are also different.
Python has three methods to parse XML, SAX, DOM, and ElementTree:
1.SAX (simple API for XML)
python standard library contains SAX parser, SAX Using the event-driven model, XML files are processed by triggering events one by one and calling user-defined callback functions during the process of parsing XML.
2.DOM (Document Object Model)
Parses XML data into a tree in memory, and operates XML by operating on the tree.
3.ElementTree (Element Tree)
ElementTree is like a lightweight DOM with a convenient and friendly API. The code has good usability, is fast, and consumes less memory.
Note: Because DOM needs to map XML data to a tree in memory, firstly, it is relatively slow, and secondly, it consumes more memory, while SAX streaming reading of XML files is faster. It takes up less memory, but requires the user to implement a callback function (handler).
The content of the XML example file movies.xml used in this chapter is as follows:
<collection shelf="New Arrivals"> <movie title="Enemy Behind"> <type>War, Thriller</type> <format>DVD</format> <year>2003</year> <rating>PG</rating> <stars>10</stars> <description>Talk about a US-Japan war</description> </movie> <movie title="Transformers"> <type>Anime, Science Fiction</type> <format>DVD</format> <year>1989</year> <rating>R</rating> <stars>8</stars> <description>A schientific fiction</description> </movie> <movie title="Trigun"> <type>Anime, Action</type> <format>DVD</format> <episodes>4</episodes> <rating>PG</rating> <stars>10</stars> <description>Vash the Stampede!</description> </movie> <movie title="Ishtar"> <type>Comedy</type> <format>VHS</format> <rating>PG</rating> <stars>2</stars> <description>Viewable boredom</description> </movie> </collection>
python uses SAX to parse xml
SAX is an event-driven API .
Using SAX to parse XML documents involves two parts: the parser and the event handler.
The parser is responsible for reading the XML document and sending events to the event processor, such as element start and element end events;
The event processor is responsible for responding to the event and passing the XML data is processed.
<psax is suitable for handling the following problems:< p="">1. Processing of large files;
2. Only part of the content of the file is required, or only specific information is required from the file.
3. When you want to build your own object model.
To use sax to process xml in python, you must first introduce the parse function in xml.sax and the ContentHandler in xml.sax.handler.
ContentHandler class method introduction
characters(content) method
Calling timing:
Start from the line, before encountering the label , there are characters, and the value of content is these strings.
From one label, there are characters before encountering the next label, and the value of content is these strings.
From a label, there are characters before encountering the line terminator, and the value of content is these strings. The
tag can be a start tag or an end tag.
startDocument() method
Called when the document is started.
endDocument() method
Called when the parser reaches the end of the document.
startElement(name, attrs) method
Called when an XML start tag is encountered, name is the name of the tag, and attrs is the attribute value dictionary of the tag.
endElement(name) method
Called when an XML end tag is encountered.
make_parser method
The following method creates a new parser object and returns it.
xml.sax.make_parser( [parser_list] )
Parameter description:
parser_list - Optional parameters, parser list
parser method
The following method creates a SAX parser and parses the xml document:
xml.sax.parse( xmlfile, contenthandler[, errorhandler])
Parameters Description:
xmlfile - xml file name
contenthandler - Must be a ContentHandler object
##errorhandler - If this parameter is specified, errorhandler must be a SAX ErrorHandler object
parseString method The parseString method creates an XML parser and parses the xml string:
xml.sax.parseString(xmlstring, contenthandler[, errorhandler])Parameter description:
- ##xmlstring
- xml string
- contenthandler
- Must be a ContentHandler object
##errorhandler - -
If this parameter is specified, errorhandler must be a SAX ErrorHandler object
#!/usr/bin/python # -*- coding: UTF-8 -*- import xml.sax class MovieHandler( xml.sax.ContentHandler ): def __init__(self): self.CurrentData = "" self.type = "" self.format = "" self.year = "" self.rating = "" self.stars = "" self.description = "" # 元素开始事件处理 def startElement(self, tag, attributes): self.CurrentData = tag if tag == "movie": print "*****Movie*****" title = attributes["title"] print "Title:", title # 元素结束事件处理 def endElement(self, tag): if self.CurrentData == "type": print "Type:", self.type elif self.CurrentData == "format": print "Format:", self.format elif self.CurrentData == "year": print "Year:", self.year elif self.CurrentData == "rating": print "Rating:", self.rating elif self.CurrentData == "stars": print "Stars:", self.stars elif self.CurrentData == "description": print "Description:", self.description self.CurrentData = "" # 内容事件处理 def characters(self, content): if self.CurrentData == "type": self.type = content elif self.CurrentData == "format": self.format = content elif self.CurrentData == "year": self.year = content elif self.CurrentData == "rating": self.rating = content elif self.CurrentData == "stars": self.stars = content elif self.CurrentData == "description": self.description = content if ( __name__ == "__main__"): # 创建一个 XMLReader parser = xml.sax.make_parser() # turn off namepsaces parser.setFeature(xml.sax.handler.feature_namespaces, 0) # 重写 ContextHandler Handler = MovieHandler() parser.setContentHandler( Handler ) parser.parse("movies.xml")
The execution result of the above code is as follows:
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Year: 2003
Rating: PG
Stars: 10
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Year: 1989
Rating: R
Stars: 8
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Stars: 10
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Stars: 2
Description: Viewable boredom
For complete SAX API documentation, please refer to Python SAX APIsUse xml.dom to parse xml
Document Object Model (DOM), which is the processing recommended by the W3C organization A standard programming interface for extensible markup languages. When a DOM parser parses an XML document, it reads the entire document at once and saves all the elements in the document in a tree structure in memory. After that, you can use the different functions provided by the DOM. To read or modify the content and structure of the document, you can also write the modified content into an xml file. xml.dom.minidom is used in python to parse xml files. The example is as follows:
#!/usr/bin/python # -*- coding: UTF-8 -*- from xml.dom.minidom import parse import xml.dom.minidom # 使用minidom解析器打开 XML 文档 DOMTree = xml.dom.minidom.parse("movies.xml") collection = DOMTree.documentElement if collection.hasAttribute("shelf"): print "Root element : %s" % collection.getAttribute("shelf") # 在集合中获取所有电影 movies = collection.getElementsByTagName("movie") # 打印每部电影的详细信息 for movie in movies: print "*****Movie*****" if movie.hasAttribute("title"): print "Title: %s" % movie.getAttribute("title") type = movie.getElementsByTagName('type')[0] print "Type: %s" % type.childNodes[0].data format = movie.getElementsByTagName('format')[0] print "Format: %s" % format.childNodes[0].data rating = movie.getElementsByTagName('rating')[0] print "Rating: %s" % rating.childNodes[0].data description = movie.getElementsByTagName('description')[0] print "Description: %s" % description.childNodes[0].dataThe execution results of the above program are as follows:
Root element : New Arrivals *****Movie***** Title: Enemy Behind Type: War, Thriller Format: DVD Rating: PG Description: Talk about a US-Japan war *****Movie***** Title: Transformers Type: Anime, Science Fiction Format: DVD Rating: R Description: A schientific fiction *****Movie***** Title: Trigun Type: Anime, Action Format: DVD Rating: PG Description: Vash the Stampede! *****Movie***** Title: Ishtar Type: Comedy Format: VHS Rating: PG Description: Viewable boredomFor complete DOM API documentation, please refer to Python DOM APIs.