Analyze several ways Python parses XML

巴扎黑

Sep 19, 2017 am 10:20 AM


When I first learned Python, the only XML parsing methods I knew were DOM and SAX, and neither was efficient enough for my needs. I had a large number of files to process, and both methods took unacceptably long.

After some searching, I found ElementTree, which is widely used, relatively efficient, and recommended by many people, so I included it in a measured comparison. ElementTree can be used in two ways: ordinary ElementTree parsing (ET) and ElementTree.iterparse (ET_iter).

This article compares the four methods, DOM, SAX, ET, and ET_iter, side by side, and evaluates the efficiency of each by the time it takes to process the same set of files.

In the program, each of the four parsing methods is written as a function, and the main program calls them one at a time to measure their parsing efficiency.
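The measurement approach described above can be sketched as a small timing harness. This is only an illustration written for Python 3: `time_parser` and `fake_parser` are hypothetical names that do not appear in the original program, and the stand-in parser merely mimics the `(csv_text, row_count)` return shape of the real functions.

```python
import time

def time_parser(parser, paths):
    """Run `parser` over every path; return (rows, seconds, rows per second)."""
    start = time.perf_counter()
    rows = 0
    for path in paths:
        _text, cnt = parser(path)  # each parser returns (csv_text, row_count)
        rows += cnt
    # Guard against a zero reading on very fast runs.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return rows, elapsed, rows / elapsed

def fake_parser(path):
    # Hypothetical stand-in for dom_parser, sax_parser, ET_parser or ET_parser_iter.
    return ("a,b\r\n", 1)

rows, elapsed, rate = time_parser(fake_parser, ["one.xml.gz", "two.xml.gz"])
print("VS row count: %d, running time: %f, rows processed per second: %d."
      % (rows, elapsed, rate))
```

Passing each of the four parser functions to the same harness with the same file list keeps the comparison fair, since everything except the parser stays constant.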

An example of the decompressed XML file content:

[Figure: sample of the decompressed XML file content (image not available)]

The part of the main program that calls the parsers:

  print("File count: %d/%d." % (gz_cnt,paser_num))
  str_s,cnt = dom_parser(gz)
  #str_s,cnt = sax_parser(gz)
  #str_s,cnt = ET_parser(gz)
  #str_s,cnt = ET_parser_iter(gz)
  output.write(str_s)
  vs_cnt += cnt

Each parser function returns two values. In the first version of the program, each value was obtained through a separate call, so every function ran twice per file. The call was later changed to receive both return values from a single call, which removed the redundant work.
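The difference between the two calling styles can be shown with a short sketch. The `parse` function here is a hypothetical stand-in that only counts how often it runs; it is not one of the article's parsers.

```python
calls = 0

def parse(path):
    """Hypothetical parser stand-in that returns (csv_text, row_count)."""
    global calls
    calls += 1
    return ("row\r\n", 1)

# Wasteful: one call per value, so the file is parsed twice.
text = parse("sample.xml.gz")[0]
count = parse("sample.xml.gz")[1]
calls_twice = calls

calls = 0
# Better: a single call, with both values unpacked at once.
text, count = parse("sample.xml.gz")
calls_once = calls

print(calls_twice, calls_once)  # prints "2 1"
```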

1. DOM parsing

Function definition code:

def dom_parser(gz):
  import os,gzip,cStringIO  # cStringIO is Python 2 only; io.StringIO is the Python 3 equivalent
  import xml.dom.minidom

  vs_cnt = 0
  str_s = ''
  file_io = cStringIO.StringIO()
  xm = gzip.open(gz,'rb')
  print("Read in:%s.\nParsing:" % (os.path.abspath(gz)))
  doc = xml.dom.minidom.parseString(xm.read())
  bulkPmMrDataFile = doc.documentElement
  # Read the child elements
  enbs = bulkPmMrDataFile.getElementsByTagName("eNB")
  measurements = enbs[0].getElementsByTagName("measurement")
  objects = measurements[0].getElementsByTagName("object")
  # Build one space-separated row per v element
  for obj in objects:  # renamed from "object" to avoid shadowing the builtin
    vs = obj.getElementsByTagName("v")
    vs_cnt += len(vs)
    for v in vs:
      file_io.write(enbs[0].getAttribute("id")+' '+obj.getAttribute("id")+' '+\
      obj.getAttribute("MmeUeS1apId")+' '+obj.getAttribute("MmeGroupId")+' '+obj.getAttribute("MmeCode")+' '+\
      obj.getAttribute("TimeStamp")+' '+v.childNodes[0].data+'\n') # text value of the v element
  # Convert the rows to CSV: CRLF line ends, commas, 'T' in timestamps to space, NIL to empty
  str_s = file_io.getvalue().replace(' \n','\r\n').replace(' ',',').replace('T',' ').replace('NIL','')
  xm.close()
  file_io.close()
  return (str_s,vs_cnt)

Program running result:

**************************************************

Program processing starts.

The input directory is:/tmcdata/mro2csv/input31/.

The output directory is:/tmcdata/mro2csv/output31/.

The number of .gz files in the input directory is: 12, 12 of them will be processed this time.

**************************************************

File count: 1/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.

Parsing:

File count: 2/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.

Parsing:

File count: 3/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.

Parsing:

……………………………………………………

File count: 12/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.

Parsing:

VS row count: 177849, running time: 107.077867, rows processed per second: 1660.

Written:/tmcdata/mro2csv/output31/mro_0001.csv.

**************************************************

Program processing ends.

DOM parsing reads the entire file into memory and builds a tree structure, so both its memory use and its running time are relatively high. Its advantage is simple logic: no callback functions need to be defined, so it is easy to implement.
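To make the "build the whole tree, then query it" pattern concrete, here is a minimal Python 3 sketch using `xml.dom.minidom`. The `SAMPLE` document is an assumption: its element and attribute names are inferred from the parser code in this article, and the values are invented, because the original sample file is not available.

```python
import xml.dom.minidom

# Assumed miniature input; structure inferred from the article's code.
SAMPLE = """<bulkPmMrDataFile>
  <fileHeader/>
  <eNB id="1001">
    <measurement>
      <object id="7" TimeStamp="2016-02-24T06:00:00.000">
        <v>10 NIL</v>
        <v>12 5</v>
      </object>
    </measurement>
  </eNB>
</bulkPmMrDataFile>"""

# The whole document is parsed into a tree before any data is read.
doc = xml.dom.minidom.parseString(SAMPLE)
enb = doc.documentElement.getElementsByTagName("eNB")[0]

rows = []
for obj in enb.getElementsByTagName("object"):
    for v in obj.getElementsByTagName("v"):
        # childNodes[0] is the text node inside the v element.
        rows.append((enb.getAttribute("id"), obj.getAttribute("id"),
                     v.childNodes[0].data))

print(rows)
```

No callbacks are involved: once the tree exists, the data is reached with plain lookups, which is where both the simplicity and the memory cost come from.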

2. SAX parsing

Function definition code:

def sax_parser(gz):
  import os,gzip,cStringIO  # cStringIO is Python 2 only; io.StringIO is the Python 3 equivalent
  from xml.parsers.expat import ParserCreate
  file_io = cStringIO.StringIO()

  # SAX-style handler; the parsing state lives on the instance instead of in globals,
  # so the row count returned below is the one the callbacks actually updated
  class DefaultSaxHandler(object):
    def __init__(self):
      self.d_eNB = {}
      self.d_obj = {}
      self.vs_cnt = 0
      self.flag = False
    # Handle a start tag
    def start_element(self, name, attrs):
      if name == 'eNB':
        self.d_eNB = attrs
      elif name == 'object':
        self.d_obj = attrs
      elif name == 'v':
        d_eNB = self.d_eNB
        d_obj = self.d_obj
        file_io.write(d_eNB['id']+' '+ d_obj['id']+' '+d_obj['MmeUeS1apId']+' '+d_obj['MmeGroupId']+' '+d_obj['MmeCode']+' '+d_obj['TimeStamp']+' ')
        self.vs_cnt += 1
    # Handle the text between tags
    def char_data(self, text):
      if text[0:1].isnumeric():
        file_io.write(text)
      elif text[0:17] == 'MR.LteScPlrULQci1':
        self.flag = True  # signal the main loop to stop feeding the parser
    # Handle an end tag
    def end_element(self, name):
      if name == 'v':
        file_io.write('\n')

  # Attach the handler callbacks to an expat parser
  handler = DefaultSaxHandler()
  parser = ParserCreate()
  parser.StartElementHandler = handler.start_element
  parser.EndElementHandler = handler.end_element
  parser.CharacterDataHandler = handler.char_data
  str_s = ''
  xm = gzip.open(gz,'rb')
  print("Read in:%s.\nParsing:" % (os.path.abspath(gz)))
  for line in xm.readlines():
    parser.Parse(line) # feed the XML to the parser one line at a time
    if handler.flag:
      break
  # Convert the collected rows to CSV
  str_s = file_io.getvalue().replace(' \n','\r\n').replace(' ',',').replace('T',' ').replace('NIL','')
  xm.close()
  file_io.close()
  return (str_s,handler.vs_cnt)

Program running result:

**************************************************

Program processing starts.

The input directory is:/tmcdata/mro2csv/input31/.

The output directory is:/tmcdata/mro2csv/output31/.

The number of .gz files in the input directory is: 12, 12 of them will be processed this time.

**************************************************

File count: 1/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.

Parsing:

File count: 2/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.

Parsing:

File count: 3/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.

Parsing:

........................................

File count: 12/12.

Read in: /tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.

Parsing:

VS row count: 177849, running time: 14.386779, rows processed per second: 12361.

Written:/tmcdata/mro2csv/output31/mro_0001.csv.

**************************************************

The program processing ends.

SAX parsing runs significantly faster than DOM parsing. Because SAX handles the document as a stream of events (here the file is fed to the parser line by line) instead of building a tree, it uses less memory on large files, which is why it is widely used. The disadvantage is that you have to implement the callback functions yourself, so the logic is relatively complicated.
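The callback style can be illustrated with a compact Python 3 sketch built on `xml.parsers.expat`. The `RowHandler` class and the `SAMPLE` bytes are assumptions made for this example (the sample is a cut-down imitation of the input, not the real file); the parser attributes and the `Parse` method are the standard expat interface.

```python
from xml.parsers.expat import ParserCreate

# Assumed miniature input, far smaller than the real files.
SAMPLE = b'<eNB id="1001"><object id="7"><v>10 20</v><v>30 40</v></object></eNB>'

class RowHandler:
    """Keeps the parsing state on the instance rather than in globals."""
    def __init__(self):
        self.enb = {}
        self.obj = {}
        self.text = []
        self.rows = []

    def start(self, name, attrs):
        if name == "eNB":
            self.enb = attrs
        elif name == "object":
            self.obj = attrs
        elif name == "v":
            self.text = []  # start collecting the text of this v element

    def chars(self, data):
        # Text may arrive in several pieces, so collect and join later.
        self.text.append(data)

    def end(self, name):
        if name == "v":
            self.rows.append((self.enb["id"], self.obj["id"], "".join(self.text)))

handler = RowHandler()
parser = ParserCreate()
parser.StartElementHandler = handler.start
parser.EndElementHandler = handler.end
parser.CharacterDataHandler = handler.chars

# Feed the document in small chunks to imitate streaming input.
for i in range(0, len(SAMPLE), 16):
    parser.Parse(SAMPLE[i:i + 16], False)
parser.Parse(b"", True)  # signal the end of the document

print(handler.rows)
```

Only the current element's attributes and text are held at any time, which is why the memory footprint stays small even though more bookkeeping code is needed.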

3. ET parsing

Function definition code:

def ET_parser(gz):
  import os,gzip,cStringIO  # cStringIO is Python 2 only; io.StringIO is the Python 3 equivalent
  import xml.etree.cElementTree as ET  # removed in Python 3.9; use xml.etree.ElementTree there
  vs_cnt = 0
  str_s = ''
  file_io = cStringIO.StringIO()
  xm = gzip.open(gz,'rb')
  print("Read in:%s.\nParsing:" % (os.path.abspath(gz)))
  tree = ET.ElementTree(file=xm)
  root = tree.getroot()
  # root[1] supplies the eNB id; root[1][0] is its first child, which holds the object elements
  for elem in root[1][0].findall('object'):
    for v in elem.findall('v'):
      file_io.write(root[1].attrib['id']+' '+elem.attrib['TimeStamp']+' '+elem.attrib['MmeCode']+' '+\
      elem.attrib['id']+' '+ elem.attrib['MmeUeS1apId']+' '+ elem.attrib['MmeGroupId']+' '+ v.text+'\n')
      vs_cnt += 1  # count every v row, as the other parsers do
  # Convert the collected rows to CSV
  str_s = file_io.getvalue().replace(' \n','\r\n').replace(' ',',').replace('T',' ').replace('NIL','')
  xm.close()
  file_io.close()
  return (str_s,vs_cnt)

Program running result:

**************************************************

Program processing starts.

The input directory is:/tmcdata/mro2csv/input31/.

The output directory is:/tmcdata/mro2csv/output31/.

The number of .gz files in the input directory is: 12, 12 of them will be processed this time.

**************************************************

File count: 1/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.

Parsing:

File count: 2/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.

Parsing:

File count: 3/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.

Parsing:

...........................................

File count: 12/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.

Parsing:

VS row count: 177849, running time: 4.308103, rows processed per second: 41282.

Written:/tmcdata/mro2csv/output31/mro_0001.csv.

**************************************************

Program processing ends.

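Compared with SAX parsing, ET parsing takes even less time, and the function is also simpler to implement. ET combines DOM-like simple logic with parsing efficiency that rivals SAX, which makes it the preferred choice for XML parsing at present.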
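As a small illustration of why the ET version stays short, the following Python 3 sketch extracts the same kind of rows with `xml.etree.ElementTree`. The `SAMPLE` string is an assumed miniature of the input (structure inferred from the article's code, values invented), and the lookups use element names rather than the positional indexing of the function above.

```python
import xml.etree.ElementTree as ET

# Assumed miniature input; structure inferred from the article's code.
SAMPLE = """<bulkPmMrDataFile><fileHeader/><eNB id="1001"><measurement>
<object id="7" TimeStamp="2016-02-24T06:00:00.000"><v>10 20</v><v>30 40</v></object>
</measurement></eNB></bulkPmMrDataFile>"""

root = ET.fromstring(SAMPLE)
enb = root.find("eNB")

# Elements behave like lists of children and carry attributes directly,
# so the extraction is a pair of nested loops with no callbacks.
rows = [
    (enb.get("id"), obj.get("id"), v.text)
    for obj in enb.iter("object")
    for v in obj.findall("v")
]

print(rows)
```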

4. ET_iter parsing

Function definition code:

def ET_parser_iter(gz):
  import os,gzip,cStringIO  # cStringIO is Python 2 only; io.StringIO is the Python 3 equivalent
  import xml.etree.cElementTree as ET  # removed in Python 3.9; use xml.etree.ElementTree there
  vs_cnt = 0
  str_s = ''
  file_io = cStringIO.StringIO()
  xm = gzip.open(gz,'rb')
  print("Read in:%s.\nParsing:" % (os.path.abspath(gz)))
  d_eNB = {}
  d_obj = {}
  i = 0
  for event,elem in ET.iterparse(xm,events=('start','end')):
    if i >= 2:
      break  # stop once two smr elements have ended
    elif event == 'start':
      if elem.tag == 'eNB':
        d_eNB = elem.attrib
      elif elem.tag == 'object':
        d_obj = elem.attrib
    elif event == 'end' and elem.tag == 'smr':
      i += 1
    elif event == 'end' and elem.tag == 'v':
      file_io.write(d_eNB['id']+' '+d_obj['TimeStamp']+' '+d_obj['MmeCode']+' '+d_obj['id']+' '+\
      d_obj['MmeUeS1apId']+' '+ d_obj['MmeGroupId']+' '+str(elem.text)+'\n')
      vs_cnt += 1
      elem.clear()  # release the finished v element to keep memory use low
  # Convert the collected rows to CSV
  str_s = file_io.getvalue().replace(' \n','\r\n').replace(' ',',').replace('T',' ').replace('NIL','')
  xm.close()
  file_io.close()
  return (str_s,vs_cnt)

Program running result:

**************************************************

Program processing starts.

The input directory is:/tmcdata/mro2csv/input31/.

The output directory is:/tmcdata/mro2csv/output31/.

The number of .gz files in the input directory is: 12, 12 of them will be processed this time.

**************************************************

File count: 1/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.

Parsing:

File count: 2/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.

Parsing:

File count: 3/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.

Parsing:

...................................................

File count: 12/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.

Parsing:

VS row count: 177849, running time: 3.043805, rows processed per second: 58429.

Written:/tmcdata/mro2csv/output31/mro_0001.csv.

**************************************************

Program processing ends.

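With ET_iter, parsing throughput is about 40% higher than with ET (58,429 versus 41,282 rows per second) and roughly 35 times that of DOM parsing (1,660 rows per second). In addition to the speed gain, memory usage is also relatively low, because iterparse processes the document sequentially.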
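The sequential pattern behind iterparse can be sketched in a few lines of Python 3. The `SAMPLE` bytes are an assumed miniature of the input (structure inferred from the article's code, values invented). Unlike the function above, this sketch copies the object attributes and clears each object element when it ends; that is one possible variation, not the author's exact approach.

```python
import io
import xml.etree.ElementTree as ET

# Assumed miniature input; structure inferred from the article's code.
SAMPLE = b"""<bulkPmMrDataFile><eNB id="1001"><measurement>
<object id="7"><v>10 20</v><v>30 40</v></object>
<object id="8"><v>50 60</v></object>
</measurement></eNB></bulkPmMrDataFile>"""

enb_id = None
obj_attrs = {}
rows = []

for event, elem in ET.iterparse(io.BytesIO(SAMPLE), events=("start", "end")):
    if event == "start":
        # Attributes are already available when the start tag is seen.
        if elem.tag == "eNB":
            enb_id = elem.get("id")
        elif elem.tag == "object":
            obj_attrs = dict(elem.attrib)  # copy, because clear() empties elem.attrib
    elif elem.tag == "v":
        # The element text is complete only at the end event.
        rows.append((enb_id, obj_attrs["id"], elem.text))
    elif elem.tag == "object":
        elem.clear()  # drop the finished subtree so memory use stays flat

print(rows)
```

Rows are produced while the file is still being read, and finished elements are discarded immediately, so the full tree never has to exist in memory at once.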
