Home >Backend Development >Python Tutorial >Analyze several ways Python parses XML

Analyze several ways Python parses XML

巴扎黑
巴扎黑Original
2017-09-19 10:20:272656browse

When I first learned PYTHON, I only knew that there were two parsing methods, DOM and SAX, but their efficiency was not ideal. Due to the large number of files that needed to be processed, these two methods were too time-consuming and unacceptable.

After searching on the Internet, I found that ElementTree, which is currently widely used and relatively efficient, is also an algorithm recommended by many people, so I used this algorithm for actual measurement and comparison. ElementTree also includes two implementations, one is Normal ElementTree(ET), one is ElementTree.iterparse(ET_iter).

This article will conduct a horizontal comparison of the four methods of DOM, SAX, ET, and ET_iter, and evaluate the efficiency of each algorithm by comparing the time it takes to process the same files.

In the program, the four parsing methods are written as functions and called separately in the main program to evaluate their parsing efficiency.

The decompressed XML file content example is:

Analyze several ways Python parses XML

The main program function call part code is:

  print("文件计数:%d/%d." % (gz_cnt,paser_num))
  str_s,cnt = dom_parser(gz)
  #str_s,cnt = sax_parser(gz)
  #str_s,cnt = ET_parser(gz)
  #str_s,cnt = ET_parser_iter(gz)
  output.write(str_s)
  vs_cnt += cnt

In the initial function call The function returns two values, but when receiving the function call value, it is called with two variables separately, causing each function to be executed twice. It was later modified to call two variables at once to receive the return value, reducing invalid calls.

1. DOM parsing

Function definition code:

def dom_parser(gz):
  import gzip,cStringIO
  import xml.dom.minidom
  
  vs_cnt = 0
  str_s = ''
  file_io = cStringIO.StringIO()
  xm = gzip.open(gz,'rb')
  print("已读入:%s.\n解析中:" % (os.path.abspath(gz)))
  doc = xml.dom.minidom.parseString(xm.read())
  bulkPmMrDataFile = doc.documentElement
  #读入子元素
  enbs = bulkPmMrDataFile.getElementsByTagName("eNB")
  measurements = enbs[0].getElementsByTagName("measurement")
  objects = measurements[0].getElementsByTagName("object")
  #写入csv文件
  for object in objects:
    vs = object.getElementsByTagName("v")
    vs_cnt += len(vs)
    for v in vs:
      file_io.write(enbs[0].getAttribute("id")+' '+object.getAttribute("id")+' '+\
      object.getAttribute("MmeUeS1apId")+' '+object.getAttribute("MmeGroupId")+' '+object.getAttribute("MmeCode")+' '+\
      object.getAttribute("TimeStamp")+' '+v.childNodes[0].data+'\n') #获取文本值
  str_s = (((file_io.getvalue().replace(' \n','\r\n')).replace(' ',',')).replace('T',' ')).replace('NIL','')
  xm.close()
  file_io.close()
  return (str_s,vs_cnt)

Program running result:

**************** *************************************

Program processing starts.

The input directory is:/tmcdata/mro2csv/input31/.

The output directory is:/tmcdata/mro2csv/output31/.

The number of .gz files in the input directory is: 12, 12 of them will be processed this time.

************************************************ ******

File count: 1/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.

Parsing:

File count: 2/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.

Parsing:

File count: 3/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.

Parsing:

……………………………………………………

File count: 12/12.

Read in:/tmcdata/mro2csv/input31/TD- LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.

Parsing:

VS row count: 177849, running time: 107.077867, rows processed per second: 1660.

Written:/tmcdata/mro2csv/output31/mro_0001.csv.

************************************************ ******

Program processing ends.

Since DOM parsing requires reading the entire file into memory and establishing a tree structure, its memory consumption and time consumption are relatively high, but its advantage is that the logic is simple and there is no need to define a callback function, which is easy to implement.

2. SAX parsing

Function definition code:

def sax_parser(gz):
  import os,gzip,cStringIO
  from xml.parsers.expat import ParserCreate
  #变量声明
  d_eNB = {}
  d_obj = {}
  s = ''
  global flag 
  flag = False
  file_io = cStringIO.StringIO()
  
  #Sax解析类
  class DefaultSaxHandler(object):
    #处理开始标签
    def start_element(self, name, attrs):
      global d_eNB
      global d_obj
      global vs_cnt
      if name == 'eNB':
        d_eNB = attrs
      elif name == 'object':
        d_obj = attrs
      elif name == 'v':
        file_io.write(d_eNB['id']+' '+ d_obj['id']+' '+d_obj['MmeUeS1apId']+' '+d_obj['MmeGroupId']+' '+d_obj['MmeCode']+' '+d_obj['TimeStamp']+' ')
        vs_cnt += 1
      else:
        pass
    #处理中间文本
    def char_data(self, text):
      global d_eNB
      global d_obj
      global flag
      if text[0:1].isnumeric():
        file_io.write(text)
      elif text[0:17] == 'MR.LteScPlrULQci1':
        flag = True
        #print(text,flag)
      else:
        pass
    #处理结束标签
    def end_element(self, name):
      global d_eNB
      global d_obj
      if name == 'v':
        file_io.write('\n')
      else:
        pass
  
  #Sax解析调用
  handler = DefaultSaxHandler()
  parser = ParserCreate()
  parser.StartElementHandler = handler.start_element
  parser.EndElementHandler = handler.end_element
  parser.CharacterDataHandler = handler.char_data
  vs_cnt = 0
  str_s = ''
  xm = gzip.open(gz,'rb')
  print("已读入:%s.\n解析中:" % (os.path.abspath(gz)))
  for line in xm.readlines():
    parser.Parse(line) #解析xml文件内容
    if flag:
      break
  str_s = file_io.getvalue().replace(' \n','\r\n').replace(' ',',').replace('T',' ').replace('NIL','')  #写入解析后内容
  xm.close()
  file_io.close()
  return (str_s,vs_cnt)

Program running result:

**************** *************************************

Program processing starts.

The input directory is:/tmcdata/mro2csv/input31/.

The output directory is:/tmcdata/mro2csv/output31/.

The number of .gz files in the input directory is: 12, 12 of them will be processed this time.

************************************************ ******

File count: 1/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.

Parsing:

File count: 2/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.

Parsing:

File count: 3/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.

Parsing:

........................................

File count: 12/12.

Read in: /tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.

Parsing:

VS row count :177849, running time: 14.386779, rows processed per second: 12361.

Written:/tmcdata/mro2csv/output31/mro_0001.csv.

************************************************ ******

The program processing ends.

SAX parsing has a significantly shorter running time than DOM parsing. Since SAX uses line-by-line parsing, it takes up less memory for processing larger files. Therefore, SAX parsing is a parsing method that is currently widely used. The disadvantage is that you need to implement the callback function yourself, and the logic is relatively complicated.

3. ET analysis

Function definition code:

def ET_parser(gz):
  import os,gzip,cStringIO
  import xml.etree.cElementTree as ET
  vs_cnt = 0
  str_s = ''
  file_io = cStringIO.StringIO()
  xm = gzip.open(gz,'rb')
  print("已读入:%s.\n解析中:" % (os.path.abspath(gz)))
  tree = ET.ElementTree(file=xm)
  root = tree.getroot()
  for elem in root[1][0].findall('object'):
      for v in elem.findall('v'):
          file_io.write(root[1].attrib['id']+' '+elem.attrib['TimeStamp']+' '+elem.attrib['MmeCode']+' '+\
          elem.attrib['id']+' '+ elem.attrib['MmeUeS1apId']+' '+ elem.attrib['MmeGroupId']+' '+ v.text+'\n')
      vs_cnt += 1
  str_s = file_io.getvalue().replace(' \n','\r\n').replace(' ',',').replace('T',' ').replace('NIL','')  #写入解析后内容
  xm.close()
  file_io.close()
  return (str_s,vs_cnt)

Program running result:

****************** *************************************

Program processing starts.

The input directory is:/tmcdata/mro2csv/input31/.

The output directory is:/tmcdata/mro2csv/output31/.

The number of .gz files in the input directory is: 12, 12 of them will be processed this time.

************************************************ ******

File count: 1/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.

Parsing:

File count: 2/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.

Parsing:

File count: 3/12.

Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.

Parsing:

...........................................

文件计数:12/12.

已读入:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.

解析中:

VS行计数:177849,运行时间:4.308103,每秒处理行数:41282。

已写入:/tmcdata/mro2csv/output31/mro_0001.csv。

**************************************************

程序处理结束。

相较于SAX解析,ET解析时间更短,并且函数实现也比较简单,所以ET具有类似DOM的简单逻辑实现且匹敌SAX的解析效率,因此ET是目前XML解析的首选。

4、ET_iter解析

函数定义代码:

def ET_parser_iter(gz):
  import os,gzip,cStringIO
  import xml.etree.cElementTree as ET
  vs_cnt = 0
  str_s = ''
  file_io = cStringIO.StringIO()
  xm = gzip.open(gz,'rb')
  print("已读入:%s.\n解析中:" % (os.path.abspath(gz)))
  d_eNB = {}
  d_obj = {}
  i = 0
  for event,elem in ET.iterparse(xm,events=('start','end')):
    if i >= 2:
      break    
    elif event == 'start':
          if elem.tag == 'eNB':
              d_eNB = elem.attrib
          elif elem.tag == 'object':
        d_obj = elem.attrib
      elif event == 'end' and elem.tag == 'smr':
      i += 1
    elif event == 'end' and elem.tag == 'v':
      file_io.write(d_eNB['id']+' '+d_obj['TimeStamp']+' '+d_obj['MmeCode']+' '+d_obj['id']+' '+\
      d_obj['MmeUeS1apId']+' '+ d_obj['MmeGroupId']+' '+str(elem.text)+'\n')
          vs_cnt += 1
      elem.clear()
  str_s = file_io.getvalue().replace(' \n','\r\n').replace(' ',',').replace('T',' ').replace('NIL','')  #写入解析后内容
  xm.close()
  file_io.close()
  return (str_s,vs_cnt)

程序运行结果:

**************************************************

程序处理启动。

输入目录为:/tmcdata/mro2csv/input31/。

输出目录为:/tmcdata/mro2csv/output31/。

输入目录下.gz文件个数为:12,本次处理其中的12个。

**************************************************

文件计数:1/12.

已读入:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.

解析中:

文件计数:2/12.

已读入:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.

解析中:

文件计数:3/12.

已读入:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.

解析中:

...................................................

文件计数:12/12.

已读入:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.

解析中:

VS行计数:177849,运行时间:3.043805,每秒处理行数:58429。

已写入:/tmcdata/mro2csv/output31/mro_0001.csv。

**************************************************

程序处理结束。

在引入了ET_iter解析后,解析效率比ET提升了近50%,而相较于DOM解析更是提升了35倍,在解析效率提升的同时,由于其采用了iterparse这个循序解析的工具,其内存占用也是比较小的。

The above is the detailed content of Analyze several ways Python parses XML. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn