Analyze several ways Python parses XML
When I first learned Python, I only knew of two XML parsing methods, DOM and SAX, and neither performed well. Given the large number of files that needed to be processed, both were unacceptably slow.
After searching online, I found that ElementTree is widely used, relatively efficient, and frequently recommended, so I measured and compared it as well. ElementTree itself was tested in two ways: plain ElementTree (ET) and ElementTree.iterparse (ET_iter).
This article compares the four approaches, DOM, SAX, ET, and ET_iter, side by side, evaluating each one by the time it takes to process the same set of files.
In the program, each of the four parsing methods is implemented as a function and called from the main program in turn, so their parsing efficiency can be measured separately.
The decompressed XML file content example is:
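(The original sample is not reproduced here. The skeleton below is inferred from the element and attribute names used by the parsing functions later in the article; the fileHeader element and all values are placeholders.)

<bulkPmMrDataFile>
  <fileHeader .../>
  <eNB id="...">
    <measurement>
      <smr>... MR.LteScPlrULQci1 ...</smr>
      <object id="..." MmeUeS1apId="..." MmeGroupId="..." MmeCode="..." TimeStamp="...">
        <v>... NIL ...</v>
        <v>...</v>
      </object>
    </measurement>
  </eNB>
</bulkPmMrDataFile>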
The function-call section of the main program is:
print("文件计数:%d/%d." % (gz_cnt,paser_num)) str_s,cnt = dom_parser(gz) #str_s,cnt = sax_parser(gz) #str_s,cnt = ET_parser(gz) #str_s,cnt = ET_parser_iter(gz) output.write(str_s) vs_cnt += cnt
In the initial version, each parser function returns two values, but the return value was received with two separate calls (one per variable), so every parser was executed twice per file. It was later changed to unpack both values from a single call, eliminating the redundant invocation.
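As an illustration (the exact original form is not shown in the article, so this is only a guess at the shape of the mistake):

# original pattern: two separate calls to receive the two return values,
# so the parser ran twice for every file
str_s = dom_parser(gz)[0]
cnt = dom_parser(gz)[1]

# revised pattern: a single call whose returned tuple is unpacked into both variables
str_s, cnt = dom_parser(gz)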
Function definition code:
def dom_parser(gz):
    import os, gzip, cStringIO
    import xml.dom.minidom

    vs_cnt = 0
    str_s = ''

    file_io = cStringIO.StringIO()
    xm = gzip.open(gz, 'rb')
    print("Read in: %s.\nParsing:" % (os.path.abspath(gz)))
    doc = xml.dom.minidom.parseString(xm.read())
    bulkPmMrDataFile = doc.documentElement
    # read the child elements
    enbs = bulkPmMrDataFile.getElementsByTagName("eNB")
    measurements = enbs[0].getElementsByTagName("measurement")
    objects = measurements[0].getElementsByTagName("object")
    # write the rows for the csv file
    for object in objects:
        vs = object.getElementsByTagName("v")
        vs_cnt += len(vs)
        for v in vs:
            file_io.write(enbs[0].getAttribute("id")+' '+object.getAttribute("id")+' '+\
                object.getAttribute("MmeUeS1apId")+' '+object.getAttribute("MmeGroupId")+' '+object.getAttribute("MmeCode")+' '+\
                object.getAttribute("TimeStamp")+' '+v.childNodes[0].data+'\n')  # get the text value
    str_s = (((file_io.getvalue().replace(' \n','\r\n')).replace(' ',',')).replace('T',' ')).replace('NIL','')
    xm.close()
    file_io.close()
    return (str_s, vs_cnt)
Program running result:
**************************************************
Program processing starts.
The input directory is:/tmcdata/mro2csv/input31/.
The output directory is:/tmcdata/mro2csv/output31/.
The number of .gz files in the input directory is: 12, 12 of them will be processed this time.
**************************************************
File count: 1/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.
Parsing:
File count: 2/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.
Parsing:
File count: 3/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.
Parsing:
……………………………………………………
File count: 12/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.
Parsing:
VS row count: 177849, running time: 107.077867, rows processed per second: 1660.
Written:/tmcdata/mro2csv/output31/mro_0001.csv.
**************************************************
Program processing ends.
Since DOM parsing reads the entire file into memory and builds a tree structure, its memory and time costs are relatively high. Its advantage is simple logic: there are no callback functions to define, so it is easy to implement.
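For readers who want to try the DOM approach in isolation, here is a minimal, self-contained sketch in Python 3 (the article's code targets Python 2), using an inline XML string instead of the .gz input and only a few of the fields:

import xml.dom.minidom

xml_text = """<bulkPmMrDataFile>
  <eNB id="101">
    <measurement>
      <object id="7" TimeStamp="2016-02-24T06:00:00">
        <v>-95 3</v>
      </object>
    </measurement>
  </eNB>
</bulkPmMrDataFile>"""

# DOM: parse the whole document into an in-memory tree, then walk it
doc = xml.dom.minidom.parseString(xml_text)
root = doc.documentElement
for enb in root.getElementsByTagName("eNB"):
    for obj in enb.getElementsByTagName("object"):
        for v in obj.getElementsByTagName("v"):
            print(enb.getAttribute("id"), obj.getAttribute("id"), v.childNodes[0].data)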
Function definition code:
def sax_parser(gz):
    import os, gzip, cStringIO
    from xml.parsers.expat import ParserCreate

    # variable declarations
    d_eNB = {}
    d_obj = {}
    s = ''
    global flag
    flag = False
    file_io = cStringIO.StringIO()

    # SAX handler class
    class DefaultSaxHandler(object):
        # handle start tags
        def start_element(self, name, attrs):
            global d_eNB
            global d_obj
            global vs_cnt
            if name == 'eNB':
                d_eNB = attrs
            elif name == 'object':
                d_obj = attrs
            elif name == 'v':
                file_io.write(d_eNB['id']+' '+d_obj['id']+' '+d_obj['MmeUeS1apId']+' '+d_obj['MmeGroupId']+' '+d_obj['MmeCode']+' '+d_obj['TimeStamp']+' ')
                vs_cnt += 1
            else:
                pass
        # handle character data
        def char_data(self, text):
            global d_eNB
            global d_obj
            global flag
            if text[0:1].isnumeric():
                file_io.write(text)
            elif text[0:17] == 'MR.LteScPlrULQci1':
                flag = True
                #print(text, flag)
            else:
                pass
        # handle end tags
        def end_element(self, name):
            global d_eNB
            global d_obj
            if name == 'v':
                file_io.write('\n')
            else:
                pass

    # set up and invoke the SAX (expat) parser
    handler = DefaultSaxHandler()
    parser = ParserCreate()
    parser.StartElementHandler = handler.start_element
    parser.EndElementHandler = handler.end_element
    parser.CharacterDataHandler = handler.char_data
    vs_cnt = 0
    str_s = ''
    xm = gzip.open(gz, 'rb')
    print("Read in: %s.\nParsing:" % (os.path.abspath(gz)))
    for line in xm.readlines():
        parser.Parse(line)  # parse the xml content line by line
        if flag:
            break
    str_s = file_io.getvalue().replace(' \n','\r\n').replace(' ',',').replace('T',' ').replace('NIL','')  # assemble the parsed content
    xm.close()
    file_io.close()
    return (str_s, vs_cnt)
Program running result:
**************************************************
Program processing starts.
The input directory is:/tmcdata/mro2csv/input31/.
The output directory is:/tmcdata/mro2csv/output31/.
The number of .gz files in the input directory is: 12, 12 of them will be processed this time.
**************************************************
File count: 1/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.
Parsing:
File count: 2/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.
Parsing:
File count: 3/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.
Parsing:
........................................
File count: 12/12.
Read in: /tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.
Parsing:
VS row count: 177849, running time: 14.386779, rows processed per second: 12361.
Written:/tmcdata/mro2csv/output31/mro_0001.csv.
**************************************************
Program processing ends.
SAX parsing runs in significantly less time than DOM parsing. Because SAX processes the document incrementally (here the file is fed to the parser line by line), it also uses far less memory on large files, which is why it is so widely used. The drawback is that you must implement the callback functions yourself, so the logic is comparatively involved.
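The callback style can be seen in a minimal sketch like the following (Python 3, expat as in the article, inline XML and no CSV buffering; the helper names here are made up for the example):

from xml.parsers.expat import ParserCreate

xml_text = """<bulkPmMrDataFile>
  <eNB id="101">
    <measurement>
      <object id="7" TimeStamp="2016-02-24T06:00:00"><v>-95 3</v></object>
    </measurement>
  </eNB>
</bulkPmMrDataFile>"""

rows, context = [], {}

def start_element(name, attrs):
    # remember the attributes of the enclosing eNB/object elements
    if name in ('eNB', 'object'):
        context[name] = attrs

def char_data(text):
    # only <v> elements carry non-whitespace text in this format
    if text.strip():
        rows.append((context['eNB']['id'], context['object']['id'], text.strip()))

parser = ParserCreate()
parser.StartElementHandler = start_element
parser.CharacterDataHandler = char_data
parser.Parse(xml_text, True)
print(rows)  # [('101', '7', '-95 3')]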
Function definition code:
def ET_parser(gz):
    import os, gzip, cStringIO
    import xml.etree.cElementTree as ET

    vs_cnt = 0
    str_s = ''
    file_io = cStringIO.StringIO()
    xm = gzip.open(gz, 'rb')
    print("Read in: %s.\nParsing:" % (os.path.abspath(gz)))
    tree = ET.ElementTree(file=xm)
    root = tree.getroot()
    for elem in root[1][0].findall('object'):
        for v in elem.findall('v'):
            file_io.write(root[1].attrib['id']+' '+elem.attrib['TimeStamp']+' '+elem.attrib['MmeCode']+' '+\
                elem.attrib['id']+' '+elem.attrib['MmeUeS1apId']+' '+elem.attrib['MmeGroupId']+' '+v.text+'\n')
            vs_cnt += 1
    str_s = file_io.getvalue().replace(' \n','\r\n').replace(' ',',').replace('T',' ').replace('NIL','')  # assemble the parsed content
    xm.close()
    file_io.close()
    return (str_s, vs_cnt)
Program running result:
**************************************************
Program processing starts.
The input directory is:/tmcdata/mro2csv/input31/.
The output directory is:/tmcdata/mro2csv/output31/.
The number of .gz files in the input directory is: 12, 12 of them will be processed this time.
**************************************************
File count: 1/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.
Parsing:
File count: 2/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.
Parsing:
File count: 3/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.
Parsing:
...........................................
File count: 12/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.
Parsing:
VS row count: 177849, running time: 4.308103, rows processed per second: 41282.
Written:/tmcdata/mro2csv/output31/mro_0001.csv.
**************************************************
Program processing ends.
Compared with SAX parsing, ET parsing takes even less time and the function is simpler to implement. ET therefore combines DOM-like simplicity of logic with parsing efficiency that rivals SAX, which makes it the current first choice for XML parsing.
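A minimal ElementTree sketch (Python 3, where cElementTree has been folded into xml.etree.ElementTree; path expressions replace the index-based root[1][0] access used above):

import xml.etree.ElementTree as ET

xml_text = """<bulkPmMrDataFile>
  <eNB id="101">
    <measurement>
      <object id="7" TimeStamp="2016-02-24T06:00:00"><v>-95 3</v></object>
    </measurement>
  </eNB>
</bulkPmMrDataFile>"""

root = ET.fromstring(xml_text)
for enb in root.findall('eNB'):
    for obj in enb.findall('./measurement/object'):
        for v in obj.findall('v'):
            # attributes via .get(), element text via .text
            print(enb.get('id'), obj.get('id'), obj.get('TimeStamp'), v.text)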
Function definition code:
def ET_parser_iter(gz):
    import os, gzip, cStringIO
    import xml.etree.cElementTree as ET

    vs_cnt = 0
    str_s = ''
    file_io = cStringIO.StringIO()
    xm = gzip.open(gz, 'rb')
    print("Read in: %s.\nParsing:" % (os.path.abspath(gz)))
    d_eNB = {}
    d_obj = {}
    i = 0
    for event, elem in ET.iterparse(xm, events=('start', 'end')):
        if i >= 2:
            break
        elif event == 'start':
            if elem.tag == 'eNB':
                d_eNB = elem.attrib
            elif elem.tag == 'object':
                d_obj = elem.attrib
        elif event == 'end' and elem.tag == 'smr':
            i += 1
        elif event == 'end' and elem.tag == 'v':
            file_io.write(d_eNB['id']+' '+d_obj['TimeStamp']+' '+d_obj['MmeCode']+' '+d_obj['id']+' '+\
                d_obj['MmeUeS1apId']+' '+d_obj['MmeGroupId']+' '+str(elem.text)+'\n')
            vs_cnt += 1
            elem.clear()
    str_s = file_io.getvalue().replace(' \n','\r\n').replace(' ',',').replace('T',' ').replace('NIL','')  # assemble the parsed content
    xm.close()
    file_io.close()
    return (str_s, vs_cnt)
Program running result:
**************************************************
Program processing starts.
The input directory is:/tmcdata/mro2csv/input31/.
The output directory is:/tmcdata/mro2csv/output31/.
The number of .gz files in the input directory is: 12, 12 of them will be processed this time.
**************************************************
File count: 1/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.
Parsing:
File count: 2/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.
Parsing:
File count: 3/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.
Parsing:
...................................................
File count: 12/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.
Parsing:
VS row count: 177849, running time: 3.043805, rows processed per second: 58429.
Written:/tmcdata/mro2csv/output31/mro_0001.csv.
**************************************************
Program processing ends.
With ET_iter parsing, efficiency improved by nearly 50% over ET, and by roughly 35x over DOM parsing. At the same time, because iterparse processes the document incrementally, its memory footprint also stays relatively small.
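The memory behaviour comes from the combination of iterparse and elem.clear(): elements are handed to the caller as soon as they are complete and can be discarded immediately. A minimal Python 3 sketch of that pattern (inline XML instead of the article's .gz stream):

import io
import xml.etree.ElementTree as ET

xml_text = b"""<bulkPmMrDataFile>
  <eNB id="101">
    <measurement>
      <object id="7" TimeStamp="2016-02-24T06:00:00"><v>-95 3</v></object>
    </measurement>
  </eNB>
</bulkPmMrDataFile>"""

context = {}
# iterparse yields (event, element) pairs as the input is read,
# so the whole tree never has to be held in memory at once
for event, elem in ET.iterparse(io.BytesIO(xml_text), events=('start', 'end')):
    if event == 'start' and elem.tag in ('eNB', 'object'):
        context[elem.tag] = dict(elem.attrib)   # copy: the element may be cleared later
    elif event == 'end' and elem.tag == 'v':
        print(context['eNB']['id'], context['object']['id'], elem.text)
        elem.clear()  # free the finished element so memory use stays flat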