When I first learned Python, the only XML parsing approaches I knew were DOM and SAX, and neither was efficient enough. With a large number of files to process, both took unacceptably long.
After searching online, I found that ElementTree is widely used, relatively efficient, and recommended by many people, so I measured it against the other two. ElementTree offers two approaches: ordinary ElementTree (ET) and ElementTree.iterparse (ET_iter).
This article compares the four methods side by side (DOM, SAX, ET, and ET_iter) and evaluates each one by the time it takes to process the same files.
In the program, each of the four parsing methods is written as a function, and the main program calls them one at a time to measure their parsing efficiency.
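The measurement idea can be sketched as a small timing harness. This is only an illustration written for Python 3, not the article's main program: `benchmark` and `fake_parser` are hypothetical names, and the stand-in parser simply returns a fixed (text, row count) pair in the same shape the four parsing functions return.

```python
import time

def benchmark(parser, paths):
    """Call `parser` once per path; return (total_rows, seconds, rows_per_second)."""
    start = time.perf_counter()
    total_rows = 0
    for path in paths:
        _text, rows = parser(path)  # each parser returns (csv_text, row_count)
        total_rows += rows
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very fast runs
    rate = total_rows / max(elapsed, 1e-9)
    return total_rows, elapsed, rate

def fake_parser(path):
    """Hypothetical stand-in for dom_parser, sax_parser, ET_parser or ET_parser_iter."""
    return "a,b\r\n" * 3, 3

rows, seconds, rate = benchmark(fake_parser, ["f1.xml.gz", "f2.xml.gz"])
print("VS row count: %d, running time: %f, rows processed per second: %d."
      % (rows, seconds, rate))
```

Passing the parser in as an argument keeps the timing code identical for all four methods, so only the parsing itself differs between runs.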
The decompressed XML file content example is:
The function-call part of the main program is:
    print("File count: %d/%d." % (gz_cnt,paser_num))
    str_s,cnt = dom_parser(gz)
    #str_s,cnt = sax_parser(gz)
    #str_s,cnt = ET_parser(gz)
    #str_s,cnt = ET_parser_iter(gz)
    output.write(str_s)
    vs_cnt += cnt
Each function returns two values. In my first version I received them with two separate calls, one per variable, so every parsing function ran twice. I later changed this to a single call that unpacks both return values at once, which removed the redundant calls.
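The difference between the two calling styles can be sketched as follows. The original first version is not shown in the article, so the "wasteful" form below is only a guess at what it looked like; `parse_stub` is a hypothetical stand-in that counts how often it is called.

```python
call_count = 0

def parse_stub(path):
    """Hypothetical stand-in for a parser: returns (csv_text, row_count)."""
    global call_count
    call_count += 1
    return "1,2,3\r\n", 1

# Wasteful: one call per value, so the file would be parsed twice
str_s = parse_stub("a.xml.gz")[0]
cnt = parse_stub("a.xml.gz")[1]
calls_wasteful = call_count

# Better: one call, both return values unpacked at once
call_count = 0
str_s, cnt = parse_stub("a.xml.gz")
calls_single = call_count

print(calls_wasteful, calls_single)
```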
1. DOM parsing
Function definition code (the functions in this article are Python 2 code; cStringIO and xml.etree.cElementTree do not exist in current Python 3):
    def dom_parser(gz):
        import os, gzip, cStringIO
        import xml.dom.minidom
        vs_cnt = 0
        str_s = ''
        file_io = cStringIO.StringIO()
        xm = gzip.open(gz, 'rb')
        print("Read in:%s.\nParsing:" % (os.path.abspath(gz)))
        doc = xml.dom.minidom.parseString(xm.read())
        bulkPmMrDataFile = doc.documentElement
        # Read the child elements
        enbs = bulkPmMrDataFile.getElementsByTagName("eNB")
        measurements = enbs[0].getElementsByTagName("measurement")
        objects = measurements[0].getElementsByTagName("object")
        # Build the CSV text
        for obj in objects:  # renamed from `object`, which shadows the built-in
            vs = obj.getElementsByTagName("v")
            vs_cnt += len(vs)
            for v in vs:
                file_io.write(enbs[0].getAttribute("id") + ' ' + obj.getAttribute("id") + ' ' +
                              obj.getAttribute("MmeUeS1apId") + ' ' + obj.getAttribute("MmeGroupId") + ' ' +
                              obj.getAttribute("MmeCode") + ' ' + obj.getAttribute("TimeStamp") + ' ' +
                              v.childNodes[0].data + '\n')  # text value of <v>
        # Spaces become commas, 'T' becomes a space, 'NIL' becomes empty
        str_s = file_io.getvalue().replace(' \n', '\r\n').replace(' ', ',').replace('T', ' ').replace('NIL', '')
        xm.close()
        file_io.close()
        return (str_s, vs_cnt)
Program running result:
**************************************************
Program processing starts.
The input directory is:/tmcdata/mro2csv/input31/.
The output directory is:/tmcdata/mro2csv/output31/.
The number of .gz files in the input directory is: 12, 12 of them will be processed this time.
**************************************************
File count: 1/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.
Parsing:
File count: 2/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.
Parsing:
File count: 3/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.
Parsing:
........................................
File count: 12/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.
Parsing:
VS row count: 177849, running time: 107.077867, rows processed per second: 1660.
Written:/tmcdata/mro2csv/output31/mro_0001.csv.
**************************************************
Program processing ends.
Because DOM parsing reads the entire file into memory and builds a tree structure from it, it uses a lot of memory and time. Its advantage is simple logic: no callback functions need to be defined, so it is easy to implement.
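To make the "whole tree in memory" point concrete, here is a minimal Python 3 sketch of the DOM approach. The sample document is hypothetical: only its element and attribute names are taken from dom_parser above, and the values are invented. `dom_rows` is an illustrative helper, not the article's function.

```python
import xml.dom.minidom

# Hypothetical sample; names follow dom_parser, values are invented
SAMPLE = (
    '<bulkPmMrDataFile><fileHeader/>'
    '<eNB id="1001"><measurement><smr>MR.A MR.B</smr>'
    '<object id="1" MmeUeS1apId="11" MmeGroupId="22" MmeCode="3" '
    'TimeStamp="2016-02-24T06:00:00.000">'
    '<v>30 20</v><v>31 NIL</v>'
    '</object></measurement></eNB></bulkPmMrDataFile>'
)

def dom_rows(xml_text):
    """Parse the whole document into a tree, then walk it to collect rows."""
    doc = xml.dom.minidom.parseString(xml_text)  # the full tree is built here
    rows = []
    for enb in doc.getElementsByTagName("eNB"):
        for obj in enb.getElementsByTagName("object"):
            for v in obj.getElementsByTagName("v"):
                rows.append([enb.getAttribute("id"),
                             obj.getAttribute("id"),
                             obj.getAttribute("TimeStamp")]
                            + v.firstChild.data.split())
    doc.unlink()  # help release the tree once it is no longer needed
    return rows

rows = dom_rows(SAMPLE)
for row in rows:
    print(",".join(row))
```

Nothing can be emitted until parseString has finished, which is why time and memory both grow with file size.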
2. SAX parsing
Function definition code:
    def sax_parser(gz):
        import os, gzip, cStringIO
        from xml.parsers.expat import ParserCreate

        file_io = cStringIO.StringIO()

        # SAX handler class; parsing state is kept on the instance instead of in globals
        class DefaultSaxHandler(object):
            def __init__(self):
                self.d_eNB = {}
                self.d_obj = {}
                self.vs_cnt = 0
                self.flag = False

            # Handle a start tag
            def start_element(self, name, attrs):
                if name == 'eNB':
                    self.d_eNB = attrs
                elif name == 'object':
                    self.d_obj = attrs
                elif name == 'v':
                    file_io.write(self.d_eNB['id'] + ' ' + self.d_obj['id'] + ' ' +
                                  self.d_obj['MmeUeS1apId'] + ' ' + self.d_obj['MmeGroupId'] + ' ' +
                                  self.d_obj['MmeCode'] + ' ' + self.d_obj['TimeStamp'] + ' ')
                    self.vs_cnt += 1

            # Handle character data between tags
            def char_data(self, text):
                if text[0:1].isnumeric():
                    file_io.write(text)
                elif text[0:17] == 'MR.LteScPlrULQci1':
                    self.flag = True  # stop flag checked by the read loop below

            # Handle an end tag
            def end_element(self, name):
                if name == 'v':
                    file_io.write('\n')

        # Attach the handler to the expat parser
        handler = DefaultSaxHandler()
        parser = ParserCreate()
        parser.StartElementHandler = handler.start_element
        parser.EndElementHandler = handler.end_element
        parser.CharacterDataHandler = handler.char_data
        xm = gzip.open(gz, 'rb')
        print("Read in:%s.\nParsing:" % (os.path.abspath(gz)))
        for line in xm.readlines():
            parser.Parse(line)  # feed the XML content to the parser line by line
            if handler.flag:
                break
        # Format the parsed content as CSV text
        str_s = file_io.getvalue().replace(' \n', '\r\n').replace(' ', ',').replace('T', ' ').replace('NIL', '')
        xm.close()
        file_io.close()
        return (str_s, handler.vs_cnt)
Program running result:
**************************************************
Program processing starts.
The input directory is:/tmcdata/mro2csv/input31/.
The output directory is:/tmcdata/mro2csv/output31/.
The number of .gz files in the input directory is: 12, 12 of them will be processed this time.
**************************************************
File count: 1/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.
Parsing:
File count: 2/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.
Parsing:
File count: 3/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.
Parsing:
........................................
File count: 12/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.
Parsing:
VS row count: 177849, running time: 14.386779, rows processed per second: 12361.
Written:/tmcdata/mro2csv/output31/mro_0001.csv.
**************************************************
Program processing ends.
SAX parsing runs in far less time than DOM parsing. Because SAX is event-driven and the file is fed to the parser line by line instead of being built into a tree, it uses less memory on large files, which is why it is widely used. The disadvantage is that you have to implement the callback functions yourself, and the logic is more complicated.
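The callback style can be sketched in Python 3 as below. This is an illustration under assumptions, not the article's code: the sample document is hypothetical (names taken from sax_parser, values invented), and `RowHandler` and `sax_rows` are made-up names. The input is fed in small chunks to show that the parser never needs the whole document at once.

```python
from xml.parsers.expat import ParserCreate

# Hypothetical sample; names follow sax_parser, values are invented
SAMPLE = (
    '<bulkPmMrDataFile><fileHeader/>'
    '<eNB id="1001"><measurement><smr>MR.A MR.B</smr>'
    '<object id="1" MmeUeS1apId="11" MmeGroupId="22" MmeCode="3" '
    'TimeStamp="2016-02-24T06:00:00.000">'
    '<v>30 20</v><v>31 NIL</v>'
    '</object></measurement></eNB></bulkPmMrDataFile>'
)

class RowHandler:
    """Collects one row per <v> element; all state lives on the instance."""

    def __init__(self):
        self.enb_id = None
        self.obj_id = None
        self.in_v = False
        self.buf = []
        self.rows = []

    def start(self, name, attrs):
        if name == "eNB":
            self.enb_id = attrs["id"]
        elif name == "object":
            self.obj_id = attrs["id"]
        elif name == "v":
            self.in_v = True
            self.buf = []

    def chars(self, text):
        # Character data may arrive in several pieces, so buffer it
        if self.in_v:
            self.buf.append(text)

    def end(self, name):
        if name == "v":
            self.rows.append([self.enb_id, self.obj_id] + "".join(self.buf).split())
            self.in_v = False

def sax_rows(chunks):
    handler = RowHandler()
    parser = ParserCreate()
    parser.StartElementHandler = handler.start
    parser.EndElementHandler = handler.end
    parser.CharacterDataHandler = handler.chars
    for chunk in chunks:
        parser.Parse(chunk, False)  # more data will follow
    parser.Parse("", True)          # signal the end of the document
    return handler.rows

chunks = [SAMPLE[i:i + 16] for i in range(0, len(SAMPLE), 16)]
rows = sax_rows(chunks)
print(rows)
```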
3. ET parsing
Function definition code:
    def ET_parser(gz):
        import os, gzip, cStringIO
        import xml.etree.cElementTree as ET
        vs_cnt = 0
        str_s = ''
        file_io = cStringIO.StringIO()
        xm = gzip.open(gz, 'rb')
        print("Read in:%s.\nParsing:" % (os.path.abspath(gz)))
        tree = ET.ElementTree(file=xm)
        root = tree.getroot()
        # root[1] is the eNB element; root[1][0] is its first measurement
        for elem in root[1][0].findall('object'):
            for v in elem.findall('v'):
                file_io.write(root[1].attrib['id'] + ' ' + elem.attrib['TimeStamp'] + ' ' + elem.attrib['MmeCode'] + ' ' +
                              elem.attrib['id'] + ' ' + elem.attrib['MmeUeS1apId'] + ' ' + elem.attrib['MmeGroupId'] + ' ' +
                              v.text + '\n')
                vs_cnt += 1
        # Format the parsed content as CSV text
        str_s = file_io.getvalue().replace(' \n', '\r\n').replace(' ', ',').replace('T', ' ').replace('NIL', '')
        xm.close()
        file_io.close()
        return (str_s, vs_cnt)
Program running result:
**************************************************
Program processing starts.
The input directory is:/tmcdata/mro2csv/input31/.
The output directory is:/tmcdata/mro2csv/output31/.
The number of .gz files in the input directory is: 12, 12 of them will be processed this time.
**************************************************
File count: 1/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.
Parsing:
File count: 2/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.
Parsing:
File count: 3/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.
Parsing:
........................................
File count: 12/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.
Parsing:
VS row count: 177849, running time: 4.308103, rows processed per second: 41282.
Written:/tmcdata/mro2csv/output31/mro_0001.csv.
**************************************************
Program processing ends.
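Compared with SAX parsing, ET parsing takes even less time, and the function is also simple to implement. ET combines DOM-like simple logic with parsing efficiency that rivals SAX, so ET is currently the first choice for XML parsing.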
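A minimal Python 3 sketch of the ET approach follows. It imports xml.etree.ElementTree, since the cElementTree module used above was removed in Python 3.9 and the plain module picks up the C accelerator automatically. The sample document is hypothetical (names from ET_parser, values invented), and `et_rows` is an illustrative helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; names follow ET_parser, values are invented
SAMPLE = (
    '<bulkPmMrDataFile><fileHeader/>'
    '<eNB id="1001"><measurement><smr>MR.A MR.B</smr>'
    '<object id="1" MmeUeS1apId="11" MmeGroupId="22" MmeCode="3" '
    'TimeStamp="2016-02-24T06:00:00.000">'
    '<v>30 20</v><v>31 NIL</v>'
    '</object></measurement></eNB></bulkPmMrDataFile>'
)

def et_rows(xml_text):
    """Build the element tree, then select nodes by tag path instead of by index."""
    root = ET.fromstring(xml_text)
    rows = []
    for enb in root.findall("eNB"):
        for obj in enb.findall("measurement/object"):
            for v in obj.findall("v"):
                rows.append([enb.get("id"), obj.get("id")] + (v.text or "").split())
    return rows

rows = et_rows(SAMPLE)
print(rows)
```

Selecting by tag path rather than by position such as root[1][0] is a design choice of this sketch; it keeps working if the order of sibling elements changes.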
4. ET_iter parsing
Function definition code:
    def ET_parser_iter(gz):
        import os, gzip, cStringIO
        import xml.etree.cElementTree as ET
        vs_cnt = 0
        str_s = ''
        file_io = cStringIO.StringIO()
        xm = gzip.open(gz, 'rb')
        print("Read in:%s.\nParsing:" % (os.path.abspath(gz)))
        d_eNB = {}
        d_obj = {}
        i = 0  # number of <smr> elements that have closed so far
        for event, elem in ET.iterparse(xm, events=('start', 'end')):
            if i >= 2:
                break  # stop once the second <smr> element has closed
            elif event == 'start':
                if elem.tag == 'eNB':
                    d_eNB = elem.attrib
                elif elem.tag == 'object':
                    d_obj = elem.attrib
            elif event == 'end' and elem.tag == 'smr':
                i += 1
            elif event == 'end' and elem.tag == 'v':
                file_io.write(d_eNB['id'] + ' ' + d_obj['TimeStamp'] + ' ' + d_obj['MmeCode'] + ' ' + d_obj['id'] + ' ' +
                              d_obj['MmeUeS1apId'] + ' ' + d_obj['MmeGroupId'] + ' ' + str(elem.text) + '\n')
                vs_cnt += 1
                elem.clear()  # release the processed <v> element
        # Format the parsed content as CSV text
        str_s = file_io.getvalue().replace(' \n', '\r\n').replace(' ', ',').replace('T', ' ').replace('NIL', '')
        xm.close()
        file_io.close()
        return (str_s, vs_cnt)
Program running result:
**************************************************
Program processing starts.
The input directory is:/tmcdata/mro2csv/input31/.
The output directory is:/tmcdata/mro2csv/output31/.
The number of .gz files in the input directory is: 12, 12 of them will be processed this time.
**************************************************
File count: 1/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_234598_20160224060000.xml.gz.
Parsing:
File count: 2/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_233798_20160224060000.xml.gz.
Parsing:
File count: 3/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_123798_20160224060000.xml.gz.
Parsing:
........................................
File count: 12/12.
Read in:/tmcdata/mro2csv/input31/TD-LTE_MRO_NSN_OMC_235598_20160224060000.xml.gz.
Parsing:
VS row count: 177849, running time: 3.043805, rows processed per second: 58429.
Written:/tmcdata/mro2csv/output31/mro_0001.csv.
**************************************************
Program processing ends.
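With ET_iter, throughput is about 40% higher than with ET (58,429 vs. 41,282 rows per second) and about 35 times higher than with DOM (1,660 rows per second). Because it uses iterparse, a tool that parses the document sequentially, its memory usage is also relatively small.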
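The memory behaviour of iterparse can be sketched in Python 3 as follows. This is an assumption-laden illustration rather than the article's function: the sample document is hypothetical (names from ET_parser_iter, values invented), `iter_rows` is a made-up helper, and clearing each finished object element is one possible policy for freeing memory.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; names follow ET_parser_iter, values are invented
SAMPLE = (
    '<bulkPmMrDataFile><fileHeader/>'
    '<eNB id="1001"><measurement><smr>MR.A MR.B</smr>'
    '<object id="1" MmeUeS1apId="11" MmeGroupId="22" MmeCode="3" '
    'TimeStamp="2016-02-24T06:00:00.000">'
    '<v>30 20</v><v>31 NIL</v>'
    '</object></measurement></eNB></bulkPmMrDataFile>'
)

def iter_rows(stream):
    """Consume parse events one at a time and discard finished subtrees."""
    enb_id = obj_id = None
    rows = []
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            # Attributes are already available at the start event
            if elem.tag == "eNB":
                enb_id = elem.get("id")
            elif elem.tag == "object":
                obj_id = elem.get("id")
        elif elem.tag == "v":
            # Text is only complete at the end event
            rows.append([enb_id, obj_id] + (elem.text or "").split())
        elif elem.tag == "object":
            elem.clear()  # drop the finished object and its <v> children
    return rows

rows = iter_rows(io.BytesIO(SAMPLE.encode("utf-8")))
print(rows)
```

Because each object is cleared as soon as it closes, the amount of tree held in memory stays roughly constant no matter how many objects the file contains.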