Performance comparison test of PyPy and CPython
Recently I completed some data mining tasks on Wikipedia. They consisted of these parts:
Parsing the Wikipedia dump of enwiki-pages-articles.xml;
Storing categories and pages into MongoDB;
Re-categorizing category names.
I tested the performance of CPython 2.7.3 and PyPy 2b on real tasks. The libraries I used are:
redis 2.7.2
pymongo 2.4.2
In addition, CPython had the support of the following libraries:
hiredis
pymongo c-extensions
The tests consist mainly of parsing and database work, so I did not expect much benefit from PyPy (not to mention that CPython's database drivers are written in C).
Below I will describe some interesting results.
Extract wiki page names
I needed to create a join of wiki page names to page.id for all Wikipedia categories and store it in Redis. The simplest solution would be to import enwiki-page.sql (which defines an RDB table) into MySQL and then transfer the data to Redis. But I didn't want to add a MySQL requirement (have backbone! XD), so I wrote a simple SQL insert statement parser in pure Python and imported the data directly from enwiki-page.sql into Redis.
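As a rough illustration of what such a pure-Python insert statement parser might look like, here is a minimal sketch. It is not the parser used for the measurements in this article. The function names, the plain dict standing in for Redis, and the assumption that page_id, page_namespace and page_title are the first three columns of each row are all illustrative.

```python
def convert(text, quoted):
    """Turn a raw SQL literal into a Python value."""
    if quoted:
        return text
    if text == "NULL":
        return None
    try:
        return int(text)
    except ValueError:
        return float(text)


def parse_insert_values(line):
    """Yield one tuple per row of an "INSERT INTO ... VALUES (...),(...);" line."""
    start = line.find(" VALUES ")
    if not line.startswith("INSERT INTO") or start == -1:
        return
    row, buf = [], []
    in_row = in_str = quoted = False
    i = start + len(" VALUES ")
    while i < len(line):
        ch = line[i]
        if in_str:
            if ch == "\\":
                # Keep the escaped character as-is. This is enough for \' and \\;
                # other escapes such as \n are not translated in this sketch.
                buf.append(line[i + 1])
                i += 1
            elif ch == "'":
                in_str = False
            else:
                buf.append(ch)
        elif ch == "'":
            in_str = quoted = True
        elif ch == "(" and not in_row:
            in_row = True
        elif ch in ",)" and in_row:
            # A comma or closing parenthesis outside a string ends the field.
            row.append(convert("".join(buf), quoted))
            buf, quoted = [], False
            if ch == ")":
                yield tuple(row)
                row, in_row = [], False
        elif in_row:
            buf.append(ch)
        i += 1


def title_to_id(lines):
    """Build a page title -> page id mapping (a dict stands in for Redis here)."""
    mapping = {}
    for line in lines:
        for row in parse_insert_values(line):
            page_id, title = row[0], row[2]  # assumed column order
            mapping[title] = page_id
    return mapping
```

Calling title_to_id(open("enwiki-page.sql")) would stream the dump line by line. Almost all of the time is spent in the character loop, which is the kind of code where a JIT can help.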
This task is more CPU-bound, so I was optimistic about PyPy here.
            user time    system time    CPU
PyPy        169.00s      8.52s          90%
CPython     1287.13s     8.10s          96%
I also made a similar join for page.id -> category (my laptop's memory is too small to hold all the data for my tests).
Filter categories from enwiki
For this task I chose a SAX parser, a wrapper that works in both PyPy and CPython. It is an external, natively compiled package (in both PyPy and CPython).
The code is very simple:
    from xml.sax import handler


    class Stack(list):
        """Minimal stack helper; not shown in the original, assumed here."""
        push = list.append

        def top(self):
            return self[-1]


    class WikiCategoryHandler(handler.ContentHandler):
        """Class which detects category pages and stores them separately
        """
        ignored = set(('contributor', 'comment', 'meta'))

        def __init__(self, f_out):
            handler.ContentHandler.__init__(self)
            self.f_out = f_out
            self.curr_page = None
            self.curr_tag = ''
            self.curr_elem = Element('root', {})
            self.root = self.curr_elem
            self.stack = Stack()
            self.stack.push(self.curr_elem)
            self.skip = 0

        def startElement(self, name, attrs):
            # skip ignored elements together with everything nested in them
            if self.skip > 0 or name in self.ignored:
                self.skip += 1
                return
            self.curr_tag = name
            elem = Element(name, attrs)
            if name == 'page':
                elem.ns = -1
                self.curr_page = elem
            else:   # we don't want to keep old pages in memory
                self.curr_elem.append(elem)
            self.stack.push(elem)
            self.curr_elem = elem

        def endElement(self, name):
            if self.skip > 0:
                self.skip -= 1
                return
            if name == 'page':
                self.task()
                self.curr_page = None
            self.stack.pop()
            self.curr_elem = self.stack.top()
            self.curr_tag = self.curr_elem.tag

        def characters(self, content):
            if content.isspace():
                return
            if self.skip == 0:
                self.curr_elem.append(TextElement(content))
                if self.curr_tag == 'ns':
                    self.curr_page.ns = int(content)

        def startDocument(self):
            self.f_out.write("<root>\n")

        def endDocument(self):
            self.f_out.write("</root>\n")
            print("FINISH PROCESSING WIKIPEDIA")

        def task(self):
            # namespace 14 holds the category pages
            if self.curr_page.ns == 14:
                self.f_out.write(self.curr_page.render())


    class Element(object):
        def __init__(self, tag, attrs):
            self.tag = tag
            self.attrs = attrs
            self.childrens = []
            self.append = self.childrens.append

        def __repr__(self):
            return "Element {}".format(self.tag)

        def render(self, margin=0):
            # NOTE: the joins below iterate over an empty dict, as in the
            # original, so element attributes are never rendered.
            if not self.childrens:
                return u"{0}<{1}{2} />".format(
                    " "*margin, self.tag,
                    "".join([u' {}="{}"'.format(k, v) for k, v in {}.iteritems()]))
            if isinstance(self.childrens[0], TextElement) and len(self.childrens) == 1:
                return u"{0}<{1}{2}>{3}</{1}>".format(
                    " "*margin, self.tag,
                    "".join([u' {}="{}"'.format(k, v) for k, v in {}.iteritems()]),
                    self.childrens[0].render())
            return u"{0}<{1}{2}>\n{3}\n{0}</{1}>".format(
                " "*margin, self.tag,
                "".join([u' {}="{}"'.format(k, v) for k, v in {}.iteritems()]),
                "\n".join((c.render(margin+2) for c in self.childrens)))


    class TextElement(object):
        def __init__(self, content):
            self.content = content

        def __repr__(self):
            return "TextElement"

        def render(self, margin=0):
            return self.content

Element and TextElement hold the tag and body information and provide a method to render it.
The comparison between PyPy and CPython is as follows:
            time
PyPy        2169.90s
CPython     4494.69s
I was very surprised by PyPy's results.
Computing an interesting set of categories
I wanted to compute an interesting set of categories: in the context of one of my applications, the categories derived from the Computing category. To do this I needed to build a category graph that provides the category-subcategory relations.
Building the category-subcategory relationship graph
This task uses MongoDB as the data source and stores the resulting structure in Redis. The algorithm is:
    for each category.id in redis_categories (it holds the category.id -> category title mapping) do:
        title = redis_categories.get(category.id)
        parent_categories = mongodb get categories for title
        for each parent_cat in parent_categories do:
            redis_tree.sadd(parent_cat, title)  # add title to the parent_cat set
Sorry for the pseudo code, but I wanted it to look compact.
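For readers who prefer something runnable, the loop above can be sketched in Python as follows. This is only an illustration: plain dicts and sets stand in for Redis and MongoDB, and the names build_category_tree and parent_categories_of are hypothetical, not taken from the real code.

```python
from collections import defaultdict


def build_category_tree(redis_categories, parent_categories_of):
    """Return a parent title -> set of child titles mapping.

    redis_categories: dict of category id -> category title
                      (stand-in for the Redis mapping).
    parent_categories_of: callable returning the parent category titles
                          for a title (stand-in for the MongoDB lookup).
    """
    redis_tree = defaultdict(set)
    for category_id, title in redis_categories.items():
        for parent_cat in parent_categories_of(title):
            # plays the role of redis_tree.sadd(parent_cat, title)
            redis_tree[parent_cat].add(title)
    return dict(redis_tree)


# Tiny made-up data set to show the shape of the result.
categories = {1: "Computing", 2: "Software", 3: "Programming languages"}
parents = {
    "Software": ["Computing"],
    "Programming languages": ["Software", "Computing"],
}
tree = build_category_tree(categories, lambda title: parents.get(title, []))
```

With the sample data, "Computing" ends up with two children and "Software" with one. In the real task every iteration is a round trip to a database, so the Python interpreter has less influence on the total time.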
Traversing redis_tree (the Redis tree)
Once we have the redis_tree database, the only remaining problem is to traverse all the nodes reachable from the Computing category. To avoid looping forever on cycles, we need to record the nodes that have already been visited. Since I wanted to test Python's database performance, I solved this by keeping the visited nodes in a Redis set.
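The traversal with a visited set can be sketched like this. It is a minimal in-memory version: a dict of sets stands in for redis_tree and a Python set stands in for the Redis set of visited nodes. The function name and the breadth-first order are choices made for this example.

```python
from collections import deque


def reachable_categories(tree, root):
    """Return every category reachable from root, including root itself.

    tree maps a category title to the set of its subcategory titles.
    The visited set ensures each node is expanded once, even if the
    graph contains cycles.
    """
    visited = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in tree.get(current, ()):
            if child not in visited:
                visited.add(child)
                queue.append(child)
    return visited


# Made-up graph with a cycle: Software lists Computing as a subcategory.
sample_tree = {
    "Computing": {"Software", "Hardware"},
    "Software": {"Computing", "Compilers"},
    "Biology": {"Genetics"},
}
found = reachable_categories(sample_tree, "Computing")
```

Here found contains Computing, Software, Hardware and Compilers. Biology and Genetics are never reached, and the cycle between Computing and Software does not cause an endless loop.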
Conclusion
The tests above are just a preview of my final work, which requires a body of knowledge that I obtain by extracting the appropriate content from Wikipedia.