What is the fastest way to parse large XML documents in Python?

I am currently using the following code based on Python Cookbook chapter 12.5:

    from xml.parsers import expat

    class Element(object):
        def __init__(self, name, attributes):
            self.name = name
            self.attributes = attributes
            self.cdata = ''
            self.children = []

        def addChild(self, element):
            self.children.append(element)

        def getAttribute(self, key):
            return self.attributes.get(key)

        def getData(self):
            return self.cdata

        def getElements(self, name=''):
            if name:
                return [c for c in self.children if c.name == name]
            else:
                return list(self.children)

    class Xml2Obj(object):
        def __init__(self):
            self.root = None
            self.nodeStack = []

        def StartElement(self, name, attributes):
            element = Element(name.encode(), attributes)
            if self.nodeStack:
                parent = self.nodeStack[-1]
                parent.addChild(element)
            else:
                self.root = element
            self.nodeStack.append(element)

        def EndElement(self, name):
            self.nodeStack.pop()

        def CharacterData(self, data):
            if data.strip():
                data = data.encode()
                element = self.nodeStack[-1]
                element.cdata += data

        def Parse(self, filename):
            Parser = expat.ParserCreate()
            Parser.StartElementHandler = self.StartElement
            Parser.EndElementHandler = self.EndElement
            Parser.CharacterDataHandler = self.CharacterData
            ParserStatus = Parser.Parse(open(filename).read(), 1)
            return self.root

I work with XML documents about 1 GB in size. Does anyone know a faster way to parse them?

+53
performance python xml parsing
Nov 27 '08 at 16:47
8 answers

It sounds to me as if you do not need any DOM features from your program. I would second the suggestion to use the (c)ElementTree library. If you use the iterparse function of the cElementTree module, you can work your way through the XML and deal with events as they occur.

Note, however, Fredrik Lundh's advice on using cElementTree's iterparse function:

to parse large files, you can get rid of the elements as soon as you process them:

    for event, elem in iterparse(source):
        if elem.tag == "record":
            ... process record elements ...
            elem.clear()

The above pattern has one drawback: it does not clear the root element, so you end up with a single element with lots of empty children. If your files are huge rather than just large, this can be a problem. To work around it, you need to get your hands on the root element. The easiest way to do this is to enable start events and save a reference to the first element in a variable:

    # get an iterable
    context = iterparse(source, events=("start", "end"))

    # turn it into an iterator
    context = iter(context)

    # get the root element
    event, root = context.next()

    for event, elem in context:
        if event == "end" and elem.tag == "record":
            ... process record elements ...
            root.clear()

lxml.iterparse() does not allow this.

The previous snippet does not work on Python 3.7; consider the following way to get the first element instead:

    # get an iterable
    context = iterparse(source, events=("start", "end"))

    is_first = True
    for event, elem in context:
        # get the root element
        if is_first:
            root = elem
            is_first = False
        if event == "end" and elem.tag == "record":
            ... process record elements ...
            root.clear()
+57
Nov 28 '08 at 20:03

Have you tried the cElementTree module?

cElementTree is included with Python 2.5 and later as xml.etree.cElementTree. See the benchmarks (the benchmark image originally linked here, hosted on ImageShack, is no longer available).
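
As a minimal sketch of how that drop-in module might be used for a large file (the file name 'large.xml' and the tag name 'record' are placeholders, not from the original answer):

    # Sketch: cElementTree as a drop-in, C-accelerated ElementTree.
    try:
        import xml.etree.cElementTree as ET  # C implementation (Python 2.5 - 3.8)
    except ImportError:
        import xml.etree.ElementTree as ET   # fallback on newer Pythons

    for event, elem in ET.iterparse('large.xml'):
        if elem.tag == 'record':
            # ... process the record element ...
            elem.clear()  # release children so memory stays flat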

+15
Nov 27 '08 at 19:00

I recommend you use lxml, a Python binding for the libxml2 library, which is very fast.

In my experience, libxml2 and expat have very similar performance. But I prefer libxml2 (and lxml for python) because it is more actively developed and tested. Also libxml2 has more features.

lxml is mostly compatible with the xml.etree.ElementTree API. There is good documentation on the website.
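
As a rough sketch of that compatibility (file name, tag name, and the 'id' attribute below are invented for illustration):

    # Sketch only: lxml.etree mirrors the ElementTree API.
    from lxml import etree

    tree = etree.parse('large.xml')        # whole-document parse (loads into RAM)
    for record in tree.iter('record'):     # ElementTree-style iteration
        print(record.get('id'))

    # For files that do not fit in memory, iterparse streams events instead:
    for event, elem in etree.iterparse('large.xml', tag='record'):
        # ... process elem ...
        elem.clear()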

+8
Nov 27 '08 at 17:53

Callback registration slows parsing down tremendously. [EDIT] This is because the (fast) C code has to call back into the Python interpreter, which is simply not as fast as C. Basically, you are using C code to read the file (fast) and then building the DOM in Python (slow). [/EDIT]

Try using xml.etree.cElementTree instead, which is implemented in C and can parse the XML without any callbacks into Python code.

Once the document has been parsed, you can filter it to get what you want.

If that is still too slow and you don't need a DOM, another option is to read the file into a string and use simple string operations to process it.
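
For example, a hedged sketch of the parse-then-filter approach (the file name 'large.xml' and the 'item'/'price' names are made up for illustration):

    # Sketch: parse the whole document in C, then filter in Python.
    import xml.etree.ElementTree as ET

    tree = ET.parse('large.xml')              # parsing runs in the C extension
    root = tree.getroot()

    expensive = [
        item for item in root.iter('item')    # filter afterwards, in Python
        if float(item.get('price', '0')) > 100.0
    ]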

+4
Nov 27 '08 at 16:56

If your application is performance-sensitive and likely to run into large files (as you said, > 1 GB), I would strongly advise against using the code shown in your question, for the simple reason that it loads the whole document into RAM. I would encourage you to rethink your design (if at all possible) so that you avoid holding the whole document tree in RAM at once. Not knowing what your application's requirements are, I can't properly suggest a specific approach beyond the generic advice to try to move to an event-based design.
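
As a rough illustration of what an event-based design can look like with the standard library (the 'record' tag and the handler logic here are invented for the sake of the example):

    # Sketch of an event-based (SAX) design: nothing is kept in memory
    # beyond the element currently being processed.
    import xml.sax

    class RecordHandler(xml.sax.ContentHandler):
        def __init__(self):
            xml.sax.ContentHandler.__init__(self)
            self.in_record = False
            self.buffer = []

        def startElement(self, name, attrs):
            if name == 'record':
                self.in_record = True
                self.buffer = []

        def characters(self, content):
            if self.in_record:
                self.buffer.append(content)

        def endElement(self, name):
            if name == 'record':
                text = ''.join(self.buffer)
                # ... process one record, then let it go ...
                self.in_record = False

    xml.sax.parse('large.xml', RecordHandler())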

+4
Nov 27 '08 at 21:30

expat's ParseFile works well if you do not need to store the whole tree in memory, which will sooner or later exhaust your RAM on large files:

    import xml.parsers.expat

    parser = xml.parsers.expat.ParserCreate()
    parser.ParseFile(open('path.xml', 'rb'))  # ParseFile expects a binary-mode file

It reads the file in chunks and feeds them to the parser without exhausting RAM.
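
A small sketch of the same idea with a handler attached (the 'record' tag and the counting logic are illustrative assumptions, not part of the original answer):

    # Sketch: expat streams the file in chunks; handlers receive events.
    import xml.parsers.expat

    count = {'records': 0}

    def start_element(name, attrs):
        if name == 'record':          # placeholder tag name
            count['records'] += 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    with open('path.xml', 'rb') as f:
        parser.ParseFile(f)

    print(count['records'])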

Doc: https://docs.python.org/2/library/pyexpat.html#xml.parsers.expat.xmlparser.ParseFile

+1
Nov 19 '15 at 10:50

Apparently PyRXP is really fast.

They claim that it is the fastest parser, but cElementTree is not in their benchmark list.

0
Nov 29 '08 at 0:17

I spent quite a bit of time trying this, and it seems the fastest and least memory-intensive approach is to use lxml with iterparse, but you have to make sure that unneeded memory gets freed. In my example, parsing an arXiv dump:

    from lxml import etree

    context = etree.iterparse('path/to/file', events=('end',), tag='Record')

    for event, element in context:
        record_id = element.findtext('.//{http://arxiv.org/OAI/arXiv/}id')
        created = element.findtext('.//{http://arxiv.org/OAI/arXiv/}created')
        print(record_id, created)

        # Free memory.
        element.clear()
        while element.getprevious() is not None:
            del element.getparent()[0]

So element.clear() alone is not enough; you also have to remove the references to previously parsed elements.

0
May 9 '19 at 22:42


