Why is Python XML parsing speed inconsistent?

I am parsing a large (12 GB) XML file built from approximately 135 thousand fairly similar records (it is an nmap dump). I noticed that the parsing speed is inconsistent: the time needed to parse a fixed number of records varies greatly.

The following reduced code prints the time needed to parse each 1% of the records:

    from xml.etree.ElementTree import iterparse
    import time

    nrhosts = 0
    previous = time.time()

    context = iterparse("test.xml", events=("start", "end"))
    context = iter(context)
    event, root = context.next()

    for event, elem in context:
        if event == 'end' and elem.tag == "host":
            root.clear()  # clean up memory
            nrhosts += 1
            if nrhosts % 1349 == 0:  # hardcoded to estimate the %, as there are ~135k records
                now = time.time()
                print nrhosts // 1349, now - previous
                previous = now

This gives (first column: percent of records parsed, second column: seconds taken for that 1%):

     1  2.43700003624      2  3.13999986649      3  2.87700009346
     4  2.59200000763      5  65.8800001144      6  47.6069998741
     7  43.6809999943      8  29.7590000629      9  11.8629999161
    10  4.52200007439     11  40.0160000324     12  42.2109999657
    13  45.9930000305     14  29.1139998436     15  6.18600010872
    16  41.7149999142     17  40.3410000801     18  40.0460000038
    19  30.2319998741     20  1.45700001717     21  5.35100007057
    22  15.4260001183     23  32.7389998436     24  42.7220001221
    25  10.4960000515     26  1.28299999237     27  7.33299994469
    28  22.7130000591     29  27.3199999332     30  34.4129998684
    31  1.71200013161     32  1.63499999046     33  7.06900000572
    34  24.1480000019     35  25.7660000324     36  20.8759999275
    37  1.29399991035     38  1.34899997711     39  5.71700000763
    40  35.9170000553     41  33.8300001621     42  8.69299983978
    43  1.35500001907     44  1.3180000782      45  8.44099998474
    46  26.1540000439     47  28.768999815      48  5.91400003433
    49  1.63499999046     50  1.30800008774     51  5.93499994278

The output looks surprisingly "wavy":

(plot of parsing time per 1% of records: http://i.minus.com/ibiIth8t2AFf4t.png)

I would like to emphasize that:

  • the machines running the code are otherwise idle (nothing else is running that could interfere with the parsing). I get similar results on a laptop running Win7 and on a virtual machine running Debian under ESX ("similar" in the sense that on both machines the parsing speed varies greatly).
  • the records are more or less the same: the XML file is the output of nmap -O, so the amount of information per record (a <host> element in my case) is roughly constant. In other words, there is nothing in the XML output that would make some records much harder to parse than others.

Is there anything in my code that could lead to this behavior? (I am using a streaming, SAX-style approach to cope with the size of the XML file; maybe there is something inherent in it that makes the parsing speed vary?)

Ultimately I want to understand whether "this is just how it is" and I should accept it, or whether I should change my code.

Thanks.

2 answers

This graph looks almost like the fingerprint of a caching system! :-) You read the file in chunks (of whatever size the ElementTree implementation uses), but the operating system reads ahead, assuming you will need the following chunks soon. So the next batch of records parses quickly because the data is already in memory, and so on. At some point, though, the read-ahead buffer runs nearly empty, and you have to wait for the next chunks to be read from disk, which inflates those measurements.
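
If that is the case, the same waves should show up even without any XML parsing. A minimal sketch that just times raw reads of the file, assuming the same test.xml and an arbitrary 16 MB block size:

    # Rough check of the read-ahead / page-cache hypothesis: time plain reads
    # of the file in fixed-size blocks, with no XML parsing involved at all.
    # If the timings show the same waves, the bottleneck is I/O, not ElementTree.
    import time

    CHUNK = 16 * 1024 * 1024  # arbitrary 16 MB block size

    f = open("test.xml", "rb")
    previous = time.time()
    nchunks = 0
    while True:
        data = f.read(CHUNK)
        if not data:
            break
        nchunks += 1
        now = time.time()
        print nchunks, now - previous
        previous = now
    f.close()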


I know this might be a dumb question, but have you tried the C implementation of the same XML library? Try importing

 from xml.etree.cElementTree import iterparse 

This should give you a significant speedup. If that is not enough, I would try the lxml XML parser: http://lxml.de/
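
As a sketch, the swap can be made with a fallback, and lxml exposes the same iterparse interface (the tag filter shown is an lxml-only convenience):

    # Prefer the C implementation of ElementTree, falling back to the pure
    # Python one if it is unavailable.
    try:
        from xml.etree.cElementTree import iterparse
    except ImportError:
        from xml.etree.ElementTree import iterparse

    # lxml offers the same streaming interface; it can additionally filter by
    # tag so the loop only ever sees <host> elements:
    #
    #     from lxml.etree import iterparse
    #     context = iterparse("test.xml", events=("end",), tag="host")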

Also, I'm not sure whether the XML file can be split up so that you could use multiprocessing to take advantage of multiple CPU cores and then merge the results back into a single data structure.
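
A rough, untested sketch of one way that could look: the parent process still walks the file with iterparse, but serializes each <host> element and hands it to a pool of workers; process_host is a hypothetical placeholder for whatever per-record work actually needs to happen:

    # Hypothetical sketch: one reader streams <host> elements out of the file,
    # a pool of worker processes does the per-record processing.
    from multiprocessing import Pool
    from xml.etree.cElementTree import iterparse, fromstring, tostring

    def process_host(xml_bytes):
        # Placeholder worker: parse one serialized <host> element and pull
        # out whatever data is actually needed.
        host = fromstring(xml_bytes)
        return host.get("starttime")

    def iter_hosts(path):
        context = iterparse(path, events=("start", "end"))
        context = iter(context)
        event, root = context.next()
        for event, elem in context:
            if event == "end" and elem.tag == "host":
                yield tostring(elem)
                root.clear()  # free memory, as in the original code

    if __name__ == "__main__":
        pool = Pool()
        for result in pool.imap(process_host, iter_hosts("test.xml"), chunksize=100):
            pass  # merge results into a single data structure here
        pool.close()
        pool.join()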


Source: https://habr.com/ru/post/955977/