Pandas read_xml () method test methods

Interestingly, pandas I / O tools do not support the read_xml() method and the read_xml() equivalent. However, read_json proves that tree structures can be implemented to import data and read_html for markup formats.

Now, if the Pandas team really considers such a read_xml method for a future version of Pandas, what implementation will they use: parsing with built-in xml.etree.ElementTree with its iterfind() or iterparse() functions or a third-party module, lxml with its XPath 1.0 methods and XSLT 1.0?

Below are my test runs for four types of methods on simple flat element-oriented XML input. All of them are configured for generalized parsing for any children of the second level root, and each method should produce the same frame of pandas data. All but the last call pd.Dataframe() in the dictionary list. The XSLT method converts XML to CSV to cast StringIO() to pd.read_csv() .

Question (in several parts)

  • EXECUTION: How do you explain the slower iterparse , often recommended for large files, since the file is parsed iteratively? This is partly due to checking if logic?

  • MEMORY: Does CPU memory correlate with timings in I / O calls? XSLT and XPath 1.0 tend to not scale well for large XML documents, as the entire file must be read in memory for analysis.

  • STRATEGY: Is a dictionary list an optimal strategy for calling Dataframe() ? Check out these interesting answers: the generator version and the custom version of itawalk. Both upcast lists per data frame.

Data entry (current Stackoverflow users by year , among which are our friends from pandas)

 <?xml version="1.0" encoding="utf-8"?> <stackoverflow> <topusers> <user>Gordon Linoff</user> <link>http://www.stackoverflow.com//users/1144035/gordon-linoff</link> <location>New York, United States</location> <year_rep>5,985</year_rep> <total_rep>499,408</total_rep> <tag1>sql</tag1> <tag2>sql-server</tag2> <tag3>mysql</tag3> </topusers> <topusers> <user>Günter Zöchbauer</user> <link>http://www.stackoverflow.com//users/217408/g%c3%bcnter-z%c3%b6chbauer</link> <location>Linz, Austria</location> <year_rep>5,835</year_rep> <total_rep>154,439</total_rep> <tag1>angular2</tag1> <tag2>typescript</tag2> <tag3>javascript</tag3> </topusers> <topusers> <user>jezrael</user> <link>http://www.stackoverflow.com//users/2901002/jezrael</link> <location>Bratislava, Slovakia</location> <year_rep>5,740</year_rep> <total_rep>83,237</total_rep> <tag1>pandas</tag1> <tag2>python</tag2> <tag3>dataframe</tag3> </topusers> <topusers> <user>VonC</user> <link>http://www.stackoverflow.com//users/6309/vonc</link> <location>France</location> <year_rep>5,577</year_rep> <total_rep>651,397</total_rep> <tag1>git</tag1> <tag2>github</tag2> <tag3>docker</tag3> </topusers> <topusers> <user>Martijn Pieters</user> <link>http://www.stackoverflow.com//users/100297/martijn-pieters</link> <location>Cambridge, United Kingdom</location> <year_rep>5,337</year_rep> <total_rep>525,176</total_rep> <tag1>python</tag1> <tag2>python-3.x</tag2> <tag3>python-2.7</tag3> </topusers> <topusers> <user>TJ Crowder</user> <link>http://www.stackoverflow.com//users/157247/tj-crowder</link> <location>United Kingdom</location> <year_rep>5,258</year_rep> <total_rep>508,310</total_rep> <tag1>javascript</tag1> <tag2>jquery</tag2> <tag3>java</tag3> </topusers> <topusers> <user>akrun</user> <link>http://www.stackoverflow.com//users/3732271/akrun</link> <location></location> <year_rep>5,188</year_rep> <total_rep>229,553</total_rep> <tag1>r</tag1> <tag2>dplyr</tag2> <tag3>dataframe</tag3> </topusers> <topusers> <user>Wiktor Stribi?ew</user> <link>http://www.stackoverflow.com//users/3832970/wiktor-stribi%c5%bcew</link> <location>Warsaw, Poland</location> <year_rep>4,948</year_rep> <total_rep>158,134</total_rep> <tag1>regex</tag1> <tag2>javascript</tag2> <tag3>c#</tag3> </topusers> <topusers> <user>Darin Dimitrov</user> <link>http://www.stackoverflow.com//users/29407/darin-dimitrov</link> <location>Sofia, Bulgaria</location> <year_rep>4,936</year_rep> <total_rep>709,683</total_rep> <tag1>c#</tag1> <tag2>asp.net-mvc</tag2> <tag3>asp.net-mvc-3</tag3> </topusers> <topusers> <user>Eric Duminil</user> <link>http://www.stackoverflow.com//users/6419007/eric-duminil</link> <location></location> <year_rep>4,854</year_rep> <total_rep>12,557</total_rep> <tag1>ruby</tag1> <tag2>ruby-on-rails</tag2> <tag3>arrays</tag3> </topusers> <topusers> <user>alecxe</user> <link>http://www.stackoverflow.com//users/771848/alecxe</link> <location>New York, United States</location> <year_rep>4,723</year_rep> <total_rep>233,368</total_rep> <tag1>python</tag1> <tag2>selenium</tag2> <tag3>protractor</tag3> </topusers> <topusers> <user>Jean-François Fabre</user> <link>http://www.stackoverflow.com//users/6451573/jean-fran%c3%a7ois-fabre</link> <location>Toulouse, France</location> <year_rep>4,526</year_rep> <total_rep>30,027</total_rep> <tag1>python</tag1> <tag2>python-3.x</tag2> <tag3>python-2.7</tag3> </topusers> <topusers> <user>piRSquared</user> <link>http://www.stackoverflow.com//users/2336654/pirsquared</link> <location>Bellevue, WA, United States</location> <year_rep>4,482</year_rep> <total_rep>41,183</total_rep> <tag1>pandas</tag1> <tag2>python</tag2> <tag3>dataframe</tag3> </topusers> <topusers> <user>CommonsWare</user> <link>http://www.stackoverflow.com//users/115145/commonsware</link> <location>Who Wants to Know?</location> <year_rep>4,475</year_rep> <total_rep>616,135</total_rep> <tag1>android</tag1> <tag2>java</tag2> <tag3>android-intent</tag3> </topusers> <topusers> <user>Quentin</user> <link>http://www.stackoverflow.com//users/19068/quentin</link> <location>United Kingdom</location> <year_rep>4,464</year_rep> <total_rep>509,365</total_rep> <tag1>javascript</tag1> <tag2>html</tag2> <tag3>css</tag3> </topusers> <topusers> <user>Jon Skeet</user> <link>http://www.stackoverflow.com//users/22656/jon-skeet</link> <location>Reading, United Kingdom</location> <year_rep>4,348</year_rep> <total_rep>921,690</total_rep> <tag1>c#</tag1> <tag2>java</tag2> <tag3>.net</tag3> </topusers> <topusers> <user>Felix Kling</user> <link>http://www.stackoverflow.com//users/218196/felix-kling</link> <location>Sunnyvale, CA</location> <year_rep>4,324</year_rep> <total_rep>411,535</total_rep> <tag1>javascript</tag1> <tag2>jquery</tag2> <tag3>asynchronous</tag3> </topusers> <topusers> <user>matt</user> <link>http://www.stackoverflow.com//users/341994/matt</link> <location></location> <year_rep>4,313</year_rep> <total_rep>220,515</total_rep> <tag1>swift</tag1> <tag2>ios</tag2> <tag3>xcode</tag3> </topusers> <topusers> <user>Psidom</user> <link>http://www.stackoverflow.com//users/4983450/psidom</link> <location>Atlanta, GA, United States</location> <year_rep>4,236</year_rep> <total_rep>36,950</total_rep> <tag1>python</tag1> <tag2>pandas</tag2> <tag3>r</tag3> </topusers> <topusers> <user>Martin R</user> <link>http://www.stackoverflow.com//users/1187415/martin-r</link> <location>Germany</location> <year_rep>4,195</year_rep> <total_rep>269,380</total_rep> <tag1>swift</tag1> <tag2>ios</tag2> <tag3>swift3</tag3> </topusers> <topusers> <user>Barmar</user> <link>http://www.stackoverflow.com//users/1491895/barmar</link> <location>Arlington, MA</location> <year_rep>4,179</year_rep> <total_rep>289,989</total_rep> <tag1>javascript</tag1> <tag2>php</tag2> <tag3>jquery</tag3> </topusers> <topusers> <user>Alexey Mezenin</user> <link>http://www.stackoverflow.com//users/1227923/alexey-mezenin</link> <location>??????</location> <year_rep>4,142</year_rep> <total_rep>31,602</total_rep> <tag1>laravel</tag1> <tag2>php</tag2> <tag3>laravel-5.3</tag3> </topusers> <topusers> <user>BalusC</user> <link>http://www.stackoverflow.com//users/157882/balusc</link> <location>Amsterdam, Netherlands</location> <year_rep>4,046</year_rep> <total_rep>703,046</total_rep> <tag1>java</tag1> <tag2>jsf</tag2> <tag3>servlets</tag3> </topusers> <topusers> <user>GurV</user> <link>http://www.stackoverflow.com//users/6348498/gurv</link> <location></location> <year_rep>4,016</year_rep> <total_rep>7,932</total_rep> <tag1>sql</tag1> <tag2>mysql</tag2> <tag3>sql-server</tag3> </topusers> <topusers> <user>Nina Scholz</user> <link>http://www.stackoverflow.com//users/1447675/nina-scholz</link> <location>Berlin, Deutschland</location> <year_rep>3,950</year_rep> <total_rep>61,135</total_rep> <tag1>javascript</tag1> <tag2>arrays</tag2> <tag3>object</tag3> </topusers> <topusers> <user>JB Nizet</user> <link>http://www.stackoverflow.com//users/571407/jb-nizet</link> <location>Saint-Etienne, France</location> <year_rep>3,923</year_rep> <total_rep>418,780</total_rep> <tag1>java</tag1> <tag2>hibernate</tag2> <tag3>java-8</tag3> </topusers> <topusers> <user>Frank van Puffelen</user> <link>http://www.stackoverflow.com//users/209103/frank-van-puffelen</link> <location>San Francisco, CA</location> <year_rep>3,920</year_rep> <total_rep>86,520</total_rep> <tag1>firebase</tag1> <tag2>firebase-database</tag2> <tag3>android</tag3> </topusers> <topusers> <user>dasblinkenlight</user> <link>http://www.stackoverflow.com//users/335858/dasblinkenlight</link> <location>United States</location> <year_rep>3,886</year_rep> <total_rep>475,813</total_rep> <tag1>c#</tag1> <tag2>java</tag2> <tag3>c++</tag3> </topusers> <topusers> <user>Tim Biegeleisen</user> <link>http://www.stackoverflow.com//users/1863229/tim-biegeleisen</link> <location>Singapore</location> <year_rep>3,814</year_rep> <total_rep>77,211</total_rep> <tag1>sql</tag1> <tag2>mysql</tag2> <tag3>java</tag3> </topusers> <topusers> <user>Greg Hewgill</user> <link>http://www.stackoverflow.com//users/893/greg-hewgill</link> <location>Christchurch, New Zealand</location> <year_rep>3,796</year_rep> <total_rep>529,137</total_rep> <tag1>git</tag1> <tag2>python</tag2> <tag3>git-pull</tag3> </topusers> <topusers> <user>unutbu</user> <link>http://www.stackoverflow.com//users/190597/unutbu</link> <location></location> <year_rep>3,735</year_rep> <total_rep>401,595</total_rep> <tag1>python</tag1> <tag2>pandas</tag2> <tag3>numpy</tag3> </topusers> <topusers> <user>Hans Passant</user> <link>http://www.stackoverflow.com//users/17034/hans-passant</link> <location>Madison, WI</location> <year_rep>3,688</year_rep> <total_rep>672,118</total_rep> <tag1>c#</tag1> <tag2>.net</tag2> <tag3>winforms</tag3> </topusers> <topusers> <user>Jonathan Leffler</user> <link>http://www.stackoverflow.com//users/15168/jonathan-leffler</link> <location>California, USA</location> <year_rep>3,649</year_rep> <total_rep>455,157</total_rep> <tag1>c</tag1> <tag2>bash</tag2> <tag3>unix</tag3> </topusers> <topusers> <user>paxdiablo</user> <link>http://www.stackoverflow.com//users/14860/paxdiablo</link> <location></location> <year_rep>3,636</year_rep> <total_rep>507,043</total_rep> <tag1>c</tag1> <tag2>c++</tag2> <tag3>bash</tag3> </topusers> <topusers> <user>Pranav C Balan</user> <link>http://www.stackoverflow.com//users/3037257/pranav-c-balan</link> <location>Ramanthali, Kannur, Kerala, India</location> <year_rep>3,604</year_rep> <total_rep>64,476</total_rep> <tag1>javascript</tag1> <tag2>jquery</tag2> <tag3>html</tag3> </topusers> <topusers> <user>Suragch</user> <link>http://www.stackoverflow.com//users/3681880/suragch</link> <location>Hohhot, China</location> <year_rep>3,580</year_rep> <total_rep>71,032</total_rep> <tag1>swift</tag1> <tag2>ios</tag2> <tag3>android</tag3> </topusers> </stackoverflow> 

Python Methods

 import xml.etree.ElementTree as et import pandas as pd from io import StringIO from lxml import etree as lxet def read_xml_iterfind(): tree = et.parse('Input.xml') data = [] inner = {} for el in tree.iterfind('./*'): for i in el.iterfind('*'): inner[i.tag] = i.text data.append(inner) inner = {} df = pd.DataFrame(data) def read_xml_iterparse(): data = [] inner = {} i = 1 for (ev, el) in et.iterparse(path): if i <= 2: first_tag = el.tag if el.tag == first_tag and len(inner) != 0: data.append(inner) inner = {} if el.text is not None and len(el.text.strip()) > 0: inner[el.tag] = el.text i += 1 df = pd.DataFrame(data) def read_xml_lxml_xpath(): tree = lxet.parse('Input.xml') data = [] inner = {} for el in tree.xpath('/*/*'): for i in el: inner[i.tag] = i.text data.append(inner) inner = {} df = pd.DataFrame(data) def read_xml_lxml_xsl(): xml = lxet.parse('Input.xml') xslstr = ''' <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"> <xsl:output version="1.0" encoding="UTF-8" indent="yes" method="text"/> <xsl:strip-space elements="*"/> <!-- HEADERS --> <xsl:template match = "/*"> <xsl:for-each select="*[1]/*"> <xsl:value-of select="local-name()" /> <xsl:choose> <xsl:when test="position() != last()"> <xsl:text>,</xsl:text> </xsl:when> <xsl:otherwise> <xsl:text>&#xa;</xsl:text> </xsl:otherwise> </xsl:choose> </xsl:for-each> <xsl:apply-templates/> </xsl:template> <!-- DATA ROWS (COMMA-SEPARATED) --> <xsl:template match="/*/*" priority="2"> <xsl:for-each select="*"> <xsl:if test="position() = 1"> <xsl:text>&quot;</xsl:text> </xsl:if> <xsl:value-of select="." /> <xsl:choose> <xsl:when test="position() != last()"> <xsl:text>&quot;,&quot;</xsl:text> </xsl:when> <xsl:otherwise> <xsl:text>&quot;&#xa;</xsl:text> </xsl:otherwise> </xsl:choose> </xsl:for-each> </xsl:template> </xsl:transform> ''' xsl = lxet.fromstring(xslstr) transform = lxet.XSLT(xsl) newdom = transform(xml) df = pd.read_csv(StringIO(str(newdom))) 

Time (with current XML and XML with 25 times the number of children (i.e. 900 user records)

 # SHORTER FILE python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_iterfind()' 100 loops, best of 3: 3.87 msec per loop python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_iterparse()' 100 loops, best of 3: 5.5 msec per loop python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_lxml_xpath()' 100 loops, best of 3: 3.86 msec per loop python -mtimeit -s'import readxml_test_runs as test' 'test.read_xml_lxml_xsl()' 100 loops, best of 3: 5.68 msec per loop # LARGER FILE python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_iterfind()' 100 loops, best of 3: 36 msec per loop python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_iterparse()' 100 loops, best of 3: 78.9 msec per loop python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_lxml_xpath()' 100 loops, best of 3: 32.7 msec per loop python -mtimeit -n'100' -s'import readxml_test_runs as test' 'test.read_xml_lxml_xsl()' 100 loops, best of 3: 51.4 msec per loop 
+34
source share

Source: https://habr.com/ru/post/1263130/


All Articles