Parsing XML without a root element in Python

Question

I wanted to parse a rather large xml-like file that has no root element. File format:

<tag1> <tag2> </tag2> </tag1> <tag1> <tag3/> </tag1>

I tried using ElementTree, but it raised a "no root" error. Is there another Python library that can parse this file? Thank you in advance! :)

PS: I tried adding an extra tag to wrap the entire file and then parsing it with ElementTree. However, I would prefer a more efficient method that does not require modifying the source XML file.

python xml python-2.7 parsing elementtree

sgp May 27 '14 at 13:50

3 answers

wwii · Answer 1 · 2014-05-27T14:14:50+0000

lxml.html can parse fragments:

    from lxml import html

    s = """<tag1> <tag2> </tag2> </tag1> <tag1> <tag3/> </tag1>"""

    doc = html.fromstring(s)
    for thing in doc:
        print(thing)
        for other in thing:
            print(other)

    # Output:
    # <Element tag1 at 0x3411a80>
    # <Element tag2 at 0x3428990>
    # <Element tag1 at 0x3428930>
    # <Element tag3 at 0x3411a80>

Courtesy of this SO answer.

And if there is more than one level of nesting:

    def flatten(nested):
        """Recursively flatten nested elements.

        Yields individual elements.
        """
        for thing in nested:
            yield thing
            for other in flatten(thing):
                yield other

    doc = html.fromstring(s)
    for thing in flatten(doc):
        print(thing)

Similarly, lxml.etree.HTML will parse this. It adds html and body tags:

    from lxml import etree

    d = etree.HTML(s)
    for thing in d.iter():
        print(thing)

    # Output:
    # <Element html at 0x3233198>
    # <Element body at 0x322fcb0>
    # <Element tag1 at 0x3233260>
    # <Element tag2 at 0x32332b0>
    # <Element tag1 at 0x322fcb0>
    # <Element tag3 at 0x3233148>

falsetru · Answer 2 · 2014-05-27T14:16:42+0000

ElementTree.fromstringlist accepts any iterable that yields strings.

Using it with itertools.chain :

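    import itertools
    import xml.etree.ElementTree as ET
    # import xml.etree.cElementTree as ET

    with open('xml-like-file.xml') as f:
        # Wrap the file's lines in a synthetic root. The tags sit in lists
        # so that chain() yields each one as a single string, not per character.
        it = itertools.chain(['<root>'], f, ['</root>'])
        root = ET.fromstringlist(it)

    # Do something with `root`
    root.find('.//tag3')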
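To try the wrapping approach without a file on disk, here is a minimal self-contained sketch. It assumes Python 3 and uses io.StringIO as a stand-in for the open file. The helper name parse_rootless and its wrapper parameter are hypothetical, chosen for illustration; they are not part of ElementTree.

```python
import io
import itertools
import xml.etree.ElementTree as ET


def parse_rootless(lines, wrapper="root"):
    """Parse an iterable of text chunks that lacks a single root element.

    `parse_rootless` and `wrapper` are hypothetical names for this sketch.
    """
    # The synthetic opening and closing tags are wrapped in lists so that
    # chain() yields each of them as one whole string.
    pieces = itertools.chain(["<%s>" % wrapper], lines, ["</%s>" % wrapper])
    return ET.fromstringlist(pieces)


# io.StringIO stands in for the file object returned by open(...).
sample = io.StringIO("<tag1> <tag2> </tag2> </tag1>\n<tag1> <tag3/> </tag1>\n")
root = parse_rootless(sample)

# The original top-level elements become children of the synthetic root.
top_level_tags = [child.tag for child in root]
print(top_level_tags)  # prints ['tag1', 'tag1']
```

The source is never modified: the wrapper tags exist only in the stream handed to the parser.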

nettux443 · Answer 3 · 2014-05-27T14:10:38+0000

How about doing something like this instead of editing the file?

    import xml.etree.ElementTree as ET

    # open() works on Python 2 and 3; the file() builtin exists only in Python 2
    with open("xml-file.xml") as f:
        xml_object = ET.fromstringlist(["<root>", f.read(), "</root>"])
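Note that f.read() loads the whole file into memory, which may matter for a rather large file. One possible incremental variant is sketched below. It assumes Python 3.4 or later, where xml.etree.ElementTree.XMLPullParser is available (it is not in Python 2.7). The names iter_top_level, chunk_size and wrapper are hypothetical, and io.StringIO again stands in for the real file.

```python
import io
import xml.etree.ElementTree as ET


def iter_top_level(stream, chunk_size=65536, wrapper="root"):
    """Yield each top-level element of a rootless XML stream once complete.

    `iter_top_level`, `chunk_size` and `wrapper` are hypothetical names.
    """
    def chunks():
        # Synthetic opening tag, then the stream in pieces, then the close.
        yield "<%s>" % wrapper
        while True:
            data = stream.read(chunk_size)
            if not data:
                break
            yield data
        yield "</%s>" % wrapper

    parser = ET.XMLPullParser(events=("start", "end"))
    depth = 0
    root = None
    for chunk in chunks():
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
                if depth == 1:
                    root = elem  # the synthetic wrapper element
            else:
                depth -= 1
                if depth == 1:
                    # A direct child of the wrapper has just been closed.
                    yield elem
                    # Detach processed children so the tree does not grow.
                    root.clear()
    parser.close()


sample = io.StringIO("<tag1> <tag2> </tag2> </tag1> <tag1> <tag3/> </tag1>")
for element in iter_top_level(sample, chunk_size=8):
    print(element.tag, [child.tag for child in element])
```

Each yielded element is complete, so it can be processed and then discarded before the rest of the file is read. The small chunk_size in the usage lines is only there to exercise the chunk boundaries.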
