Parsing XML without a root element in Python

I want to parse a rather large XML-like file that has no root element. File format:

<tag1> <tag2> </tag2> </tag1> <tag1> <tag3/> </tag1> 

I tried using ElementTree, but it raised a "no root" error. Is there another Python library that can parse this file? Thank you in advance! :)

PS: I also tried adding an extra tag to wrap the entire file and then parsing it with ElementTree. However, I would prefer a more efficient method that does not require modifying the source XML file.

3 answers

lxml.html can parse fragments:

    from lxml import html

    s = """<tag1> <tag2> </tag2> </tag1> <tag1> <tag3/> </tag1>"""

    doc = html.fromstring(s)
    for thing in doc:
        print(thing)
        for other in thing:
            print(other)

    # Sample output (addresses vary between runs):
    # <Element tag1 at 0x3411a80>
    # <Element tag2 at 0x3428990>
    # <Element tag1 at 0x3428930>
    # <Element tag3 at 0x3411a80>

Based on an answer on Stack Overflow.

And if there is more than one level of nesting:

    def flatten(nested):
        """Recursively flatten nested elements.

        Yields individual elements.
        """
        for thing in nested:
            yield thing
            for other in flatten(thing):
                yield other

    doc = html.fromstring(s)
    for thing in flatten(doc):
        print(thing)
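
The recursive generator above walks the tree in document order, which is also what the built-in `iter()` method on elements does (the `lxml.etree.HTML` example in this answer uses it). The following is a minimal sketch using the standard library's ElementTree instead of lxml, so it runs without extra packages; the `<root>` wrapper is added only so that the stdlib parser accepts the sample and is an assumption of this example.

```python
import xml.etree.ElementTree as ET


def flatten(nested):
    """Recursively yield every descendant element in document order."""
    for thing in nested:
        yield thing
        yield from flatten(thing)


# The synthetic <root> wrapper exists only so the stdlib parser accepts the sample.
root = ET.fromstring(
    "<root><tag1> <tag2> </tag2> </tag1> <tag1> <tag3/> </tag1></root>"
)

by_hand = [el.tag for el in flatten(root)]
built_in = [el.tag for el in root.iter()]  # iter() yields root itself first

print(by_hand)
print(built_in[1:] == by_hand)
```

The one difference worth noting is that `iter()` starts with the element it is called on, while `flatten` yields descendants only.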

lxml.etree.HTML also parses this input. It adds html and body tags:

    from lxml import etree

    d = etree.HTML(s)
    for thing in d.iter():
        print(thing)

    # Sample output (addresses vary between runs):
    # <Element html at 0x3233198>
    # <Element body at 0x322fcb0>
    # <Element tag1 at 0x3233260>
    # <Element tag2 at 0x32332b0>
    # <Element tag1 at 0x322fcb0>
    # <Element tag3 at 0x3233148>

ElementTree.fromstringlist accepts any iterable that yields strings.

Use it with itertools.chain to wrap the file's lines in a root element without touching the file itself:

    import itertools
    import xml.etree.ElementTree as ET
    # import xml.etree.cElementTree as ET  # older Pythons only; removed in Python 3.9

    with open('xml-like-file.xml') as f:
        it = itertools.chain('<root>', f, '</root>')
        root = ET.fromstringlist(it)

    # Do something with `root`
    root.find('.//tag3')
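
Since the question mentions a rather large file, it may help to handle one top-level element at a time instead of building the whole tree first. The following is a minimal sketch of that idea, assuming Python 3.4+ and the standard library's `XMLPullParser`; the function name `iter_top_level` and the `wrapper` parameter are hypothetical names chosen for this example.

```python
import io
import itertools
import xml.etree.ElementTree as ET


def iter_top_level(lines, wrapper="root"):
    """Yield each top-level element of a rootless XML stream, one at a time.

    `lines` is any iterable of str chunks, e.g. a file opened in text mode.
    """
    parser = ET.XMLPullParser(events=("start", "end"))
    chunks = itertools.chain(["<%s>" % wrapper], lines, ["</%s>" % wrapper])
    depth = 0
    root = None
    for chunk in chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                if root is None:
                    root = elem  # the synthetic wrapper element
                depth += 1
            else:
                depth -= 1
                if depth == 1:  # a direct child of the wrapper just closed
                    yield elem
                    root.remove(elem)  # drop it so the tree does not grow
    parser.close()


if __name__ == "__main__":
    sample = io.StringIO("<tag1> <tag2> </tag2> </tag1>\n<tag1> <tag3/> </tag1>\n")
    for elem in iter_top_level(sample):
        print(elem.tag, [child.tag for child in elem])
```

Each yielded element is detached from the wrapper once the caller asks for the next one, so memory use stays roughly proportional to the largest top-level element rather than to the whole file.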

How about doing something like this instead of editing the file? Note that f.read() loads the entire file into memory:

    import xml.etree.ElementTree as ET

    with open("xml-file.xml") as f:
        xml_object = ET.fromstringlist(["<root>", f.read(), "</root>"])
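
If reading the whole file in one call is a concern, the same wrapping idea can be applied while feeding the parser in fixed-size chunks. This is a minimal sketch, assuming a UTF-8 text file without an XML declaration; `parse_rootless`, `wrapper`, and `chunk_size` are hypothetical names introduced for this example.

```python
import xml.etree.ElementTree as ET


def parse_rootless(path, wrapper="root", chunk_size=64 * 1024):
    """Parse a rootless XML file by feeding it between synthetic root tags."""
    parser = ET.XMLParser()
    parser.feed("<%s>" % wrapper)
    with open(path, encoding="utf-8") as f:
        # iter() with a sentinel keeps calling f.read() until it returns "".
        for chunk in iter(lambda: f.read(chunk_size), ""):
            parser.feed(chunk)
    parser.feed("</%s>" % wrapper)
    return parser.close()  # returns the wrapper element
```

The result is still a complete in-memory tree, so this only avoids holding the raw text and the tree at the same time; usage would look like `root = parse_rootless("xml-file.xml")` followed by `root.find('.//tag3')`.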

Source: https://habr.com/ru/post/969909/

