I need to write a piece of code that parses a possibly large XML file into custom Python objects. The idea is roughly the following:
    from lxml import etree

    for e, tag in etree.iterparse(source, tag='Foo'):
        print(tag.xpath('bar/baz')[42])
The problem is that some of the documents have a namespace declaration and some do not. This means that in the code above neither tag='Foo' nor the xpath expression will match when a namespace is present.
At the moment I have settled for the ugly
    for e, tag in etree.iterparse(source):
        if tag.tag.endswith('Foo'):
            print(tag.xpath('*[local-name()="bar"]/*[local-name()="baz"]')[42])
but it's so horrible that I want to get it right, even though it works fine. (I suspect it is also slower.)
Is there a way to write sane code that accounts for both cases using iterparse? For now I can only think of catching the start-ns and end-ns events and updating a "state-keeping" variable, which I would have to pass to the function that is called within the loop to do the work. That function would then build the xpath queries accordingly. It makes sense, but I wonder if there is a simpler way around this.
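To make the state-keeping idea concrete, here is a rough sketch of what I have in mind. It is only an illustration under several assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without lxml, the helpers qualify and iter_baz_text are names I made up, and it only tracks a single default namespace (prefixed namespaces and namespaces redeclared deeper in the tree are ignored).

```python
import io
import xml.etree.ElementTree as ET  # assumption: stdlib stand-in for lxml.etree


def qualify(path, ns_uri):
    """Prefix every step of a simple slash-separated path with {ns_uri}."""
    if not ns_uri:
        return path
    return '/'.join('{%s}%s' % (ns_uri, step) for step in path.split('/'))


def iter_baz_text(source, index=0):
    """Yield the text of the index-th bar/baz under each Foo element."""
    default_ns = None  # the "state-keeping" variable
    for event, payload in ET.iterparse(source, events=('start-ns', 'end')):
        if event == 'start-ns':
            prefix, uri = payload
            if not prefix:  # only the default namespace is tracked here
                default_ns = uri
            continue
        elem = payload
        if elem.tag == qualify('Foo', default_ns):
            matches = elem.findall(qualify('bar/baz', default_ns))
            yield matches[index].text
            elem.clear()  # release the subtree we have finished with


if __name__ == '__main__':
    doc = b'<root xmlns="urn:example"><Foo><bar><baz>a</baz></bar></Foo></root>'
    print(list(iter_baz_text(io.BytesIO(doc))))
```

This works for the simple case, but threading default_ns through every query is exactly the kind of bookkeeping I was hoping to avoid.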
P.S. I have obviously tried searching, but did not find a solution that works both with and without a namespace. I would also accept a solution that strips the namespaces from the XML, but only if it does not store the entire tree in RAM in the process.
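For the namespace-stripping variant, something along these lines might do, though I have not verified it on large inputs. Again this is a sketch under assumptions: it uses the standard library's xml.etree.ElementTree rather than lxml so that it is self-contained, iter_stripped is a hypothetical helper name, and namespaced attributes are left untouched. It relies on "end" events firing for children before their parent, so by the time a Foo element is complete its whole subtree already carries plain tag names.

```python
import io
import xml.etree.ElementTree as ET  # assumption: stdlib stand-in for lxml.etree


def iter_stripped(source, tag):
    """Yield each finished `tag` element with namespaces removed from its subtree."""
    for _event, elem in ET.iterparse(source, events=('end',)):
        # '{uri}local' -> 'local'; tags without a namespace pass through unchanged.
        if elem.tag.startswith('{'):
            elem.tag = elem.tag.split('}', 1)[1]
        if elem.tag == tag:
            yield elem
            elem.clear()  # drop the children once the caller is done with them


if __name__ == '__main__':
    doc = b'<r xmlns="urn:example"><Foo><bar><baz>a</baz></bar></Foo></r>'
    for foo in iter_stripped(io.BytesIO(doc), 'Foo'):
        print(foo.find('bar/baz').text)
```

The plain 'bar/baz' path then works for both kinds of document. One caveat: the cleared Foo elements stay attached to the root as empty shells, so memory use still grows slowly with the number of matches.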