Using iterparse and then xpath on documents with inconsistent namespace declarations

I need to put together a piece of code that parses a possibly large XML file into custom Python objects. The idea is roughly the following:

    from lxml import etree

    # source: a file name or file-like object
    for e, tag in etree.iterparse(source, tag='Foo'):
        print(tag.xpath('bar/baz')[42])  # there is actually a function call here

The problem is that some of the documents have a namespace declaration and some do not. This means that in the code above neither tag='Foo' nor the xpath expression will work for the namespaced documents.

At the moment I have settled for the ugly

    for e, tag in etree.iterparse(source):
        if tag.tag.endswith('Foo'):
            # local-name() matches the element name regardless of namespace
            print(tag.xpath('*[local-name()="bar"]/*[local-name()="baz"]')[42])

but it is so ugly that I want to do it properly, even though it works fine. (I think it must also be slower.)

Is there a way to write reasonable code that handles both cases using iterparse? For the moment, I can only think of catching the start-ns and end-ns events and updating a “state-keeping” variable, which I would have to pass to the function that is called in the loop to do the work. That function would then build the xpath queries accordingly. It makes sense, but I wonder if there is an easier way around this.
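For illustration, the start-ns idea could be sketched roughly as follows. This is only a sketch under some assumptions: it uses the standard library's xml.etree.ElementTree so that it runs without lxml (lxml's iterparse accepts the same event names), it only tracks a default namespace (xmlns="..."), it returns the first bar/baz match rather than index 42, and the function name baz_texts_via_start_ns is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def baz_texts_via_start_ns(source, wanted="Foo"):
    """Yield the text of the first bar/baz under each `wanted` element."""
    default_ns = ""  # the "state-keeping" variable: default namespace URI, if any
    for event, payload in ET.iterparse(source, events=("start-ns", "end")):
        if event == "start-ns":
            # For start-ns events the payload is a (prefix, uri) tuple.
            prefix, uri = payload
            if not prefix:  # an empty prefix means xmlns="..."
                default_ns = uri
            continue
        # For end events the payload is the finished element.
        q = "{%s}" % default_ns if default_ns else ""
        if payload.tag == q + wanted:
            baz = payload.find("%sbar/%sbaz" % (q, q))
            if baz is not None:
                yield baz.text
            payload.clear()  # drop the subtree once it has been processed


# Hypothetical sample documents: one without and one with a namespace.
plain = b"<root><Foo><bar><baz>1</baz></bar></Foo></root>"
spaced = b'<root xmlns="urn:example"><Foo><bar><baz>1</baz></bar></Foo></root>'

plain_result = list(baz_texts_via_start_ns(io.BytesIO(plain)))
spaced_result = list(baz_texts_via_start_ns(io.BytesIO(spaced)))
```

The qualified names are rebuilt for every element here; a real implementation would probably cache them and also handle end-ns so that nested declarations are scoped correctly.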

PS: I obviously tried searching, but did not find a solution that works both with and without a namespace. I would also accept a solution that strips the namespaces from the XML, but only if it does not store the entire tree in RAM in the process.

1 answer

All elements have an .nsmap mapping attribute; use it to discover your namespace and branch accordingly.
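A minimal sketch of what that branching might look like with lxml. It assumes that the namespaced documents declare a default namespace (which .nsmap lists under the key None); the prefix n used in the xpath expression and the function name baz_texts_via_nsmap are chosen only for this example.

```python
import io
from lxml import etree


def baz_texts_via_nsmap(source, wanted="Foo"):
    """Yield the text of every bar/baz under each `wanted` element."""
    for _event, elem in etree.iterparse(source, events=("end",)):
        # Compare the local name only, so both kinds of document match.
        if etree.QName(elem).localname != wanted:
            continue
        uri = elem.nsmap.get(None)  # default namespace URI, or None
        if uri:
            # Namespaced document: bind a throwaway prefix for the query.
            found = elem.xpath("n:bar/n:baz", namespaces={"n": uri})
        else:
            # No namespace declared: the plain expression works.
            found = elem.xpath("bar/baz")
        for baz in found:
            yield baz.text
        elem.clear()  # free the processed subtree


# Hypothetical sample documents: one without and one with a namespace.
plain = b"<root><Foo><bar><baz>1</baz></bar></Foo></root>"
spaced = b'<root xmlns="urn:example"><Foo><bar><baz>1</baz></bar></Foo></root>'

plain_result = list(baz_texts_via_nsmap(io.BytesIO(plain)))
spaced_result = list(baz_texts_via_nsmap(io.BytesIO(spaced)))
```

If the elements of interest use a prefixed namespace instead of the default one, etree.QName(elem).namespace could be used in place of the .nsmap lookup.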


Source: https://habr.com/ru/post/1433153/
