Use iterparse and subsequently xpath for documents with inconsistent namespace declarations

Question


I need to write a piece of code that parses a potentially large XML file into custom Python objects. The idea is roughly the following:

    from lxml import etree

    for e, tag in etree.iterparse(source, tag='Foo'):
        print(tag.xpath('bar/baz')[42])  # there's actually a function call here

The problem is that some of the documents have a namespace declaration and some do not. This means that in the code above both tag='Foo' and the xpath expression fail to match whenever a namespace is declared.

For the time being I've been putting up with the ugly

    for e, tag in etree.iterparse(source):
        if tag.tag.endswith('Foo'):
            print(tag.xpath('*[local-name()="bar"]/*[local-name()="baz"]')[42])

but it is so awful that I want to get it right, even though it works fine. (I suspect it is also slower.)

Is there a sane way to write code that handles both cases using iterparse? For the moment, I can only think of catching the start-ns and end-ns events and updating a "state-keeping" variable, which I would then have to pass to the function that is called within the loop to do the work. That function would build the xpath queries accordingly. It makes sense, but I wonder if there is a simpler way around this.

P.S. I have obviously tried searching, but did not find a solution that works both with and without a namespace. I would also accept a solution that strips the namespaces from the XML, but only if it does not store the entire tree in RAM in the process.


python xml-parsing lxml xml-namespaces iterparse

Lev Levitsky Sep 08 '12 at 16:46


1 answer

Martijn Pieters · Accepted Answer · 2012-09-08T17:24:23+0000

All elements have a .nsmap mapping attribute; use it to detect your namespace and branch accordingly.
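A minimal sketch of that idea, under a few assumptions that are not in the question: the namespaced documents declare a default namespace (xmlns="..."), so it shows up under the None key of .nsmap; the helper name iter_baz_texts and the prefix "n" are made up for illustration; and the Foo elements can be cleared once they have been handled.

```python
from io import BytesIO

from lxml import etree


def iter_baz_texts(source):
    """Yield, for each Foo element, the list of texts of its bar/baz descendants.

    Hypothetical helper: works whether or not the document declares a
    default namespace.
    """
    for _event, elem in etree.iterparse(source, events=("end",)):
        # Compare on the local name so that both Foo and {uri}Foo match.
        if etree.QName(elem).localname != "Foo":
            continue

        # .nsmap maps prefixes to URIs; the default namespace uses the key None.
        default_ns = elem.nsmap.get(None)
        if default_ns:
            # XPath has no notion of a default namespace, so bind it to a prefix.
            found = elem.xpath("n:bar/n:baz", namespaces={"n": default_ns})
        else:
            found = elem.xpath("bar/baz")

        yield [baz.text for baz in found]

        # Drop the processed subtree so memory use stays bounded.
        elem.clear()


# Usage with two tiny in-memory documents, one per case.
plain = b"<root><Foo><bar><baz>a</baz><baz>b</baz></bar></Foo></root>"
spaced = (b'<root xmlns="urn:example">'
          b"<Foo><bar><baz>a</baz><baz>b</baz></bar></Foo></root>")

for doc in (plain, spaced):
    for texts in iter_baz_texts(BytesIO(doc)):
        print(texts)
```

If the documents bind the namespace to a prefix instead of declaring a default one, None will not be a key of .nsmap and the branch would need to look at etree.QName(elem).namespace instead.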
