Xmlns namespace breaks lxml

Question

I am trying to open an XML file and get values from specific tags. I have done this many times before, but this particular file is giving me problems. Here is the relevant section of the XML file:

    <?xml version='1.0' encoding='UTF-8'?>
    <package xmlns="http://apple.com/itunes/importer" version="film4.7">
        <provider>filmgroup</provider>
        <language>en-GB</language>
        <actor name="John Smith" display="Doe John"></actor>
    </package>

And here is an example of my Python code:

    metadata = '/Users/mylaptop/Desktop/Python/metadata.xml'
    from lxml import etree
    parser = etree.XMLParser(remove_blank_text=True)
    open(metadata)
    tree = etree.parse(metadata, parser)
    root = tree.getroot()

    for element in root.iter(tag='provider'):
        providerValue = tree.find('//provider')
        providerValue = providerValue.text
        print providerValue

    tree.write('/Users/mylaptop/Desktop/Python/metadataDone.xml', pretty_print = True, xml_declaration = True, encoding = 'UTF-8')

When I run this, it cannot find the provider tag or its value. If I remove xmlns="http://apple.com/itunes/importer", everything works as expected. My question is: how can I remove this namespace, since I have no use for it, so that I can get the tag values I need using lxml?

+6

python namespaces lxml xml-namespaces

speedyrazor Aug 05 '13 at 21:12

2 answers

My suggestion is not to ignore the namespace, but to take it into account instead. I wrote some related functions (copied with minor modifications) for my work in the django-quickbooks library. Using these functions, you can do this:

 providers = getels(root, 'provider', ns='http://apple.com/itunes/importer')

These functions are:

    class TagNotFound(Exception):
        """Raised by getel() when no matching tag exists.

        Not shown in the original answer; a minimal stand-in so the snippet runs.
        """


    def get_tag_with_ns(tag_name, ns):
        return '{%s}%s' % (ns, tag_name)


    def getel(elt, tag_name, ns=None):
        """ Gets the first tag that matches the specified tag_name taking into
        account the QB namespace.

        :param ns: The namespace to use if not using the default one for
        django-quickbooks.
        :type ns: string
        """
        res = elt.find(get_tag_with_ns(tag_name, ns=ns))
        if res is None:
            raise TagNotFound('Could not find tag by name "%s"' % tag_name)
        return res


    def getels(elt, *path, **kwargs):
        """ Gets the first set of elements found at the specified path.

        Example:
            >>> xml = (
            "<root>" +
                "<item>" +
                    "<id>1</id>" +
                "</item>" +
                "<item>" +
                    "<id>2</id>" +
                "</item>" +
            "</root>")
            >>> el = etree.fromstring(xml)
            >>> getels(el, 'root', 'item', ns='correct/namespace')
            [<Element item>, <Element item>]
        """
        ns = kwargs['ns']

        # Walk down every path component except the last one ...
        i=-1
        for i in range(len(path)-1):
            elt = getel(elt, path[i], ns=ns)
        # ... then return all matches for the last component.
        tag_name = path[i+1]
        return elt.findall(get_tag_with_ns(tag_name, ns=ns))
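As a side note, the '{namespace}tag' string that get_tag_with_ns builds by hand can also be produced with the QName class. The sketch below is only an illustration: the helper name qualified is hypothetical, the XML is the sample from the question without its declaration, and the import falls back to the standard library's ElementTree (which offers the same QName and findall calls) if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback: the standard library exposes the same calls used below.
    import xml.etree.ElementTree as etree

NS = 'http://apple.com/itunes/importer'

# Sample document from the question, without the XML declaration.
XML = (
    '<package xmlns="http://apple.com/itunes/importer" version="film4.7">'
    '<provider>filmgroup</provider>'
    '<language>en-GB</language>'
    '<actor name="John Smith" display="Doe John"></actor>'
    '</package>'
)


def qualified(tag_name, ns=NS):
    """Hypothetical helper: return tag_name in '{namespace}tag' form."""
    return etree.QName(ns, tag_name).text


root = etree.fromstring(XML)
provider_tag = qualified('provider')

# findall() searches the direct children of root for the qualified tag.
provider_values = [p.text for p in root.findall(provider_tag)]
print(provider_tag, provider_values)
```

Whether this is preferable to plain string formatting is a matter of taste; both produce the same tag string.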

+1

Josh Aug 05 '13 at 21:26

unutbu · Accepted Answer · 2013-08-05T21:22:59+0000

The provider tag is located in the http://apple.com/itunes/importer namespace, so you need to either use the full name

 {http://apple.com/itunes/importer}provider

or use one of the lxml methods that has a namespaces parameter, for example root.xpath. Then you can specify it with a namespace prefix (for example, ns:provider):

    from lxml import etree
    parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.parse(metadata, parser)
    root = tree.getroot()

    namespaces = {'ns':'http://apple.com/itunes/importer'}
    items = iter(root.xpath('//ns:provider/text()|//ns:actor/@name',
                            namespaces=namespaces))
    for provider, actor in zip(*[items]*2):
        print(provider, actor)

gives

 ('filmgroup', 'John Smith')

Note that the XPath used above assumes that the <provider> and <actor> elements always appear in alternation. If that is not the case, there are of course ways to handle it, but the code becomes a bit more verbose:

    for package in root.xpath('//ns:package', namespaces=namespaces):
        for provider in package.xpath('ns:provider', namespaces=namespaces):
            providerValue = provider.text
            print(providerValue)
        for actor in package.xpath('ns:actor', namespaces=namespaces):
            print(actor.attrib['name'])
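Both options described in this answer, the fully qualified name and a prefix mapping, can also be used with find(), findtext() and findall() when full XPath is not needed. The following is a minimal sketch rather than a definitive recipe: it parses the question's sample XML from a string (without the XML declaration) instead of a file, the variable names are illustrative, and the import falls back to the standard library's ElementTree, which accepts the same calls, if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback: these calls behave the same in the standard library.
    import xml.etree.ElementTree as etree

NS = 'http://apple.com/itunes/importer'

# Sample document from the question, without the XML declaration.
XML = (
    '<package xmlns="http://apple.com/itunes/importer" version="film4.7">'
    '<provider>filmgroup</provider>'
    '<language>en-GB</language>'
    '<actor name="John Smith" display="Doe John"></actor>'
    '</package>'
)
root = etree.fromstring(XML)

# Option 1: the fully qualified name in '{namespace}tag' form.
provider_text = root.find('{%s}provider' % NS).text

# Option 2: a prefix mapping passed through the namespaces parameter.
# The prefix 'ns' is arbitrary; only the URI has to match the document.
prefixes = {'ns': NS}
language = root.findtext('ns:language', namespaces=prefixes)
actor_names = [a.get('name')
               for a in root.findall('ns:actor', namespaces=prefixes)]

# For comparison: the unqualified name matches nothing, which is the
# behaviour described in the question.
unqualified = root.find('provider')

print(provider_text, language, actor_names, unqualified)
```

Because each element type is queried separately here, the result does not depend on <provider> and <actor> appearing in any particular order.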
