How to validate XML with multiple namespaces in Python?

Question

I am trying to write some unit tests in Python 2.7 to validate some extensions I made to the OAI-PMH schema: http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd

The problem I am facing involves multiple nested namespaces, caused by this definition in the above XSD:

    <complexType name="metadataType">
      <annotation>
        <documentation>Metadata must be expressed in XML that complies
          with another XML Schema (namespace=#other). Metadata must be
          explicitly qualified in the response.</documentation>
      </annotation>
      <sequence>
        <any namespace="##other" processContents="strict"/>
      </sequence>
    </complexType>

Here is the code snippet I'm using:

    from lxml import etree  # the code below uses the bare name `etree`
    import urllib2

    query = "http://localhost:8080/OAI-PMH?verb=GetRecord&by_doc_ID=false&metadataPrefix=nsdl_dc&identifier=http://www.purplemath.com/modules/ratio.htm"

    # Load and compile the OAI-PMH schema
    schema_file = open("../schemas/OAI/2.0/OAI-PMH.xsd", "r")
    schema_doc = etree.parse(schema_file)
    oaischema = etree.XMLSchema(schema_doc)

    # Fetch the response (xml_headers is defined elsewhere, not shown)
    request = urllib2.Request(query, headers=xml_headers)
    response = urllib2.urlopen(request)
    body = response.read()
    response_doc = etree.fromstring(body)

    try:
        oaischema.assertValid(response_doc)
    except etree.DocumentInvalid as e:
        # Print the body with line numbers so the reported line can be located
        line = 1
        for i in body.split("\n"):
            print "{0}\t{1}".format(line, i)
            line += 1
        print(e.message)

I get the following error:

 AssertionError: http://localhost:8080/OAI-PMH?verb=GetRecord&by_doc_ID=false&metadataPrefix=nsdl_dc&identifier=http://www.purplemath.com/modules/ratio.htm Element '{http://www.openarchives.org/OAI/2.0/oai_dc/}oai_dc': No matching global element declaration available, but demanded by the strict wildcard., line 22

I understand the error: the schema requires that the child element of the metadata element be strictly validated, which the sample XML does.
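To confirm which element the strict wildcard is being applied to, it can help to list the namespace-qualified tags directly under each metadata element. This is a minimal sketch using only the standard library (written for Python 3); the function name `metadata_children` and the trimmed sample document are illustrative and not part of the original code:

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def metadata_children(xml_text):
    """Return the qualified tags of the direct children of every OAI-PMH <metadata> element."""
    root = ET.fromstring(xml_text)
    tags = []
    for metadata in root.iter("{%s}metadata" % OAI_NS):
        for child in metadata:
            tags.append(child.tag)
    return tags


# Trimmed-down, hypothetical sample modelled on the document in the question
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Example</dc:title>
    </oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>"""

print(metadata_children(SAMPLE))
```

Each tag printed is an element in a foreign namespace for which the validator needs a global element declaration, because the wildcard uses `processContents="strict"`.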

I have written a validator in Java that works, but it would be helpful to have it in Python, since the rest of the solution I am building is Python based. For my Java version to work, I had to make my DocumentFactory namespace aware; otherwise I got the same error. I have not found any working example in Python that performs this validation correctly.

Does anyone have an idea how I can get an XML document with multiple nested namespaces, like my sample document, to validate using Python?

Here is an example XML document I'm trying to validate:

    <?xml version="1.0" encoding="UTF-8"?>
    <OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
             http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
      <responseDate>2002-02-08T08:55:46Z</responseDate>
      <request verb="GetRecord" identifier="oai:arXiv.org:cs/0112017"
               metadataPrefix="oai_dc">http://arXiv.org/oai2</request>
      <GetRecord>
        <record>
          <header>
            <identifier>oai:arXiv.org:cs/0112017</identifier>
            <datestamp>2001-12-14</datestamp>
            <setSpec>cs</setSpec>
            <setSpec>math</setSpec>
          </header>
          <metadata>
            <oai_dc:dc
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
                http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
              <dc:title>Using Structural Metadata to Localize Experience of Digital Content</dc:title>
              <dc:creator>Dushay, Naomi</dc:creator>
              <dc:subject>Digital Libraries</dc:subject>
              <dc:description>With the increasing technical sophistication of both information consumers and providers, there is increasing demand for more meaningful experiences of digital information. We present a framework that separates digital object experience, or rendering, from digital object storage and manipulation, so the rendering can be tailored to particular communities of users. </dc:description>
              <dc:description>Comment: 23 pages including 2 appendices, 8 figures</dc:description>
              <dc:date>2001-12-14</dc:date>
            </oai_dc:dc>
          </metadata>
        </record>
      </GetRecord>
    </OAI-PMH>

python xml validation xsd

Jim Mar 16 '11 at 23:19

1 answer

Neil santos · Answer 1 · 2012-01-20T06:25:45+0000

This is in the lxml documentation on validation:

    >>> schema_root = etree.XML('''\
    ...   <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    ...     <xsd:element name="a" type="xsd:integer"/>
    ...   </xsd:schema>
    ... ''')
    >>> schema = etree.XMLSchema(schema_root)

    >>> parser = etree.XMLParser(schema = schema)
    >>> root = etree.fromstring("<a>5</a>", parser)

So maybe this is what you need (see the last two lines):

    schema_doc = etree.parse(schema_file)
    oaischema = etree.XMLSchema(schema_doc)

    request = urllib2.Request(query, headers=xml_headers)
    response = urllib2.urlopen(request)
    body = response.read()

    parser = etree.XMLParser(schema = oaischema)
    response_doc = etree.fromstring(body, parser)
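With a schema-aware parser, validation happens while the document is parsed, and an invalid document raises `etree.XMLSyntaxError` instead of `DocumentInvalid`. The following is a minimal, self-contained sketch of that behaviour (written for Python 3). It uses the toy one-element schema from the lxml example rather than the OAI-PMH schema, and the helper name `parse_validated` is hypothetical:

```python
from lxml import etree

# Toy schema from the lxml validation example: a single integer element <a>
SCHEMA = etree.XMLSchema(etree.XML(
    '<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">'
    '<xsd:element name="a" type="xsd:integer"/>'
    '</xsd:schema>'
))


def parse_validated(xml_text, schema=SCHEMA):
    """Parse xml_text with a validating parser.

    Returns (root, None) on success or (None, error message) on failure.
    """
    parser = etree.XMLParser(schema=schema)
    try:
        return etree.fromstring(xml_text, parser), None
    except etree.XMLSyntaxError as exc:
        # Schema violations surface as parse errors with this kind of parser
        return None, str(exc)


root, error = parse_validated("<a>5</a>")
print(root.tag, error)

root, error = parse_validated("<a>not an integer</a>")
print(root, error)
```

In a unit test, the returned error message can be asserted on directly, which avoids a separate `assertValid` call after parsing.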
