How can I make xml.sax use an HTTP proxy for its DTD requests?

XML parsers are known to often send HTTP requests to fetch the DTDs referenced in documents. Python's parser, in particular, does this. This causes excessive traffic for www.w3.org, which hosts many of these DTDs. In turn, it makes XML parsing slow and, in some cases, causes it to time out. This can be a serious problem, because it makes a task that appears to be pure text processing depend on an unreliable third party.

To mitigate this problem (since the real solution is very complicated), I would like to install a local caching web proxy and have xml.sax send its requests through it. I specifically do not want the proxy settings to leak to other components, so changing the system-wide settings is out of the question.

How can I get xml.sax to use an HTTP proxy?

I have:

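import xml.sax

# parse_document is a hypothetical name for the surrounding function.
def parse_document(indata, handler):
    # handler: instance of a subclass of xml.sax.handler.ContentHandler
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.parse(indata)
    return handler.result()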

One approach is to use a custom EntityResolver. However, it turns out that a caching EntityResolver cannot be implemented, because the resolver does not receive enough information.
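To illustrate that limitation: the resolveEntity hook is called with only a public ID and a system ID. The following is a minimal sketch, not part of the original question; the class name RecordingResolver is hypothetical, and it only records what the resolver is given.

```python
from xml.sax.handler import EntityResolver


class RecordingResolver(EntityResolver):
    """Hypothetical resolver that records every entity it is asked about."""

    def __init__(self):
        self.requests = []

    def resolveEntity(self, publicId, systemId):
        # Only these two values are passed in; no base URI accompanies them.
        self.requests.append((publicId, systemId))
        # Returning the system ID unchanged matches the default resolver.
        return systemId


resolver = RecordingResolver()
resolver.resolveEntity("-//W3C//ENTITIES Latin 1 for XHTML//EN", "xhtml-lat1.ent")
print(resolver.requests)
```

Such a resolver would be installed with parser.setEntityResolver(resolver). The system ID may arrive as a relative reference such as xhtml-lat1.ent, without the URI it should be resolved against, which is presumably why a reliable cache key is hard to build at this level.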

1 answer

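One option is to monkey-patch saxutils.prepare_input_source. You could copy and paste that function and change the branch that calls urllib.urlopen so that it uses a UrlOpener from urllib2 with the proxy configured.

Unfortunately, I think this is the only way to get exactly the behavior you want, short of changing the system settings or writing your own EntityResolver that caches results.

The problem is that saxutils.prepare_input_source calls urllib.urlopen unconditionally and offers no way to change that behavior. So the alternative is to route all requests through your proxy, which would affect every other client of urllib.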


Update: a working monkey-patch implementation, for reference:

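import urllib2
import urlparse
from xml.sax import saxutils, xmlreader


def make_caching_prepare_input_source(old_prepare_input_source, proxy):
    """Wrap prepare_input_source so that http: URIs are fetched via proxy."""
    def caching_prepare_input_source(source, base=None):
        args = (source,) if base is None else (source, base)

        # Non-string sources (InputSource, file-like objects) are passed
        # to the original function unchanged.
        if not isinstance(source, basestring):
            return old_prepare_input_source(*args)

        full_uri = urlparse.urljoin(base or "", source)

        # Only plain HTTP requests are sent through the proxy.
        if not full_uri.startswith('http:'):
            return old_prepare_input_source(*args)

        # Set the proxy on this one request; global urllib settings stay untouched.
        r = urllib2.Request(full_uri)
        r.set_proxy(proxy, 'http')
        f = urllib2.urlopen(r)

        i = xmlreader.InputSource()
        # Use the resolved URI so nested relative references resolve against it.
        i.setSystemId(full_uri)
        i.setByteStream(f)

        return i

    return caching_prepare_input_source


def enable_http_proxy(server):
    """Monkey-patch saxutils so xml.sax fetches HTTP resources via server."""
    saxutils.prepare_input_source = make_caching_prepare_input_source(
        saxutils.prepare_input_source,
        server,
    )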
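
The patch above targets Python 2 (urllib2, urlparse). On Python 3 the same idea might look roughly like the sketch below. This is an adaptation under assumptions, not part of the original answer: the name make_proxied_prepare_input_source and the opener parameter (added so the fetch step can be swapped out in tests) are hypothetical.

```python
import urllib.parse
import urllib.request
from xml.sax import saxutils, xmlreader


def make_proxied_prepare_input_source(original, proxy,
                                      opener=urllib.request.urlopen):
    """Wrap prepare_input_source so http: URIs are fetched through proxy.

    opener is a hypothetical hook: any callable taking a Request and
    returning a binary file-like object.
    """
    def proxied_prepare_input_source(source, base=""):
        # Non-string sources (InputSource, file objects) are left to the
        # original implementation.
        if not isinstance(source, str):
            return original(source, base)

        full_uri = urllib.parse.urljoin(base or "", source)
        if not full_uri.startswith("http:"):
            return original(source, base)

        # Set the proxy on this one request only; nothing global changes.
        request = urllib.request.Request(full_uri)
        request.set_proxy(proxy, "http")

        input_source = xmlreader.InputSource()
        input_source.setSystemId(full_uri)
        input_source.setByteStream(opener(request))
        return input_source

    return proxied_prepare_input_source


def enable_http_proxy(proxy):
    # Monkey-patch the module attribute that the expat reader looks up.
    saxutils.prepare_input_source = make_proxied_prepare_input_source(
        saxutils.prepare_input_source, proxy
    )
```

Calling enable_http_proxy("localhost:3128") once before parsing would install the patch; the host and port here are placeholders. Note that since Python 3.7.1 the SAX parser does not process general external entities by default, so these requests may not occur at all unless xml.sax.handler.feature_external_ges is enabled on the parser.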

Source: https://habr.com/ru/post/1776416/

