Can I pass a URL to lxml.etree.parse in Python 3?

Question

The documentation states that I can:

lxml can parse a local file, an HTTP URL, or an FTP URL. It also automatically detects and reads gzip-compressed XML files (.gz).

(from http://lxml.de/parsing.html in the "Parsers" section)

but a quick experiment seems to imply otherwise:

    Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 10:45:13) [MSC v.1600 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from lxml import etree
    >>> parser = etree.HTMLParser()
    >>> from urllib.request import urlopen
    >>> with urlopen('https://pypi.python.org/simple') as f:
    ...     tree = etree.parse(f, parser)
    ...
    >>> tree2 = etree.parse('https://pypi.python.org/simple', parser)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src\lxml\lxml.etree.c:72655)
      File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:106263)
      File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:106564)
      File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:105561)
      File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:100456)
      File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94543)
      File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:96003)
      File "parser.pxi", line 618, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:95015)
    OSError: Error reading file 'https://pypi.python.org/simple': failed to load external entity "https://pypi.python.org/simple"
    >>>

I can use the urlopen approach, but the documentation seems to imply that passing the URL directly is somehow better. I'm also a little worried about relying on lxml if its documentation is inaccurate, especially once I start doing something more complex.

What is the correct way to parse HTML with lxml from a known URL? And where should I look for this to be documented?

Update: I get the same error if I use the http URL instead of https.

python lxml

Paul moore Oct 2 '14 at 14:37

1 answer

Paul moore · Accepted Answer · 2014-10-02T15:43:08+0000

The problem is that lxml does not support HTTPS URLs, and http://pypi.python.org/simple redirects to the HTTPS version.

So for any site served over HTTPS, you need to fetch the URL yourself and pass the open response to lxml:

    from lxml import etree
    from urllib.request import urlopen

    parser = etree.HTMLParser()

    with urlopen('https://pypi.python.org/simple') as f:
        tree = etree.parse(f, parser)
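
If you want a single entry point that works for both local files and web pages, the same idea can be wrapped in a small helper. This is only a minimal sketch: parse_html and fetch_with_urllib are hypothetical names made up for illustration, not part of lxml. It assumes lxml is installed, and it routes plain http URLs through urlopen as well, since an http URL may redirect to HTTPS (as the PyPI one above does).

```python
from urllib.parse import urlsplit
from urllib.request import urlopen


def fetch_with_urllib(source):
    """Return True if `source` is an http(s) URL that we should fetch ourselves.

    Hypothetical helper: plain http is included because such a URL may
    redirect to HTTPS, which lxml cannot load on its own.
    """
    return urlsplit(source).scheme.lower() in ("http", "https")


def parse_html(source):
    """Parse HTML from a local path or a URL (hypothetical wrapper)."""
    # lxml is a third-party package; it is imported here so that
    # fetch_with_urllib() can be used on its own without lxml installed.
    from lxml import etree

    parser = etree.HTMLParser()
    if fetch_with_urllib(source):
        # Let urllib deal with TLS and redirects, then hand lxml the
        # open file-like response object.
        with urlopen(source) as response:
            return etree.parse(response, parser)
    # Anything else (for example a local file name) goes straight to lxml.
    return etree.parse(source, parser)
```

With this in place, parse_html('https://pypi.python.org/simple') and parse_html('page.html') would both return a parsed tree, with only the web case going through urlopen.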
