Using lxml, what causes the error "lxml.etree.XMLSyntaxError: Document is empty"?

I use mechanize / cookiejar / lxml to read pages. This works for some pages but not for others; on those I get the error in the title. I cannot post the pages here because they are not SFW, but is there any way to fix this? Basically, this is what I am doing:

    import mechanize, cookielib
    from lxml import etree

    br = mechanize.Browser()
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)

    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(False)
    br.set_handle_robots(False)
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 maverick Firefox/3.6.13')]

    response = br.open('...')
    tree = etree.parse(response)  # error

After that, I get the root element and search the document for the values I want. Apparently iterparse does not fail on these pages, but at the moment I assume that is only because I have not processed anything with it. Also, I have not yet figured out how to search for things with it.

I tried disabling gzip and enabling the referer header, but neither solves the problem. I also tried saving the source to disk and building the tree from that file, just for the sake of it, and I get the same error.

Edit

The response I get seems fine. Printing response gives <response_seek_wrapper at 0xa4a160c whose wrapped object = <stupid_gzip_wrapper at 0xa49acec whose fp = <socket._fileobject object at 0xa49c32c>>>. I can also save the response using the read() method and verify that the saved .xml works in the browser and all that.

In addition, one of the pages contains &rsquo;, which causes the following error: "lxml.etree.XMLSyntaxError: Entity 'rsquo' not defined, line 17, column 7054". So far I have been replacing it with a regex, but is there a parser that can handle this? I got this error even with lxml.html.parse, suggested below.

Regarding the highlighted file, I meant that when I open it with gEdit, it does this: http://img34.imageshack.us/img34/9574/gedit.jpg

2 answers

Use lxml.html.parse for HTML; it can handle even very broken HTML. Do you still get an error message with it?
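As a rough illustration of that suggestion, here is a minimal sketch that feeds lxml.html.parse a deliberately broken snippet. The markup in `broken` is a made-up sample (unclosed <p> tags plus the &rsquo; entity), not one of the asker's pages, and the in-memory io.BytesIO stands in for the mechanize response.

```python
import io
from lxml import html

# Hypothetical sample: unclosed <p> tags plus the &rsquo; entity.
broken = b"<html><body><p>It&rsquo;s broken<p>second</body></html>"

# html.parse accepts a file-like object, just like etree.parse does.
tree = html.parse(io.BytesIO(broken))

# The HTML parser closes the first <p> implicitly and resolves &rsquo;.
paragraphs = tree.getroot().findall(".//p")
texts = [p.text for p in paragraphs]
print(texts)
```

The HTML parser knows the standard HTML entity names, which the plain XML parser does not unless a DTD declares them.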


What type of object is response? According to the help, etree.parse expects one of:

  - a file name/path
  - a file object
  - a file-like object
  - a URL using the HTTP or FTP protocol
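One thing worth checking with a file-like object is whether it has already been read. The sketch below is only an assumption about a possible cause, not a diagnosis of the asker's pages: it uses a made-up payload in an io.BytesIO (standing in for the mechanize response) to show that a consumed stream gives the parser nothing to read, and that rewinding it with seek(0) fixes that.

```python
import io
from lxml import etree

data = b"<root><item>1</item></root>"  # hypothetical payload
stream = io.BytesIO(data)

stream.read()  # consume the stream, e.g. while saving it to disk

try:
    etree.parse(stream)  # nothing left to read at this point
    exhausted_ok = True
except etree.XMLSyntaxError:
    exhausted_ok = False  # the parser saw an empty document

stream.seek(0)  # rewind before parsing again
root = etree.parse(stream).getroot()
item_text = root.findtext("item")
print(exhausted_ok, item_text)
```

If the real response object has been read earlier in the program, rewinding it or parsing the saved bytes with etree.fromstring may be worth trying.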

Source: https://habr.com/ru/post/1334860/

