I use mechanize / cookiejar / lxml to read a page, and this works for some pages but not for others. The error I get on the failing ones is the one in the title. I can't post the pages here because they are not SFW, but is there any way to fix this? Basically, this is what I do:
    import mechanize, cookielib
    from lxml import etree

    br = mechanize.Browser()
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)

    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(False)
    br.set_handle_robots(False)
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 maverick Firefox/3.6.13')]

    response = br.open('...')
    tree = etree.parse(response)
After that, I get the root and search the document for the values I want. Apparently, iterparse does not choke on it, but for now I assume that is only because I have not processed anything with it yet. Also, I have not yet figured out how to search for things with it.
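For reference, searching with iterparse usually means looping over the parse events and checking each element's tag as it closes. The following is only a minimal sketch: the helper name `find_texts`, the sample document, and the tag name `item` are made up for illustration, and the fallback to the standard library's ElementTree is there only so the snippet runs without lxml installed.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib ElementTree offers the same iterparse() call.
    import xml.etree.ElementTree as etree


def find_texts(source, wanted_tag):
    """Collect the text of every element named wanted_tag (hypothetical helper)."""
    texts = []
    # "end" events fire once an element has been read completely.
    for event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == wanted_tag:
            texts.append(elem.text)
            elem.clear()  # drop the element's content to keep memory low
    return texts


# Made-up sample standing in for the real response object.
sample = io.BytesIO(b"<root><item>a</item><other>x</other><item>b</item></root>")
found = find_texts(sample, "item")
print(found)
```

Since iterparse takes any file-like object, the mechanize response could in principle be passed in place of `sample`, though I have not verified that against the failing pages.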
I tried disabling gzip and enabling sending of referers, but neither solves the problem. I also tried saving the source to disk and building the tree from there, just for the sake of it, and I get the same error.
Edit:
The response I get seems fine: print(response) gives <response_seek_wrapper at 0xa4a160c whose wrapped object = <stupid_gzip_wrapper at 0xa49acec whose fp = <socket._fileobject object at 0xa49c32c>>>. I can also save the response using the read() method and verify that the saved .xml works in a browser and all that.
In addition, one of the pages contains &rsquo;, which causes the following error: "lxml.etree.XMLSyntaxError: Entity 'rsquo' not defined, line 17, column 7054". So far I have replaced it with a regex, but is there a parser that can handle this? I got this error even with lxml.html.parse, as suggested below.
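To make the regex workaround concrete, here is a minimal sketch of that kind of replacement. It rewrites named HTML entities as numeric character references, which an XML parser accepts without a DTD. The helper name `replace_html_entities` and the sample string are made up for illustration, and the fallback to the standard library's ElementTree is only there so the snippet runs without lxml installed.

```python
import re

try:
    from html.entities import name2codepoint   # Python 3
except ImportError:
    from htmlentitydefs import name2codepoint  # Python 2

try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib ElementTree offers the same fromstring() call.
    import xml.etree.ElementTree as etree

# Entities that XML itself predefines; these must stay untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def replace_html_entities(text):
    """Turn named HTML entities such as &rsquo; into numeric ones (&#8217;)."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML and unknown entities as they are
        return "&#%d;" % name2codepoint[name]
    return ENTITY_RE.sub(repl, text)


# Made-up sample standing in for the saved page source.
sample = "<doc><p>It&rsquo;s here &amp; there</p></doc>"
fixed = replace_html_entities(sample)
root = etree.fromstring(fixed)
print(fixed)
```

This only deals with undefined entities; it does nothing about other well-formedness problems the page may have.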
Regarding the highlighted file, I meant that when I open it with gEdit, it looks like this: http://img34.imageshack.us/img34/9574/gedit.jpg