Python urllib open problem

Question

I am trying to extract data from http://book.libertorrent.com/ , but so far I am failing because the body returned by read() contains extra data (the HTTP headers). My code is very simple:

    # Python 2 (urllib.urlopen does not exist in Python 3)
    import urllib

    response = urllib.urlopen('http://book.libertorrent.com/login.php')
    with open('someFile.html', 'w') as f:
        f.write(response.read())

read() returns:

    Date: Fri, 09 Nov 2012 07:36:54 GMT
    Content-Type: text/html; charset=utf-8
    Transfer-Encoding: chunked
    Connection: close
    Cache-Control: no-cache, pre-check=0, post-check=0
    Expires: 0
    Pragma: no-cache
    Set-Cookie: bb_test=973132321; path=/; domain=book.libertorrent.com
    Content-Language: ru

    1ec0
    ...Html...
    0

And response.info() is empty.

Is there any way to fix the response?

python urllib

maravan Nov 10 '12 at 17:06

1 answer

mata · Accepted Answer · 2012-11-10T17:55:23+0000

Let's try this:

    $ echo -ne "GET /index.php HTTP/1.1\r\nHost: book.libertorrent.com\r\n\r\n" | nc book.libertorrent.com 80 | head -n 10
    HTTP/1.1 200 OK
    WWW
    Date: Sat, 10 Nov 2012 17:41:57 GMT
    Content-Type: text/html; charset=utf-8
    Transfer-Encoding: chunked
    Connection: keep-alive
    Content-Language: ru

    1f57
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"><html dir="ltr">

See the "WWW" on the second line? That is not a valid HTTP header (it has no "name: value" form), and I assume that is what throws off the response parser here.
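To make the point concrete, here is a minimal sketch that scans a raw response head for lines that do not look like "name: value" headers. The function name find_invalid_header_lines and the sample string are made up for illustration (the sample is abbreviated from the nc output above), and the check is deliberately simplistic: it would also flag obsolete folded continuation lines.

```python
import re

# Assumption: a header line must start with an RFC 7230 style token followed by ':'.
HEADER_NAME = re.compile(r"^[!#$%&'*+\-.^_`|~0-9A-Za-z]+:")

def find_invalid_header_lines(raw_head):
    """Return the lines after the status line that lack a 'name:' prefix."""
    lines = raw_head.split("\r\n")
    bad = []
    for line in lines[1:]:          # lines[0] is the status line
        if line == "":
            break                   # a blank line ends the header block
        if not HEADER_NAME.match(line):
            bad.append(line)
    return bad

# Hypothetical sample modelled on the server output shown above.
sample = (
    "HTTP/1.1 200 OK\r\n"
    "WWW\r\n"
    "Date: Sat, 10 Nov 2012 17:41:57 GMT\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Transfer-Encoding: chunked\r\n"
    "\r\n"
)

print(find_invalid_header_lines(sample))  # ['WWW']
```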

By the way, Python 2 and Python 3 behave differently:

Python 2 seems to immediately treat everything after this invalid header as content.
Python 3 ignores all the headers and continues reading the content after the blank line. Since the headers are ignored, so is the Transfer-Encoding, and therefore the chunk lengths are interpreted as part of the body.
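For context, the stray 1ec0 and trailing 0 in the question's output are chunk-size lines from the chunked transfer encoding. A client that honours the Transfer-Encoding header strips them; the rough sketch below shows the idea. The helper name decode_chunked and the sample bytes are hypothetical, trailers are ignored, and this is only an illustration of the encoding, not what urllib does internally.

```python
def decode_chunked(body):
    """Decode a chunked HTTP body given as bytes (sketch; trailers are ignored)."""
    out = []
    pos = 0
    while True:
        eol = body.index(b"\r\n", pos)
        # The size line is hexadecimal and may carry a ";extension" suffix.
        size = int(body[pos:eol].split(b";")[0], 16)
        if size == 0:
            break                   # a zero-length chunk ends the body
        start = eol + 2
        out.append(body[start:start + size])
        pos = start + size + 2      # skip the CRLF that follows the chunk data
    return b"".join(out)

# Hypothetical sample: two chunks of 5 and 7 bytes, then the terminating chunk.
raw = b"5\r\nHello\r\n7\r\n, world\r\n0\r\n\r\n"
print(decode_chunked(raw))  # b'Hello, world'
```

If the header is ignored, as described above, nothing runs this step, so the size lines stay in the data that read() returns.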

So the problem is that the server is sending an invalid response, and that has to be fixed on the server.
