What is wrong with this gzip format?

Question

I use the following python code to load web pages from servers with gzip compression:

url = "http://www.v-gn.de/wbb/"
import urllib2
request = urllib2.Request(url)
request.add_header('Accept-encoding', 'gzip')
response = urllib2.urlopen(request)
content = response.read()
response.close()

import gzip
from StringIO import StringIO
html = gzip.GzipFile(fileobj=StringIO(content)).read()

This works in general, but for the specified URL it fails with a struct.error. I get a similar result if I use wget with the "Accept-Encoding" header. However, browsers seem to be able to unpack the response.

So my question is: is there a way to get my Python code to unpack the HTTP response without disabling compression by removing the "Accept-encoding" header?

For completeness, here is the line I use for wget:

wget --user-agent="Mozilla" --header="Accept-Encoding: gzip,deflate" http://www.v-gn.de/wbb/

python http gzip

itsadok Sep 06 '10 at 14:20

3 answers

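I used your wget command to download the gzip-ed index.html. It was saved as index.html, so I renamed it to index.html.gz. Then I tried gzip -d index.html.gz, which failed with: gzip: index.html.gz: unexpected end of file.

Next I tried zcat index.html.gz, which worked, except that after the closing </html> tag it printed the same error message.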

$ zcat index.html.gz
...
  </td>
 </tr>
</table>


</body>
</html>
gzip: index.html.gz: unexpected end of file

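The page content is complete, but the gzip stream itself ends prematurely, which points to a problem on the server side.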
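The symptom above (full output followed by "unexpected end of file") is consistent with a gzip stream whose 8-byte trailer (CRC-32 plus uncompressed length) is missing or cut short. As a rough offline sketch, not a diagnosis of this particular server, the snippet below builds a gzip stream, cuts off the trailer, and checks whether the payload still decompresses. The function name `inspect_gzip` and the sample page are made up for illustration, and the `eof` attribute needs Python 3.3 or later.

```python
import gzip
import io
import zlib


def inspect_gzip(data):
    """Return (decompressed_bytes, stream_complete) for a gzip payload."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    body = decompressor.decompress(data)
    # eof becomes True only once the end of the stream, trailer included, is reached.
    return body, decompressor.eof


# Hypothetical sample page standing in for the server's response.
page = b"<html><body>" + b"<p>hello</p>" * 50 + b"</body></html>"
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(page)
complete = buf.getvalue()
truncated = complete[:-8]  # drop the CRC-32 and length fields of the trailer

for name, data in (("complete", complete), ("truncated", truncated)):
    body, ok = inspect_gzip(data)
    print(name, len(body), ok)
```

In this sketch both inputs should yield the full page, and only the complete one is reported as a finished stream.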

Notinlist Sep 06 '10 at 14:27

To handle gzip-compressed responses, you can subclass urllib2.HTTPHandler and override http_open().

import gzip
from StringIO import StringIO
import httplib, urllib, urllib2

class GzipHandler(urllib2.HTTPHandler):
    def http_open(self, req):
        # Always ask the server for a gzip-compressed response.
        req.add_header('Accept-encoding', 'gzip')
        r = self.do_open(httplib.HTTPConnection, req)
        if r.headers.get('Content-Encoding') == 'gzip':
            # Wrap the body so callers read decompressed data.
            fp = gzip.GzipFile(fileobj=StringIO(r.read()))
        else:
            fp = r
        response = urllib.addinfourl(fp, r.headers, r.url, r.code)
        response.msg = r.msg
        return response

It can then be used like this:

def retrieve(url):
    request = urllib2.Request(url)
    opener = urllib2.build_opener(GzipHandler)
    return opener.open(request)

This way, gzip decompression is handled transparently by the opener.

Derrick petzold Jul 14 '11 at 9:30

unutbu · Accepted Answer · 2010-09-06T14:53:25+0000

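It appears that you can call readline() on the gzip.GzipFile object; it is only read() that raises a struct.error, when it reaches the end of the file.

Since readline works (except at the very end), you could do something like this: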

import urllib2
import StringIO
import gzip
import struct

url = "http://www.v-gn.de/wbb/"
request = urllib2.Request(url)
request.add_header('Accept-encoding', 'gzip')
response = urllib2.urlopen(request)
content = response.read()
response.close()

fh = StringIO.StringIO(content)
html = gzip.GzipFile(fileobj=fh)
try:
    # Iterating reads line by line, so each line is printed before the error occurs.
    for line in html:
        line = line.rstrip()
        print(line)
except struct.error:
    # Raised at the very end of the stream; the lines before it were already printed.
    pass
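An alternative that avoids catching struct.error is to bypass gzip.GzipFile and feed the bytes to a zlib decompression object in gzip mode, which returns whatever it can decode and does not raise when the end of the stream is missing. The following is a minimal sketch under that assumption; `iter_gunzip` and `chunk_size` are names invented here, and the code is written to run on both Python 2 and Python 3.

```python
import zlib


def iter_gunzip(fileobj, chunk_size=8192):
    """Yield decompressed chunks from a file-like object holding gzip data."""
    # wbits = 16 + MAX_WBITS makes zlib parse the gzip header itself.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        data = decompressor.decompress(chunk)
        if data:
            yield data
    # Emit anything still buffered; a missing trailer is not treated as an error here.
    tail = decompressor.flush()
    if tail:
        yield tail
```

With the variables from the code above, this could be called as `"".join(iter_gunzip(StringIO.StringIO(content)))`. The trade-off is that a genuinely truncated page would also pass silently, since the checksum in the trailer is never verified.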
