Why does urllib return garbage from some Wikipedia articles?

    >>> import urllib2
    >>> good_article = 'http://en.wikipedia.org/wiki/Wikipedia'
    >>> bad_article = 'http://en.wikipedia.org/wiki/India'
    >>> req1 = urllib2.Request(good_article)
    >>> req2 = urllib2.Request(bad_article)
    >>> req1.add_header('User-Agent', 'Mozilla/5.0')
    >>> req2.add_header('User-Agent', 'Mozilla/5.0')
    >>> result1 = urllib2.urlopen(req1)
    >>> result2 = urllib2.urlopen(req2)
    >>> result1.readline()
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
    >>> result2.readline()
    '\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xec\xfdi\x8f$I\x96\x18\x08~\xee\xfe\x15BO\x06+\x82\xeefn\xa7\x9b[D\x855<<\x8e\x8c\xcc8\x9c\xe1\x9e\x99l6{\x15bf\xeaf\x1a\xae\xa6j\xa5\x87{x\x12\x1cT-\xb0 \xb1\xc0\x00\x0b4\x81\x01wg?\x10S\xe4\xee\x92\x98\x9d\x9ec\x01\x12\x8b]\x02\xdd5\x1f8\x1c\xf07\xd4\xd4\x1f\xd8\xbf\xb0\xef\x10\x11\x155\x15\xb5\xc3#\xb2"\xbaf\xea\x087\x95KEE\x9e<y\xf7\xfb\xf9\xdfz\xfa\xf6\xf4\xe2O\xcf\x9e\x89y\xb6\x08\xc5\xd9wO^\xbd<\x15{\x8d\xc3\xc3\x1f\xba\xa7\x87\x87O/\x9e\x8a\xbf\xff\xf5\xc5\xebW\xa2\xddl\x89\x8bDFi\x90\x05q$\xc3\xc3\xc3go\xf6\xc4\xde<\xcb\x92\x0c\xe1O\x16d\xa1?z\x19M\x03)\x1a\xe2\x87\xe0*X\xfa\xf0\xfb@ds _\\&\xbe/\xfchr;\tc*\xfe\xf9!\xb7\xff\xe3\x9f/\xfcL\n'

The headers do not seem to be the reason: I tried sending exactly the same headers as my browser, and urllib2 still returns this garbage.

Most pages come back fine; only some articles are returned like this.

3 answers

This is not an environment, locale, or encoding problem. The offending byte stream is gzip-compressed: the \x1f\x8b at the start is exactly what the beginning of a gzip stream looks like with the default settings.

It looks like the server is ignoring the fact that you never did

    req2.add_header('Accept-Encoding', 'gzip')

You should look at result.headers.getheader('Content-Encoding') and decompress the response body yourself if necessary.
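
For instance, here is a minimal sketch of that check in Python 2 (as in the question), using only the standard library; the variable names are just for illustration:

    import gzip
    import urllib2
    from StringIO import StringIO

    req = urllib2.Request('http://en.wikipedia.org/wiki/India')
    req.add_header('User-Agent', 'Mozilla/5.0')
    result = urllib2.urlopen(req)
    data = result.read()

    # The Content-Encoding header says how the body is encoded;
    # \x1f\x8b at the start of the body is the gzip magic number.
    if result.headers.getheader('Content-Encoding') == 'gzip':
        # gzip.GzipFile wants a file-like object, so wrap the raw bytes.
        data = gzip.GzipFile(fileobj=StringIO(data)).read()

    print data[:120]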


I think something else is causing your problem; that series of bytes just looks like it is in some encoded form. Running

    import urllib2

    bad_article = 'http://en.wikipedia.org/wiki/India'
    req = urllib2.Request(bad_article)
    req.add_header('User-Agent', 'Mozilla/5.0')
    result = urllib2.urlopen(req)
    print result.readline()

led to this

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

which is correct.


Do a curl -i on both links. If the output looks fine for both, there is no problem with your environment.
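
For example, from a shell (the -i flag makes curl print the response headers along with the body):

    curl -i http://en.wikipedia.org/wiki/Wikipedia
    curl -i http://en.wikipedia.org/wiki/India

If one of the two responses shows Content-Encoding: gzip, that confirms the compressed-body explanation above.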
