Attempting to access the Internet using urllib2 in Python

I am trying to write a program that (among other things) will retrieve text or source code from a predefined website. I am learning Python for this, and most sources told me to use urllib2. As a test, I tried this code:

    import urllib2
    response = urllib2.urlopen('http://www.python.org')
    html = response.read()

Instead of behaving in any expected way, the shell just sits there, as if waiting for input. There is not even a ">>>" or "...". The only way to get out of this state is with [Ctrl]+C. When I do that, I get a whole bunch of error messages like:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/m/mls/pkg/ix86-Linux-RHEL5/lib/python2.5/urllib2.py", line 124, in urlopen
        return _opener.open(url, data)
      File "/m/mls/pkg/ix86-Linux-RHEL5/lib/python2.5/urllib2.py", line 381, in open
        response = self._open(req, data)

I would appreciate any feedback. Is there a tool other than urllib2 I should use, or can you give me tips on how to fix this? I use a networked computer at work, and I'm not quite sure how the shell is configured or how that might affect anything.

4 answers

With 99.999% probability, this is a proxy problem. Python is incredibly bad at finding the right HTTP proxy to use, and when it cannot find one, it just hangs and eventually times out.

So first you need to find out which proxy server to use: check your browser settings (Tools → Internet Options → Connections → LAN Settings... in IE, etc.). If it uses a script for auto-configuration, you will need to fetch that script (which should be some kind of JavaScript) and find out where your request is supposed to go. If no script is specified and the "automatically detect" option is checked, you can simply ask an IT person at your company which proxy to use.
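A quick way to see which proxies Python picks up on its own is urllib.getproxies(); a minimal diagnostic sketch (standard library only):

    import urllib

    # Shows the proxies Python auto-detects from the environment
    # (http_proxy / https_proxy variables, or the registry on Windows).
    # An empty dict on a proxied network would explain the hang.
    print urllib.getproxies()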

I assume you are using Python 2.x. From the Python docs on urllib:

    # Use http://www.someproxy.com:3128 for http proxying
    proxies = {'http': 'http://www.someproxy.com:3128'}
    filehandle = urllib.urlopen(some_url, proxies=proxies)

Note that a ProxyHandler figuring out the default values is already what happens when you use urlopen, so this probably won't work.

If you really want urllib2, you will need to specify a ProxyHandler, like the example on this page. Authentication may or may not be required (usually it is not).
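For example, a minimal sketch (the proxy host and port here are made up; substitute the ones you found in your browser settings):

    import urllib2

    # Hypothetical proxy address -- replace with your actual proxy.
    proxy = urllib2.ProxyHandler({'http': 'http://proxy.example.com:3128'})
    opener = urllib2.build_opener(proxy)
    urllib2.install_opener(opener)  # later urlopen calls go through the proxy

    response = urllib2.urlopen('http://www.python.org')
    html = response.read()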


This is not a good answer to "How to do this with urllib2", but let me suggest python-requests. The whole reason it exists is that the author found urllib2 to be a cumbersome mess. And he is probably right.
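For comparison, a minimal sketch with requests (the proxy address is hypothetical; drop the proxies argument entirely on an unproxied network):

    import requests

    # Hypothetical proxy address -- replace with your actual proxy.
    proxies = {'http': 'http://proxy.example.com:3128'}
    response = requests.get('http://www.python.org', proxies=proxies, timeout=5)
    html = response.text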


That is very strange; have you tried a different URL? Otherwise there is httplib, though it is more complicated. Here is your example using httplib:

    import httplib as h

    domain = h.HTTPConnection('www.python.org')
    domain.connect()
    domain.request('GET', '/fish.html')
    response = domain.getresponse()
    if response.status == h.OK:
        html = response.read()

I get a 404 error almost immediately (with no hanging):

    >>> import urllib2
    >>> response = urllib2.urlopen('http://www.python.org/fish.html')
    Traceback (most recent call last):
      ...
    urllib2.HTTPError: HTTP Error 404: Not Found

If I try to contact an address where no HTTP server is running, it hangs for a long time until the timeout occurs. You can shorten the wait by passing a timeout parameter to urlopen (added in Python 2.6):

    >>> response = urllib2.urlopen('http://cs.princeton.edu/fish.html', timeout=5)
    Traceback (most recent call last):
      ...
    urllib2.URLError: <urlopen error timed out>
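If you would rather handle the failure than get a traceback, a minimal sketch catching both error types (HTTPError is a subclass of URLError, so it must be caught first):

    # Python 2.6+ (for the timeout parameter and "except ... as" syntax).
    import urllib2

    try:
        response = urllib2.urlopen('http://cs.princeton.edu/fish.html', timeout=5)
        html = response.read()
    except urllib2.HTTPError as e:
        print 'Server returned an error:', e.code      # e.g. 404
    except urllib2.URLError as e:
        print 'Could not reach the server:', e.reason  # e.g. timed out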
