Edit: after many more tries, it turns out that urlgrabber succeeds where urllib2 fails, even when it reports closing the connection after each file. There seems to be something wrong either with the way urllib2 handles proxies or with the way I use it! Anyway, here is the simplest possible code that grabs the files in a loop:
    import urlgrabber

    for i in range(1, 100):
        url = "http://www.iana.org/domains/example/"
        urlgrabber.urlgrab(url,
                           proxies={'http': 'http://<user>:<password>@<proxy url>:<proxy port>'},
                           keepalive=1, close_connection=1, throttle=0)
Hello to all!
I am trying to write a very simple Python script to grab a bunch of files through urllib2.
The script has to work through a proxy at work (the problem does not occur when I fetch files on the intranet, i.e. without the proxy).
The thing is, the script crashes after several requests with "HTTPError: HTTP Error 401: basic auth failed." Any idea why that could be? The proxy seems to be rejecting my authentication, but why? The first couple of urlopen requests go through fine!
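The only extra information I can think of squeezing out of the failure is the status code and the headers the proxy sends back. A minimal sketch of how that could be inspected (assuming the opener from the script further down is already installed):

    import urllib2

    try:
        urllib2.urlopen("http://www.iana.org/domains/example/")
    except urllib2.HTTPError as e:
        # e.code is the HTTP status, e.hdrs holds the headers of the error
        # response (e.g. any WWW-Authenticate / Proxy-Authenticate challenge).
        print e.code, e.msg
        print e.hdrs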
Edit: Adding a 10-second sleep between requests, to rule out any throttling the proxy might be doing, did not change the results.
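For completeness, that test was just the download loop from the script below with a delay added at the end of each iteration, roughly:

    import time
    import urllib2

    for i in range(100):
        with open("e:/tmp/images/tst{}.htm".format(i), "w") as outfile:
            f = urllib2.urlopen("http://www.iana.org/domains/example/")
            outfile.write(f.read())
        time.sleep(10)   # pause between requests; made no difference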
Here is a simplified version of my script (with the identifying details replaced by obvious placeholders):
    import urllib2

    # Proxy address and credentials below are placeholders for the real values.
    passmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    passmgr.add_password(None, '<proxy url>:<proxy port>', '<my user name>', '<my password>')
    authinfo = urllib2.ProxyBasicAuthHandler(passmgr)

    proxy_support = urllib2.ProxyHandler({"http": "<proxy http address>"})
    opener = urllib2.build_opener(authinfo, proxy_support)
    urllib2.install_opener(opener)

    # Fetch the same page 100 times and save each copy to its own file.
    for i in range(100):
        with open("e:/tmp/images/tst{}.htm".format(i), "w") as outfile:
            f = urllib2.urlopen("http://www.iana.org/domains/example/")
            outfile.write(f.read())
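For reference, the urlgrabber call in the edit above passes the user name and password directly inside the proxy URL instead of going through ProxyBasicAuthHandler. The urllib2 counterpart of that (just a sketch with the same placeholders; I have not confirmed it behaves any differently) would be:

    import urllib2

    # Credentials embedded in the proxy URL, mirroring the urlgrabber call above.
    proxy = urllib2.ProxyHandler(
        {"http": "http://<user>:<password>@<proxy url>:<proxy port>"})
    opener = urllib2.build_opener(proxy)
    urllib2.install_opener(opener)

    f = urllib2.urlopen("http://www.iana.org/domains/example/")
    print f.read()[:80]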
Thanks in advance!