Edit: after many more tries, it turns out that urlgrabber succeeds where urllib2 fails, even when it reports closing the connection after each file. There seems to be something wrong either with the way urllib2 handles proxies or with the way I use it! Anyway, here is the simplest possible code that grabs the files in a loop:
    import urlgrabber

    for i in range(1, 100):
        url = "http://www.iana.org/domains/example/"
        urlgrabber.urlgrab(url,
                           proxies={'http': 'http://<user>:<password>@<proxy url>:<proxy port>'},
                           keepalive=1, close_connection=1, throttle=0)
Hello to all!
I am trying to write a very simple Python script to grab a bunch of files through urllib2.
The script has to work through a proxy at work (the problem does not occur when I fetch files on the intranet, i.e. without the proxy).
The thing is, the script crashes after several requests with "HTTPError: HTTP Error 401: basic auth failed." Any idea why that could be? The proxy seems to be rejecting my authentication, but why? The first couple of urlopen requests go through fine!
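The only extra information I can think of squeezing out of the failure is the status code and the headers the proxy sends back. A minimal sketch of how that could be inspected (assuming the opener from the script further down is already installed):

    import urllib2

    try:
        urllib2.urlopen("http://www.iana.org/domains/example/")
    except urllib2.HTTPError as e:
        # e.code is the HTTP status, e.hdrs holds the headers of the error
        # response (e.g. any WWW-Authenticate / Proxy-Authenticate challenge).
        print e.code, e.msg
        print e.hdrs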
Edit: Adding a 10-second sleep between requests, to rule out any throttling the proxy might be doing, did not change the results.
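For completeness, that test was just the download loop from the script below with a delay added at the end of each iteration, roughly:

    import time
    import urllib2

    for i in range(100):
        with open("e:/tmp/images/tst{}.htm".format(i), "w") as outfile:
            f = urllib2.urlopen("http://www.iana.org/domains/example/")
            outfile.write(f.read())
        time.sleep(10)   # pause between requests; made no difference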
Here is a simplified version of my script (with the identifying details replaced by obvious placeholders):
    import urllib2

    # Proxy address and credentials below are placeholders for the real values.
    passmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    passmgr.add_password(None, '<proxy url>:<proxy port>', '<my user name>', '<my password>')
    authinfo = urllib2.ProxyBasicAuthHandler(passmgr)

    proxy_support = urllib2.ProxyHandler({"http": "<proxy http address>"})
    opener = urllib2.build_opener(authinfo, proxy_support)
    urllib2.install_opener(opener)

    # Fetch the same page 100 times and save each copy to its own file.
    for i in range(100):
        with open("e:/tmp/images/tst{}.htm".format(i), "w") as outfile:
            f = urllib2.urlopen("http://www.iana.org/domains/example/")
            outfile.write(f.read())
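For reference, the urlgrabber call in the edit above passes the user name and password directly inside the proxy URL instead of going through ProxyBasicAuthHandler. The urllib2 counterpart of that (just a sketch with the same placeholders; I have not confirmed it behaves any differently) would be:

    import urllib2

    # Credentials embedded in the proxy URL, mirroring the urlgrabber call above.
    proxy = urllib2.ProxyHandler(
        {"http": "http://<user>:<password>@<proxy url>:<proxy port>"})
    opener = urllib2.build_opener(proxy)
    urllib2.install_opener(opener)

    f = urllib2.urlopen("http://www.iana.org/domains/example/")
    print f.read()[:80]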
Thanks in advance!