How to make urllib2 requests via Tor in Python?

I am trying to crawl websites using a crawler written in Python. I want to integrate Tor with Python, which means I want to anonymously crawl the site using Tor.

I tried the code below, but it does not seem to work: my IP address, checked from Python, is still the same as before I used Tor.

    import urllib2
    proxy_handler = urllib2.ProxyHandler({"tcp":"http://127.0.0.1:9050"})
    opener = urllib2.build_opener(proxy_handler)
    urllib2.install_opener(opener)
+47
python tor
Jul 08 '09 at 6:22
12 answers

You are trying to connect to the SOCKS port - Tor rejects any non-SOCKS traffic. You can connect through an intermediary, Privoxy, using port 8118.

Example:

    proxy_support = urllib2.ProxyHandler({"http" : "127.0.0.1:8118"})
    opener = urllib2.build_opener(proxy_support)
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    print opener.open('http://www.google.com').read()

Also note the value passed to ProxyHandler: it is just ip:port, without an http:// prefix.
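To see why the "tcp" key in the question has no effect, here is a minimal Python 3 sketch (in Python 3, urllib2 became urllib.request). ProxyHandler creates one `<scheme>_open` method per key in its mapping, so only keys that match a URL scheme such as `http` are ever consulted. The helper name `handles_scheme` is made up for illustration, and nothing is sent over the network.

```python
import urllib.request


def handles_scheme(proxies, scheme):
    """Return True if a ProxyHandler built from `proxies` intercepts `scheme` URLs.

    `handles_scheme` is a hypothetical helper, not part of urllib.
    """
    handler = urllib.request.ProxyHandler(proxies)
    # ProxyHandler defines a "<scheme>_open" method for every key it is given.
    return hasattr(handler, scheme + "_open")


# Mapping from the question: "tcp" is not a URL scheme, so http:// requests bypass it.
print(handles_scheme({"tcp": "http://127.0.0.1:9050"}, "http"))
# Mapping from this answer: http:// requests are sent to Privoxy on port 8118.
print(handles_scheme({"http": "127.0.0.1:8118"}, "http"))
```

The first call reports that http requests are not intercepted, which matches the symptom in the question: the IP stays the same because the proxy is never used.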

+21
Jan 06 '10 at 19:37
    pip install PySocks

Then:

    import socket
    import socks
    import urllib2

    ipcheck_url = 'http://checkip.amazonaws.com/'

    # Actual IP.
    print(urllib2.urlopen(ipcheck_url).read())

    # Tor IP.
    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9050)
    socket.socket = socks.socksocket
    print(urllib2.urlopen(ipcheck_url).read())

Using only urllib2.ProxyHandler, as in https://stackoverflow.com/a/166358/115635, fails with:

    Tor is not an HTTP Proxy

This is mentioned at "How to use the SOCKS 4/5 proxy server with urllib2?"

Tested on Ubuntu 15.10, Tor 0.2.6.10, Python 2.7.10.
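As a rough illustration of why an HTTP proxy handler cannot talk to port 9050, the sketch below builds the first bytes a SOCKS5 client sends (following RFC 1928). They are binary, not an HTTP request line; PySocks takes care of this exchange for you. The function names are made up for this example, and nothing is sent over the network.

```python
import struct


def socks5_greeting():
    """Client greeting: version 5, one auth method offered, 'no authentication' (0x00)."""
    return b"\x05\x01\x00"


def socks5_connect_request(host, port):
    """CONNECT request for a domain name (address type 0x03), port in network byte order."""
    name = host.encode("ascii")
    if len(name) > 255:
        raise ValueError("host name too long for SOCKS5")
    # VER=5, CMD=1 (CONNECT), RSV=0, ATYP=3, then length-prefixed name and 2-byte port.
    return b"\x05\x01\x00\x03" + bytes([len(name)]) + name + struct.pack(">H", port)


print(socks5_greeting())
print(socks5_connect_request("checkip.amazonaws.com", 80))
```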

+7
Dec 28 '15 at 12:40

Using Privoxy as an HTTP proxy in front of Tor works for me - here is a crawler template:

    import urllib2
    import httplib
    from optparse import OptionParser
    from time import sleep

    from BeautifulSoup import BeautifulSoup


    class Scraper(object):
        def __init__(self, options, args):
            if options.proxy is None:
                options.proxy = "http://localhost:8118/"
            self._open = self._get_opener(options.proxy)

        def _get_opener(self, proxy):
            proxy_handler = urllib2.ProxyHandler({'http': proxy})
            opener = urllib2.build_opener(proxy_handler)
            return opener.open

        def get_soup(self, url):
            soup = None
            while soup is None:
                try:
                    request = urllib2.Request(url)
                    request.add_header('User-Agent', 'foo bar useragent')
                    soup = BeautifulSoup(self._open(request))
                except (httplib.IncompleteRead, httplib.BadStatusLine,
                        urllib2.HTTPError, ValueError, urllib2.URLError), err:
                    # Retry after a short pause on transient errors.
                    sleep(1)
            return soup


    class PageType(Scraper):
        _URL_TEMPL = "http://foobar.com/baz/%s"

        def items_from_page(self, url):
            nextpage = None
            soup = self.get_soup(url)
            items = []
            for item in soup.findAll("foo"):
                items.append(item["bar"])
                nextpage = item["href"]
            return nextpage, items

        def get_items(self):
            nextpage, items = self.items_from_page(self._URL_TEMPL % "start.html")
            while nextpage is not None:
                nextpage, newitems = self.items_from_page(self._URL_TEMPL % nextpage)
                items.extend(newitems)
            return items


    # Scraper expects an options object with a "proxy" attribute.
    parser = OptionParser()
    parser.add_option("--proxy", default=None)
    options, args = parser.parse_args()

    pt = PageType(options, args)
    print pt.get_items()
+2
Jul 08 '09 at 12:13

Here is code for downloading a file through the Tor proxy in Python (update the URL):

    import urllib2

    url = "http://www.disneypicture.net/data/media/17/Donald_Duck2.gif"

    proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8118'})
    opener = urllib2.build_opener(proxy)
    urllib2.install_opener(opener)

    file_name = url.split('/')[-1]
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)

    file_size_dl = 0
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break

        file_size_dl += len(buffer)
        f.write(buffer)
        status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
        status = status + chr(8)*(len(status)+1)
        print status,

    f.close()
+2
Jan 27 '12 at 17:02

The following code works for me on Python 3.4.

(You need to keep Tor Browser open while running this code.)

This script connects to Tor via SOCKS5, gets the IP address from checkip.dyn.com, changes identity, and resends the request to get a new IP (looping 10 times).

To do this, you need to install the appropriate libraries. (Enjoy and do not abuse)

    import socks
    import socket
    import time
    from stem.control import Controller
    from stem import Signal
    import requests
    from bs4 import BeautifulSoup

    err = 0
    counter = 0
    url = "checkip.dyn.com"
    with Controller.from_port(port = 9151) as controller:
        try:
            controller.authenticate()
            socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150)
            socket.socket = socks.socksocket
            while counter < 10:
                r = requests.get("http://checkip.dyn.com")
                soup = BeautifulSoup(r.content)
                print(soup.find("body").text)
                counter = counter + 1
                #wait till next identity will be available
                controller.signal(Signal.NEWNYM)
                time.sleep(controller.get_newnym_wait())
        except requests.HTTPError:
            print("Could not reach URL")
            err = err + 1
    print("Used " + str(counter) + " IPs and got " + str(err) + " errors")
+2
May 22 '15 at 19:01

Update - The latest requests library (v2.10.0 and up) supports SOCKS proxies through the additional requirement requests[socks].

Installation -

    pip install requests requests[socks]

Basic usage -

    import requests

    session = requests.session()
    # Tor uses the 9050 port as the default socks port
    session.proxies = {'http': 'socks5://127.0.0.1:9050',
                       'https': 'socks5://127.0.0.1:9050'}

    # Make a request through the Tor connection
    # IP visible through Tor
    print session.get("http://httpbin.org/ip").text
    # Above should print an IP different than your public IP

    # Following prints your normal public IP
    print requests.get("http://httpbin.org/ip").text
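One detail worth knowing: with requests, the `socks5://` scheme resolves host names on the client, while `socks5h://` lets the proxy resolve them, which keeps DNS lookups inside Tor. Below is a small sketch of a helper that builds the `proxies` mapping. The name `tor_proxies` and its defaults are assumptions for illustration, not part of requests.

```python
def tor_proxies(host="127.0.0.1", port=9050, remote_dns=True):
    """Build a requests-style proxies mapping for a local Tor SOCKS port.

    Hypothetical helper. Port 9050 is the default for a standalone Tor;
    Tor Browser listens on 9150 instead.
    """
    # "socks5h" asks the proxy to resolve host names; "socks5" resolves them locally.
    scheme = "socks5h" if remote_dns else "socks5"
    proxy_url = "{}://{}:{}".format(scheme, host, port)
    return {"http": proxy_url, "https": proxy_url}


print(tor_proxies())

# Usage (needs `pip install requests[socks]` and a running Tor):
#   import requests
#   print(requests.get("http://httpbin.org/ip", proxies=tor_proxies()).text)
```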



Old answer - Even though this is an old post, I am answering because no one seems to have mentioned the requesocks library.

It is basically a port of the requests library. Please note that the library is an old fork (last updated 2013-03-25) and may not have the same functionality as the latest requests library.

Installation -

    pip install requesocks

Basic usage -

    # Assuming that Tor is up & running
    import requesocks

    session = requesocks.session()
    # Tor uses the 9050 port as the default socks port
    session.proxies = {'http': 'socks5://127.0.0.1:9050',
                       'https': 'socks5://127.0.0.1:9050'}

    # Make a request through the Tor connection
    # IP visible through Tor
    print session.get("http://httpbin.org/ip").text
    # Above should print an IP different than your public IP

    # Following prints your normal public IP
    import requests
    print requests.get("http://httpbin.org/ip").text
+2
Nov 23 '15 at 16:28

The following solution works for me in Python 3. Adapted from CiroSantilli's answer:

With urllib (the name of urllib2 in Python 3):

    import socks
    import socket
    from urllib.request import urlopen

    url = 'http://icanhazip.com/'

    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9150)
    socket.socket = socks.socksocket

    response = urlopen(url)
    print(response.read())

With requests:

    import socks
    import socket
    import requests

    url = 'http://icanhazip.com/'

    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9150)
    socket.socket = socks.socksocket

    response = requests.get(url)
    print(response.text)

With Selenium + PhantomJS:

    from selenium import webdriver

    url = 'http://icanhazip.com/'

    service_args = [
        '--proxy=localhost:9150',
        '--proxy-type=socks5',
    ]
    phantomjs_path = '/your/path/to/phantomjs'

    driver = webdriver.PhantomJS(
        executable_path=phantomjs_path,
        service_args=service_args)

    driver.get(url)
    print(driver.page_source)
    driver.close()

Note: If you plan to use Tor often, consider making a donation to support their amazing work.

+2
Oct 20 '16 at 12:46

Perhaps you have network connectivity issues? The above script worked for me (I substituted a different URL - I used http://stackoverflow.com/ - and I got the page as expected):

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd" >
    <html>
    <head>
    <title>Stack Overflow</title>
    <link rel="stylesheet" href="/content/all.css?v=3856">

(etc.).

+1
Jul 08 '09 at 6:34

Tor is a SOCKS proxy. Connecting to it directly using the example you cite fails with "urlopen error Tunnel connection failed: 501 Tor is not an HTTP Proxy". As others have said, you can get around this with Privoxy.

Alternatively, you can also use PycURL or SocksiPy. For examples of using both with Tor, see:

https://stem.torproject.org/tutorials/to_russia_with_love.html

0
Jun 04 '15 at 16:41

You can use torify.

Run your program with:

    ~$ torify python your_program.py
0
Nov 20 '16 at 23:24

Thought I'd just share the solution that worked for me (python3, windows10):

Step 1: Enable the Tor ControlPort on port 9151.

The Tor service runs on port 9150 by default, and the ControlPort on 9151. You should see the local addresses 127.0.0.1:9150 and 127.0.0.1:9151 when running netstat -an.

    [go to windows terminal]
    cd ...\Tor Browser\Browser\TorBrowser\Tor
    tor --service remove
    tor --service install -options ControlPort 9151
    netstat -an
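The netstat check can also be done from Python before the scraper starts. This is only a sketch using the standard library; `port_is_open` is a made-up helper name, and the port numbers are the ones given in this answer.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections and timeouts.
        return False


# 9150 = Tor Browser SOCKS port, 9151 = its ControlPort (values from this answer).
for port in (9150, 9151):
    state = "listening" if port_is_open("127.0.0.1", port) else "not reachable"
    print("127.0.0.1:{} is {}".format(port, state))
```

A successful connection only shows that some process is listening on the port, not that it is Tor.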

Step 2: Python script.

    # library to launch and kill Tor process
    import os
    import subprocess

    # library for Tor connection
    import socket
    import socks
    import http.client
    import time
    import requests
    from stem import Signal
    from stem.control import Controller

    # library for scraping
    import csv
    import urllib
    from bs4 import BeautifulSoup


    def launchTor():
        # start Tor (wait 30 sec for Tor to load)
        sproc = subprocess.Popen(r'.../Tor Browser/Browser/firefox.exe')
        time.sleep(30)
        return sproc


    def killTor(sproc):
        sproc.kill()


    def connectTor():
        socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150, True)
        socket.socket = socks.socksocket
        print("Connected to Tor")


    def set_new_ip():
        """Change IP using TOR"""
        # disable socks server and enabling again
        socks.setdefaultproxy()
        with Controller.from_port(port=9151) as controller:
            controller.authenticate()
            socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150, True)
            socket.socket = socks.socksocket
            controller.signal(Signal.NEWNYM)


    def checkIP():
        conn = http.client.HTTPConnection("icanhazip.com")
        conn.request("GET", "/")
        time.sleep(3)
        response = conn.getresponse()
        print('current ip address :', response.read())


    # Launch Tor and connect to Tor network
    sproc = launchTor()
    connectTor()

    # list of url to scrape
    url_list = []  # fill in all the urls you want to scrape

    for url in url_list:
        # set new ip and check ip before scraping for each new url
        set_new_ip()
        # allow some time for IP address to refresh
        time.sleep(5)
        checkIP()

        # [insert your scraping code here: bs4, urllib, your usual thingy]

    # remember to kill process
    killTor(sproc)

The script above renews the IP address for each URL that you want to scrape. Just make sure to sleep long enough for the IP to change. Last tested yesterday. Hope this helps!
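Instead of a fixed sleep, one option is to poll until the reported IP actually differs from the previous one. The sketch below assumes you have some zero-argument function that returns the current exit IP as a string (for example a variant of checkIP that returns the value instead of printing it). `wait_for_new_ip` and its parameters are hypothetical names, and the demo uses a fake IP source so it runs without Tor.

```python
import time


def wait_for_new_ip(get_ip, old_ip, timeout=60.0, interval=2.0, sleep=time.sleep):
    """Poll get_ip() until it differs from old_ip.

    Returns the new IP, or None if it did not change within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = get_ip()
        if current != old_ip:
            return current
        sleep(interval)
    return None


# Demo with a fake IP source: the third poll reports a different address.
fake_ips = iter(["1.1.1.1", "1.1.1.1", "2.2.2.2"])
print(wait_for_new_ip(lambda: next(fake_ips), "1.1.1.1", sleep=lambda seconds: None))
```

A return value of None is a real possibility, since a new identity is not guaranteed to use a different exit address, so the caller should decide whether to retry or carry on.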

0
Jan 08 '18 at 5:03

To expand on the comment above about using torify with the Tor Browser (no Privoxy needed):

    pip install PySocks
    pip install pyTorify

(Install the Tor Browser and run it.)

Using the command line:

    python -mtorify -p 127.0.0.1:9150 your_script.py

Or built into the script:

    import torify
    torify.set_tor_proxy("127.0.0.1", 9150)
    torify.disable_tor_check()
    torify.use_tor_proxy()

    # use urllib as normal
    import urllib.request
    req = urllib.request.Request("http://....")
    req.add_header("Referer", "http://...")
    # etc
    res = urllib.request.urlopen(req)
    html = res.read().decode("utf-8")

Please note: Tor Browser uses port 9150, not 9050.

0
Mar 20 '18 at 18:04


