How to make urllib2 requests via Tor in Python?

Question

I am trying to crawl websites using a crawler written in Python. I want to integrate Tor with Python, which means I want to anonymously crawl the site using Tor.

I tried the following, but it does not seem to work. I checked my IP from Python, and it is still the same as before I used Tor.

    import urllib2
    proxy_handler = urllib2.ProxyHandler({"tcp":"http://127.0.0.1:9050"})
    opener = urllib2.build_opener(proxy_handler)
    urllib2.install_opener(opener)

python tor

michael steve Jul 08 '09 at 6:22

12 answers

Dmitri Farkov · Answer 1 · 2010-01-06 19:37

You are trying to connect to the SOCKS port - Tor rejects any non-SOCKS traffic there. You can connect through an intermediary - Privoxy - using port 8118.

Example:

    proxy_support = urllib2.ProxyHandler({"http" : "127.0.0.1:8118"})
    opener = urllib2.build_opener(proxy_support)
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    print opener.open('http://www.google.com').read()

Also pay attention to the value passed to ProxyHandler: it is just ip:port, with no http prefix.
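The keys of the mapping given to ProxyHandler are URL schemes, so a "tcp" key is never applied to an http:// request. Here is a minimal Python 3 sketch of the same setup (urllib.request is where urllib2 lives in Python 3). The helper name build_privoxy_opener is hypothetical, and the default address assumes Privoxy is listening on 127.0.0.1:8118 and forwarding to Tor:

```python
import urllib.request


def build_privoxy_opener(proxy_addr="127.0.0.1:8118"):
    """Return an opener that sends http and https requests through Privoxy.

    Hypothetical helper; proxy_addr assumes Privoxy's default listen address.
    """
    # Keys are URL schemes, values are the proxy's host:port.
    # https is listed too so that https:// URLs do not bypass the proxy.
    proxies = {"http": proxy_addr, "https": proxy_addr}
    handler = urllib.request.ProxyHandler(proxies)
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [("User-agent", "Mozilla/5.0")]
    return opener


# Usage (needs Privoxy and Tor running):
# opener = build_privoxy_opener()
# print(opener.open("http://checkip.amazonaws.com/").read())
```

Returning the opener instead of calling install_opener keeps the proxy setting local to the code that asks for it.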

Ciro Santilli 新疆改造中心六四事件法轮功 · Answer 2 · 2015-12-28 12:40

 pip install PySocks

Then:

    import socket
    import socks
    import urllib2

    ipcheck_url = 'http://checkip.amazonaws.com/'

    # Actual IP.
    print(urllib2.urlopen(ipcheck_url).read())

    # Tor IP.
    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9050)
    socket.socket = socks.socksocket
    print(urllib2.urlopen(ipcheck_url).read())

Using only urllib2.ProxyHandler, as in https://stackoverflow.com/a/166358/115635, fails with:

    Tor is not an HTTP Proxy

Mentioned at How to use the SOCKS 4/5 proxy server with urllib2?

Tested on Ubuntu 15.10, Tor 0.2.6.10, Python 2.7.10.
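Assigning to socket.socket changes it for the whole process, so every later connection goes through the proxy. If only some requests should use Tor, one option is to scope the swap with a context manager. This is a sketch only, written for Python 3; the name use_socket_class is hypothetical, and the usage comment assumes PySocks is installed and Tor is listening on 9050:

```python
import contextlib
import socket


@contextlib.contextmanager
def use_socket_class(socket_cls):
    """Temporarily replace socket.socket, restoring the original on exit."""
    original = socket.socket
    socket.socket = socket_cls
    try:
        yield
    finally:
        # Restore even if the body raised, so later requests are direct again.
        socket.socket = original


# Usage (assumes PySocks and a running Tor):
# import socks
# socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9050)
# with use_socket_class(socks.socksocket):
#     ...  # requests made here go through Tor
```

Connections that were opened inside the block and are kept alive afterwards will still use the proxy; only new sockets are affected by the restore.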

Jochen Wersdörfer · Answer 3 · 2009-07-08 12:13

Using Privoxy as an HTTP proxy in front of Tor works for me - here is a crawler template:

    import urllib2
    import httplib
    from BeautifulSoup import BeautifulSoup
    from time import sleep

    class Scraper(object):
        def __init__(self, proxy=None):
            # Privoxy's default listen address; it forwards to Tor
            if proxy is None:
                proxy = "http://localhost:8118/"
            self._open = self._get_opener(proxy)

        def _get_opener(self, proxy):
            proxy_handler = urllib2.ProxyHandler({'http': proxy})
            opener = urllib2.build_opener(proxy_handler)
            return opener.open

        def get_soup(self, url):
            soup = None
            # Retry until the page has been fetched and parsed
            while soup is None:
                try:
                    request = urllib2.Request(url)
                    request.add_header('User-Agent', 'foo bar useragent')
                    soup = BeautifulSoup(self._open(request))
                except (httplib.IncompleteRead, httplib.BadStatusLine,
                        urllib2.HTTPError, ValueError, urllib2.URLError), err:
                    sleep(1)
            return soup

    class PageType(Scraper):
        _URL_TEMPL = "http://foobar.com/baz/%s"

        def items_from_page(self, url):
            nextpage = None
            soup = self.get_soup(url)
            items = []
            for item in soup.findAll("foo"):
                items.append(item["bar"])
                nextpage = item["href"]
            return nextpage, items

        def get_items(self):
            nextpage, items = self.items_from_page(self._URL_TEMPL % "start.html")
            while nextpage is not None:
                nextpage, newitems = self.items_from_page(self._URL_TEMPL % nextpage)
                items.extend(newitems)
            return items

    pt = PageType()
    print pt.get_items()

carloona · Answer 4 · 2012-01-27 17:02

Here is code for downloading files through the Tor proxy in Python (update the URL):

    import urllib2

    url = "http://www.disneypicture.net/data/media/17/Donald_Duck2.gif"

    proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8118'})
    opener = urllib2.build_opener(proxy)
    urllib2.install_opener(opener)

    file_name = url.split('/')[-1]
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    meta = u.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (file_name, file_size)

    file_size_dl = 0
    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break

        file_size_dl += len(buffer)
        f.write(buffer)
        status = r"%10d [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
        status = status + chr(8)*(len(status)+1)
        print status,

    f.close()

Amine · Answer 5 · 2015-05-22 19:01

The following code works 100% on Python 3.4

(you need to keep Tor Browser open while running this code)

This script connects to Tor via SOCKS5, obtains the IP address from checkip.dyn.com, requests a new identity, and resends the request to get a new IP (in a loop of 10 iterations).

To do this, you need to install the appropriate libraries. (Enjoy and do not abuse)

    import socks
    import socket
    import time
    from stem.control import Controller
    from stem import Signal
    import requests
    from bs4 import BeautifulSoup

    err = 0
    counter = 0
    url = "checkip.dyn.com"
    with Controller.from_port(port = 9151) as controller:
        try:
            controller.authenticate()
            socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150)
            socket.socket = socks.socksocket
            while counter < 10:
                r = requests.get("http://checkip.dyn.com")
                soup = BeautifulSoup(r.content)
                print(soup.find("body").text)
                counter = counter + 1
                #wait till next identity will be available
                controller.signal(Signal.NEWNYM)
                time.sleep(controller.get_newnym_wait())
        except requests.HTTPError:
            print("Could not reach URL")
            err = err + 1
    print("Used " + str(counter) + " IPs and got " + str(err) + " errors")

shad0w_wa1k3r · Answer 6 · 2015-11-23 16:28

Update - The latest (v2.10.0 and up) requests library supports SOCKS proxies with the additional requirement requests[socks].

Installation -

 pip install requests requests[socks]

Basic usage -

    import requests

    session = requests.session()
    # Tor uses the 9050 port as the default socks port
    session.proxies = {'http':  'socks5://127.0.0.1:9050',
                       'https': 'socks5://127.0.0.1:9050'}

    # Make a request through the Tor connection
    # IP visible through Tor
    print session.get("http://httpbin.org/ip").text
    # Above should print an IP different than your public IP

    # Following prints your normal public IP
    print requests.get("http://httpbin.org/ip").text
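One detail worth knowing with this approach: with a socks5:// proxy URL, requests resolves host names locally, so DNS lookups do not travel through Tor. The socks5h:// scheme asks the proxy to do the resolution instead. A small sketch of a helper that builds the proxies mapping either way; the name tor_proxies and its parameters are hypothetical, and the default port assumes a standalone Tor on 9050:

```python
def tor_proxies(host="127.0.0.1", port=9050, remote_dns=True):
    """Build a requests-style proxies mapping for a local Tor SOCKS port.

    Hypothetical helper. remote_dns=True selects socks5h, which lets the
    proxy resolve host names; False selects socks5 (local resolution).
    """
    scheme = "socks5h" if remote_dns else "socks5"
    url = "%s://%s:%d" % (scheme, host, port)
    return {"http": url, "https": url}


# Usage (assumes requests[socks] is installed and Tor is running):
# import requests
# session = requests.session()
# session.proxies = tor_proxies()
# print(session.get("http://httpbin.org/ip").text)
```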

Old answer - Even though this is an old post, I am answering because no one seems to have mentioned the requesocks library.

It is basically a port of the requests library. Please note that the library is an old fork (last updated 2013-03-25) and may not have the same functionality as the latest requests library.

Installation -

 pip install requesocks

Basic usage -

    # Assuming that Tor is up & running
    import requesocks

    session = requesocks.session()
    # Tor uses the 9050 port as the default socks port
    session.proxies = {'http':  'socks5://127.0.0.1:9050',
                       'https': 'socks5://127.0.0.1:9050'}

    # Make a request through the Tor connection
    # IP visible through Tor
    print session.get("http://httpbin.org/ip").text
    # Above should print an IP different than your public IP

    # Following prints your normal public IP
    import requests
    print requests.get("http://httpbin.org/ip").text

J0ANMM · Answer 7 · 2016-10-20 12:46

The following solution works for me in Python 3. Adapted from CiroSantilli's answer:

With urllib (the name of urllib2 in Python 3):

    import socks
    import socket
    from urllib.request import urlopen

    url = 'http://icanhazip.com/'

    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9150)
    socket.socket = socks.socksocket

    response = urlopen(url)
    print(response.read())

With requests:

    import socks
    import socket
    import requests

    url = 'http://icanhazip.com/'

    socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9150)
    socket.socket = socks.socksocket

    response = requests.get(url)
    print(response.text)

With Selenium + PhantomJS:

    from selenium import webdriver

    url = 'http://icanhazip.com/'

    service_args = [
        '--proxy=localhost:9150',
        '--proxy-type=socks5',
    ]
    phantomjs_path = '/your/path/to/phantomjs'

    driver = webdriver.PhantomJS(
        executable_path=phantomjs_path,
        service_args=service_args)

    driver.get(url)
    print(driver.page_source)
    driver.close()

Note: if you plan to use Tor often, consider making a donation to support their amazing work.

Vinay Sajip · Answer 8 · 2009-07-08 06:34

Perhaps you have network connectivity issues? The above script worked for me (I substituted a different URL - I used http://stackoverflow.com/ - and got the page as expected):

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd" >
    <html>
    <head>
    <title>Stack Overflow</title>
    <link rel="stylesheet" href="/content/all.css?v=3856">

(etc.).

Damian · Answer 9 · 2015-06-04 16:41

Tor is a SOCKS proxy. Connecting to it directly using the example you cite fails with the error "urlopen Tunnel connection error: 501 Tor is not an HTTP proxy." As others have said, you can get around this with Privoxy.

Alternatively, you can use PycURL or SocksiPy. For examples of using both with Tor, see:

https://stem.torproject.org/tutorials/to_russia_with_love.html

mohamed emad · Answer 10 · 2016-11-20 23:24

You can use torify.

Run your program with:

    ~$ torify python your_program.py

KittyBot · Answer 11 · 2018-01-08 05:03

Thought I'd just share the solution that worked for me (python3, windows10):

Step 1: Enable the Tor ControlPort on port 9151.

By default, Tor (as bundled with Tor Browser) listens on port 9150 and its ControlPort on 9151. You should see the local addresses 127.0.0.1:9150 and 127.0.0.1:9151 when running netstat -an.

    [go to windows terminal]
    cd ...\Tor Browser\Browser\TorBrowser\Tor
    tor --service remove
    tor --service install -options ControlPort 9151
    netstat -an

Step 2: Python script.

    # library to launch and kill Tor process
    import os
    import subprocess

    # library for Tor connection
    import socket
    import socks
    import http.client
    import time
    import requests
    from stem import Signal
    from stem.control import Controller

    # library for scraping
    import csv
    import urllib
    from bs4 import BeautifulSoup

    def launchTor():
        # start Tor (wait 30 sec for Tor to load)
        sproc = subprocess.Popen(r'.../Tor Browser/Browser/firefox.exe')
        time.sleep(30)
        return sproc

    def killTor(sproc):
        sproc.kill()

    def connectTor():
        socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150, True)
        socket.socket = socks.socksocket
        print("Connected to Tor")

    def set_new_ip():
        """Change IP using TOR"""
        # disable the socks proxy, then enable it again
        socks.setdefaultproxy()
        with Controller.from_port(port=9151) as controller:
            controller.authenticate()
            socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9150, True)
            socket.socket = socks.socksocket
            controller.signal(Signal.NEWNYM)

    def checkIP():
        conn = http.client.HTTPConnection("icanhazip.com")
        conn.request("GET", "/")
        time.sleep(3)
        response = conn.getresponse()
        print('current ip address :', response.read())

    # Launch Tor and connect to Tor network
    sproc = launchTor()
    connectTor()

    # list of url to scrape
    url_list = []  # fill in all the urls you want to scrape

    for url in url_list:
        # set new ip and check ip before scraping for each new url
        set_new_ip()
        # allow some time for IP address to refresh
        time.sleep(5)
        checkIP()

        '''
        [insert your scraping code here: bs4, urllib, your usual thingy]
        '''

    # remember to kill process
    killTor(sproc)

The script above renews the IP address for each URL that you want to scrape. Just make sure to sleep long enough for the IP to change. Last tested yesterday. Hope this helps!
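Instead of a fixed sleep, the wait can be done by polling until the reported address actually differs. Below is a sketch in Python 3; the name wait_for_new_ip and its parameters are hypothetical, and get_ip stands for any callable that returns the current exit IP as a string (for example, a wrapper around the icanhazip.com request). Note that a new identity may still exit through the same address, which is why the function gives up after a timeout rather than waiting forever:

```python
import time


def wait_for_new_ip(get_ip, old_ip, timeout=60.0, interval=5.0, sleep=time.sleep):
    """Poll get_ip() until it returns something other than old_ip.

    Returns the new address, or None if it did not change within
    timeout seconds. Hypothetical helper; get_ip is supplied by the caller.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = get_ip()
        if current != old_ip:
            return current
        if time.monotonic() >= deadline:
            return None
        sleep(interval)


# Usage, with a caller-supplied fetch_ip() that queries an IP-echo service:
# old_ip = fetch_ip()
# set_new_ip()
# new_ip = wait_for_new_ip(fetch_ip, old_ip)
```

The sleep parameter exists only so the waiting can be replaced in tests.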

Steve Lockwood · Answer 12 · 2018-03-20 18:04

To expand on the above comment about using torify and the Tor Browser (this does not need Privoxy):

    pip install PySocks
    pip install pyTorify

(install the Tor browser and run it)

Using the command line:

 python -mtorify -p 127.0.0.1:9150 your_script.py

Or built into the script:

    import torify
    torify.set_tor_proxy("127.0.0.1", 9150)
    torify.disable_tor_check()
    torify.use_tor_proxy()

    # use urllib as normal
    import urllib.request
    req = urllib.request.Request("http://....")
    req.add_header("Referer", "http://...") # etc
    res = urllib.request.urlopen(req)
    html = res.read().decode("utf-8")

Please note: Tor Browser uses port 9150, not 9050.
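Since the port differs between setups, a quick check of which one is accepting connections can save some guessing. This is a rough sketch using only the standard library; the name find_open_port is hypothetical, and the default candidates are simply the two ports mentioned in this thread:

```python
import socket


def find_open_port(candidates=(9050, 9150), host="127.0.0.1", timeout=1.0):
    """Return the first port in candidates that accepts a TCP connection.

    Returns None if none of them do. A successful connection only shows
    that something is listening there, not that it is Tor.
    """
    for port in candidates:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return port
        except OSError:
            # Refused or timed out: try the next candidate.
            continue
    return None


# Usage:
# port = find_open_port()
# print("SOCKS port candidate:", port)
```

Run it before any socket.socket patching is in place, otherwise the probe itself would be sent through the proxy.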
