Urllib2 & BeautifulSoup: nice couple, but too slow - urllib3 and threads?

I was looking for a way to optimize my code when I heard good things about threads and urllib3. People seem to disagree on which solution is the best.

The problem with my script below is the runtime: so slow!

Step 1: I retrieve this page: http://www.cambridgeesol.org/institutions/results.php?region=Afghanistan&type=&BULATS=on

Step 2: I parse the page with BeautifulSoup.

Step 3: I write the data into an Excel document.

Step 4: I repeat this again and again for every country on my (large) list, just changing "Afghanistan" in the URL to the next country.

Here is my code:

    ws = wb.add_sheet("BULATS_IA")  # We add a new tab in the excel doc
    x = 0  # We need x and y for pulling the data into the excel doc
    y = 0

    Countries_List = ['Afghanistan','Albania','Andorra','Argentina','Armenia','Australia','Austria','Azerbaijan','Bahrain','Bangladesh','Belgium','Belize','Bolivia','Bosnia and Herzegovina','Brazil','Brunei Darussalam','Bulgaria','Cameroon','Canada','Central African Republic','Chile','China','Colombia','Costa Rica','Croatia','Cuba','Cyprus','Czech Republic','Denmark','Dominican Republic','Ecuador','Egypt','Eritrea','Estonia','Ethiopia','Faroe Islands','Fiji','Finland','France','French Polynesia','Georgia','Germany','Gibraltar','Greece','Grenada','Hong Kong','Hungary','Iceland','India','Indonesia','Iran','Iraq','Ireland','Israel','Italy','Jamaica','Japan','Jordan','Kazakhstan','Kenya','Kuwait','Latvia','Lebanon','Libya','Liechtenstein','Lithuania','Luxembourg','Macau','Macedonia','Malaysia','Maldives','Malta','Mexico','Monaco','Montenegro','Morocco','Mozambique','Myanmar (Burma)','Nepal','Netherlands','New Caledonia','New Zealand','Nigeria','Norway','Oman','Pakistan','Palestine','Papua New Guinea','Paraguay','Peru','Philippines','Poland','Portugal','Qatar','Romania','Russia','Saudi Arabia','Serbia','Singapore','Slovakia','Slovenia','South Africa','South Korea','Spain','Sri Lanka','Sweden','Switzerland','Syria','Taiwan','Thailand','Trinadad and Tobago','Tunisia','Turkey','Ukraine','United Arab Emirates','United Kingdom','United States','Uruguay','Uzbekistan','Venezuela','Vietnam']
    Longueur = len(Countries_List)

    for Countries in Countries_List:
        y = 0
        # I am opening the page with the name of the corresponding country in the url
        htmlSource = urllib.urlopen("http://www.cambridgeesol.org/institutions/results.php?region=%s&type=&BULATS=on" % (Countries)).read()
        s = soup(htmlSource)
        tableGood = s.findAll('table')
        try:
            rows = tableGood[3].findAll('tr')
            for tr in rows:
                cols = tr.findAll('td')
                y = 0
                x = x + 1
                for td in cols:
                    hum = td.text
                    ws.write(x, y, hum)
                    y = y + 1
            wb.save("%s.xls" % name_excel)
        except IndexError:
            pass

So, I know that everything is not perfect, but I look forward to learning new things in Python! The script is very slow because urllib2 is not that fast, and neither is BeautifulSoup. For the BeautifulSoup part I don't think I can do much better, but for urllib2 I probably can.

EDIT 1: Is multiprocessing useless with urllib2? It seems interesting in my case. What do you think of this potential solution?

    # Make sure that the queue is thread-safe!!

    def producer(self):
        # Only need one producer, although you could have multiple
        with open('urllist.txt', 'r') as fh:
            for line in fh:
                self.queue.enqueue(line.strip())

    def consumer(self):
        # Fire up N of these babies for some speed
        while True:
            url = self.queue.dequeue()
            dh = urllib2.urlopen(url)
            with open('/dev/null', 'w') as fh:  # gotta put it somewhere
                fh.write(dh.read())
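
For what it's worth, here is a rough sketch of how that producer/consumer idea could be wired up with the standard library's thread-safe Queue.Queue (my own illustration, not part of the snippet above; 'urllist.txt' and the count of 5 consumer threads are placeholders):

    import threading
    import urllib2
    import Queue  # Queue.Queue is already thread-safe, no extra locking needed

    queue = Queue.Queue()

    def producer():
        # One producer: read the URLs to fetch from a file (placeholder name)
        with open('urllist.txt', 'r') as fh:
            for line in fh:
                queue.put(line.strip())

    def consumer():
        # Each consumer thread pulls a URL, downloads it, and marks the job done
        while True:
            url = queue.get()
            body = urllib2.urlopen(url).read()
            # ... parse `body` here instead of throwing it away
            queue.task_done()

    producer()
    for _ in range(5):  # fire up N consumer threads
        t = threading.Thread(target=consumer)
        t.daemon = True
        t.start()
    queue.join()  # block until every queued URL has been handled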

EDIT 2: urllib3 - can someone tell me more about this?

Reuse a single socket connection for multiple requests (HTTPConnectionPool and HTTPSConnectionPool) (with optional client-side certificate verification). https://github.com/shazow/urllib3

Since I request the same site 122 times for different pages, I think that reusing a single socket connection could be interesting. Am I mistaken? Couldn't it be faster? ...

    http = urllib3.PoolManager()
    r = http.request('GET', 'http://www.bulats.org')
    for Pages in Pages_List:
        r = http.request('GET', 'http://www.bulats.org/agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=%s' % (Pages))
        s = soup(r.data)
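
The quote above also mentions HTTPConnectionPool; a minimal sketch of what pointing one at a single host could look like (illustrative only, with a shortened country list):

    import urllib3

    # One pool per host; the requests below reuse the same socket(s)
    pool = urllib3.HTTPConnectionPool('www.cambridgeesol.org', maxsize=1)

    for country in ['Afghanistan', 'Albania', 'Andorra']:  # shortened list for the example
        r = pool.request('GET', '/institutions/results.php?region=%s&type=&BULATS=on' % country)
        html = r.data  # hand this to BeautifulSoup as before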
+6
3 answers

Consider using something like workerpool. Referring to the Mass Downloader example, combining it with urllib3 would look something like this:

    import workerpool
    import urllib3

    URL_LIST = []  # Fill this from somewhere

    NUM_SOCKETS = 3
    NUM_WORKERS = 5

    # We want a few more workers than sockets so that they have extra
    # time to parse things and such.

    http = urllib3.PoolManager(maxsize=NUM_SOCKETS)
    workers = workerpool.WorkerPool(size=NUM_WORKERS)

    class MyJob(workerpool.Job):
        def __init__(self, url):
            self.url = url

        def run(self):
            r = http.request('GET', self.url)
            # ... do parsing stuff here

    for url in URL_LIST:
        workers.put(MyJob(url))

    # Send shutdown jobs to all threads, and wait until all the jobs have been completed
    # (If you don't do this, the script might hang due to a rogue undead thread.)
    workers.shutdown()
    workers.wait()

As you can see from the Mass Downloader examples, there are several ways to do this. I chose this specific one only because it is the least magical, but any of the other strategies are also valid.
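
For the question's use case, URL_LIST could be filled from the Countries_List defined earlier (my own sketch, not part of the answer; urllib.quote handles the spaces in names like 'Bosnia and Herzegovina'):

    import urllib

    BASE = "http://www.cambridgeesol.org/institutions/results.php?region=%s&type=&BULATS=on"
    URL_LIST = [BASE % urllib.quote(country) for country in Countries_List]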

Disclaimer: I am the author of both urllib3 and workerpool.

+9

I don't think urllib or BeautifulSoup is slow. I ran your code on my local machine with a modified version (the Excel stuff removed). It took about 100 ms per country to open the connection, download the content, parse it, and print it to the console.

About 10 ms of that is the total time BeautifulSoup spent parsing the content and printing it to the console for each country. That is fast enough.
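
If you want to reproduce that measurement yourself, a minimal timing sketch (my own, assuming Python 2 and bs4, for a single country from the question) could look like this:

    import time
    import urllib
    from bs4 import BeautifulSoup

    start = time.time()
    html = urllib.urlopen("http://www.cambridgeesol.org/institutions/results.php"
                          "?region=Afghanistan&type=&BULATS=on").read()
    fetched = time.time()

    tables = BeautifulSoup(html).findAll('table')  # the parsing step from the question
    parsed = time.time()

    print "download: %.3fs  parse: %.3fs" % (fetched - start, parsed - fetched)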

I also don't believe that using Scrapy or threading will solve the problem, because the problem is the expectation that it should be fast.

Welcome to the world of HTTP. It will sometimes be slow and sometimes very fast. A couple of reasons a connection can be slow:

  • the server processing your request (which sometimes returns 404s)
  • DNS resolution
  • establishing the HTTP connection
  • the stability of your internet connection
  • your bandwidth
  • the packet loss rate

etc.

Don't forget that you are trying to make 121 HTTP requests to a server, and you don't know what kind of servers they have. They might also ban your IP address because of the consecutive calls.

Take a look at the Requests library and read its documentation. If you are doing this to learn more about Python, don't jump straight to Scrapy.
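
As an illustration (not part of the original answer), a minimal sketch with requests that reuses the connection via a Session, using the URL pattern from the question:

    import requests

    # A Session reuses the underlying TCP connection across requests to the same host
    session = requests.Session()

    for country in ['Afghanistan', 'Albania']:  # placeholder subset of the country list
        resp = session.get('http://www.cambridgeesol.org/institutions/results.php',
                           params={'region': country, 'type': '', 'BULATS': 'on'},
                           timeout=10)
        html = resp.text  # feed this to BeautifulSoup as before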

+2

Hi guys,

Some news on this problem! I found this script, which may be useful! I actually tested it and it is promising (6.03 s to run the script below).

My idea is to find a way to combine this with urllib3. Indeed, I am making requests to the same host many times.

PoolManager will take care of reusing connections for you whenever you request the same host. This should cover most scenarios without significant loss of efficiency, but you can always drop down to a lower-level component for more granular control. (urllib3 documentation)

In any case, this seems very interesting, and even if I don't yet see how to combine these two things (urllib3 and the threading script below - see the sketch after the script), I think it is doable! :-)

Thank you for taking the time to give me a hand with this. It looks promising!

    import Queue
    import threading
    import urllib2
    import time
    from bs4 import BeautifulSoup as BeautifulSoup

    hosts = ["http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All",
             "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=1",
             "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=2",
             "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=3",
             "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=4",
             "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=5",
             "http://www.bulats.org//agents/find-an-agent?field_continent_tid=All&field_country_tid=All&page=6"]

    queue = Queue.Queue()
    out_queue = Queue.Queue()

    class ThreadUrl(threading.Thread):
        """Threaded Url Grab"""
        def __init__(self, queue, out_queue):
            threading.Thread.__init__(self)
            self.queue = queue
            self.out_queue = out_queue

        def run(self):
            while True:
                # grabs host from queue
                host = self.queue.get()

                # grabs urls of hosts and then grabs chunk of webpage
                url = urllib2.urlopen(host)
                chunk = url.read()

                # place chunk into out queue
                self.out_queue.put(chunk)

                # signals to queue job is done
                self.queue.task_done()

    class DatamineThread(threading.Thread):
        """Threaded Url Grab"""
        def __init__(self, out_queue):
            threading.Thread.__init__(self)
            self.out_queue = out_queue

        def run(self):
            while True:
                # grabs chunk from out queue
                chunk = self.out_queue.get()

                # parse the chunk
                soup = BeautifulSoup(chunk)
                #print soup.findAll(['table'])
                tableau = soup.find('table')
                rows = tableau.findAll('tr')
                for tr in rows:
                    cols = tr.findAll('td')
                    for td in cols:
                        texte_bu = td.text
                        texte_bu = texte_bu.encode('utf-8')
                        print texte_bu

                # signals to queue job is done
                self.out_queue.task_done()

    start = time.time()

    def main():
        # spawn a pool of threads, and pass them the queue instances
        for i in range(5):
            t = ThreadUrl(queue, out_queue)
            t.setDaemon(True)
            t.start()

        # populate queue with data
        for host in hosts:
            queue.put(host)

        for i in range(5):
            dt = DatamineThread(out_queue)
            dt.setDaemon(True)
            dt.start()

        # wait on the queues until everything has been processed
        queue.join()
        out_queue.join()

    main()
    print "Elapsed Time: %s" % (time.time() - start)
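
One possible way to combine the two, sketched here as an illustration only (the subclass name ThreadUrlPooled is made up): keep the script above, but route the fetching threads through a shared urllib3.PoolManager, so the connections to www.bulats.org are reused across workers.

    import urllib3

    # Shared pool: urllib3's pools are thread-safe, so every worker can fetch
    # through the same PoolManager and reuse connections to the host.
    http = urllib3.PoolManager(maxsize=5)

    class ThreadUrlPooled(ThreadUrl):
        """Same as ThreadUrl above, but fetches through the shared PoolManager."""
        def run(self):
            while True:
                host = self.queue.get()
                chunk = http.request('GET', host).data  # replaces urllib2.urlopen(host).read()
                self.out_queue.put(chunk)
                self.queue.task_done()

In main(), spawning ThreadUrlPooled instead of ThreadUrl would then be the only other change.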
0

Source: https://habr.com/ru/post/913824/

