Does Perl outperform Python at fetching HTML pages?

I have a Perl script that retrieves HTML pages. I tried rewriting it in Python (just because I'm trying to learn Python), and I found it to be very slow!

Here's a test script in Perl

    #!/usr/bin/perl
    use LWP::Simple;

    $url = "http://majorgeeks.com/page.php?id=";
    open(WEB, ">>" . "perldata.txt");

    for ($column = 1; $column <= 20; $column++) {
        $temp = $url . $column;
        print "val = $temp\n\n";
        $response = get($temp) or die("[-] Failed!!\n");
        print WEB "$response\n\n";
    }

And here is the equivalent code in Python

    import urllib2

    url = "http://majorgeeks.com/page.php?id="
    f = open("pydata.txt", 'w')

    for i in range(20):
        tempurl = url + str(i + 1)
        print "Val : " + tempurl + "\n\n"
        #req = urllib2.Request(tempurl)
        res = urllib2.urlopen(tempurl)
        f.write(res.read())

    f.close()

The difference I found is huge! The Perl script completed in about 30 seconds, while the Python script took about 7 minutes (420 seconds)!

I'm using Ubuntu 11.10, 64-bit, on a Core i7, and tested on a 12 Mbps connection. I tried this several times, and each time I got the same difference.

Am I doing something wrong here? Is there something I need to change? Or is the difference justified? (I hope not.)

Many thanks for your help.

Update 3: I just got home and booted my laptop, ran the code again, and it finished in 11 seconds!!! :/ Is it because I "rebooted" my computer? Here is the profiler output.

Note: Perl still takes 31 seconds to do the same! :/

Update 2: As suggested by @Makoto, here is the profiling data I collected. And it is very slow! I suspect something in my Python setup is responsible, but I don't know what. One simple request should not take 20 seconds.

Update: Fixed url to tempurl. Commented out urllib2.Request as suggested. It didn't make much of a difference.

3 answers

Your code can be improved, although I'm not sure it will fix all of the performance issues:

    from urllib2 import urlopen

    url = "http://majorgeeks.com/page.php?id={}"

    with open("pydata.txt", 'w') as f:
        for i in xrange(1, 21):
            tempurl = url.format(i)
            print "Val : {}\n\n".format(tempurl)
            f.write(urlopen(tempurl).read())

I also changed the logic - it now requests a different URL each time (defined by tempurl), whereas before it requested the same URL (defined by url) 20 times. I also used string formatting, although I'm not sure how this affects performance.

I tested it on my system (64-bit Windows 7, Python 2.7.2, run from IDLE, moderate Internet connection), and it took 40 seconds to complete (40.262 s).


I'm still scratching my head trying to understand why this code takes so long for both @mayjune and @Tadeck. I had a chance to run both code snippets through the profiler, and here are the results. I highly recommend that you reproduce these results on your own machine, since mine will differ (AMD Athlon II X4 @ 3 GHz, 8 GB RAM, Ubuntu 11.04 x64, 7 Mbit line).

To run the profiler:

    python -m cProfile -o profile.dat <path/to/code.py>; python -m pstats profile.dat

(From inside the pstats prompt, type help to see the available commands.)
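If you would rather not use the interactive prompt, a minimal sketch that reads the same dump with the pstats module directly (assuming the profile.dat filename used above) is:

    # Minimal sketch: load the cProfile dump and print the top 15 entries
    # sorted by cumulative time, matching the interactive session above.
    import pstats

    stats = pstats.Stats('profile.dat')
    stats.sort_stats('cumulative').print_stats(15)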


Original code:

    Fri Jan  6 17:49:29 2012    profile.dat

             20966 function calls (20665 primitive calls) in 13.566 CPU seconds

       Ordered by: cumulative time
       List reduced from 306 to 15 due to restriction <15>

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.001    0.001   13.567   13.567 websiteretrieval.py:1(<module>)
           20    0.000    0.000    7.874    0.394 /usr/lib/python2.7/urllib2.py:122(urlopen)
           20    0.000    0.000    7.874    0.394 /usr/lib/python2.7/urllib2.py:373(open)
           20    0.000    0.000    7.870    0.394 /usr/lib/python2.7/urllib2.py:401(_open)
           40    0.000    0.000    7.870    0.197 /usr/lib/python2.7/urllib2.py:361(_call_chain)
           20    0.000    0.000    7.870    0.393 /usr/lib/python2.7/urllib2.py:1184(http_open)
           20    0.001    0.000    7.870    0.393 /usr/lib/python2.7/urllib2.py:1112(do_open)
         1178    7.596    0.006    7.596    0.006 {method 'recv' of '_socket.socket' objects}
           20    0.000    0.000    5.911    0.296 /usr/lib/python2.7/httplib.py:953(request)
           20    0.000    0.000    5.911    0.296 /usr/lib/python2.7/httplib.py:974(_send_request)
           20    0.000    0.000    5.911    0.296 /usr/lib/python2.7/httplib.py:938(endheaders)
           20    0.000    0.000    5.911    0.296 /usr/lib/python2.7/httplib.py:796(_send_output)
           20    0.000    0.000    5.910    0.296 /usr/lib/python2.7/httplib.py:769(send)
           20    0.000    0.000    5.909    0.295 /usr/lib/python2.7/httplib.py:751(connect)
           20    0.001    0.000    5.909    0.295 /usr/lib/python2.7/socket.py:537(create_connection)

...so, from observation, the only things that could be slowing you down are urlopen and open. Network I/O is slow, so this is understandable.

Revised code:

    Fri Jan  6 17:52:36 2012    profileTadeck.dat

             21008 function calls (20707 primitive calls) in 13.249 CPU seconds

       Ordered by: cumulative time
       List reduced from 305 to 15 due to restriction <15>

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.002    0.002   13.249   13.249 websiteretrievalTadeck.py:1(<module>)
           20    0.000    0.000    7.706    0.385 /usr/lib/python2.7/urllib2.py:122(urlopen)
           20    0.000    0.000    7.706    0.385 /usr/lib/python2.7/urllib2.py:373(open)
           20    0.000    0.000    7.702    0.385 /usr/lib/python2.7/urllib2.py:401(_open)
           40    0.000    0.000    7.702    0.193 /usr/lib/python2.7/urllib2.py:361(_call_chain)
           20    0.000    0.000    7.702    0.385 /usr/lib/python2.7/urllib2.py:1184(http_open)
           20    0.001    0.000    7.702    0.385 /usr/lib/python2.7/urllib2.py:1112(do_open)
         1178    7.348    0.006    7.348    0.006 {method 'recv' of '_socket.socket' objects}
           20    0.000    0.000    5.841    0.292 /usr/lib/python2.7/httplib.py:953(request)
           20    0.000    0.000    5.841    0.292 /usr/lib/python2.7/httplib.py:974(_send_request)
           20    0.000    0.000    5.840    0.292 /usr/lib/python2.7/httplib.py:938(endheaders)
           20    0.000    0.000    5.840    0.292 /usr/lib/python2.7/httplib.py:796(_send_output)
           20    0.000    0.000    5.840    0.292 /usr/lib/python2.7/httplib.py:769(send)
           20    0.000    0.000    5.839    0.292 /usr/lib/python2.7/httplib.py:751(connect)
           20    0.001    0.000    5.839    0.292 /usr/lib/python2.7/socket.py:537(create_connection)

Again, the two biggest culprits for time spent are urlopen and open. This makes me think that I/O dominates your code's run time. However, the difference is not significant on the machine I tested on - the Perl script runs in about the same time:

    real    0m11.129s
    user    0m0.230s
    sys     0m0.070s

I'm not convinced that a software issue is what makes your code slow, especially since your machine is rather beefy. I highly recommend you run the profiler yourself (the commands are included above) to see whether you can find any bottlenecks that I missed.
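If the profiler doesn't point at anything obvious, it may also help to time each request on its own, so a single slow page or DNS stall stands out from the rest. A minimal sketch, reusing the URL from the question:

    # Minimal sketch: time each of the 20 requests separately.
    import time
    import urllib2

    url = "http://majorgeeks.com/page.php?id="

    for i in range(1, 21):
        start = time.time()
        urllib2.urlopen(url + str(i)).read()
        print "id=%d took %.2f s" % (i, time.time() - start)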


I really don't know why you get such weird results. But let me suggest a very quick workaround: use an asynchronous library to make the requests concurrently. I like gevent, which has a very nice interface in the requests library.

The code:

    from requests import async
    import time

    begin = time.time()

    url = "http://majorgeeks.com/page.php?id=%s"
    rs = [async.get(url % i) for i in xrange(1, 21)]
    responses = async.map(rs, size=10)

    with open("pydata.txt", 'w') as f:
        for response in responses:
            print response.url
            f.write(response.content)

    print 'Elapsed:', (time.time() - begin)

It takes only 2.45 seconds.
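As far as I know, requests' async module is itself built on gevent, so if you'd rather not pull in requests you can get roughly the same effect with gevent and urllib2 directly. A rough sketch (assuming gevent is installed; the fetch helper is just for illustration):

    # Rough sketch: the same 20 downloads driven by gevent directly.
    # monkey.patch_all() makes the standard library's sockets cooperative,
    # so the urllib2 calls inside the greenlets can overlap.
    import gevent
    from gevent import monkey
    monkey.patch_all()

    import urllib2

    url = "http://majorgeeks.com/page.php?id=%s"

    def fetch(i):
        return urllib2.urlopen(url % i).read()

    jobs = [gevent.spawn(fetch, i) for i in xrange(1, 21)]
    gevent.joinall(jobs)

    with open("pydata.txt", 'w') as f:
        for job in jobs:
            f.write(job.value)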

EDIT

Possible causes of a slow urllib2.urlopen (a sketch for checking both follows the list):

  • an http_proxy setting in the system environment

  • the site somehow throttles the urllib2 user agent to limit automated crawling
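A minimal sketch for checking both possibilities - it prints whatever proxies would be picked up from the environment, then times one request made through an opener that bypasses proxies and sends a browser-like User-Agent (the header string is just an example):

    # Minimal sketch: rule out an environment proxy and a throttled
    # default user agent as causes of the slow urlopen calls.
    import time
    import urllib
    import urllib2

    # Proxies that urllib2 would pick up from http_proxy and friends.
    print urllib.getproxies()

    # Opener that bypasses environment proxies, plus a browser-like UA
    # instead of urllib2's default "Python-urllib" agent string.
    opener = urllib2.build_opener(urllib2.ProxyHandler({}))
    request = urllib2.Request("http://majorgeeks.com/page.php?id=1",
                              headers={'User-Agent': 'Mozilla/5.0'})

    start = time.time()
    opener.open(request).read()
    print "One request took %.2f s" % (time.time() - start)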

