How to get faster speed when using multithreading in python

I'm currently learning how to retrieve data from a site as quickly as possible. To get higher speed, I am considering using multiple threads. Here is the code I used to compare a multi-threaded POST against a simple sequential POST:

import threading
import time
import urllib
import urllib2

class Post:
    def __init__(self, website, data, mode):
        self.website = website
        self.data = data
        #mode is either "Simple" (simple POST) or "Multiple" (multi-threaded POST)
        self.mode = mode

    def post(self):
        #post data
        req = urllib2.Request(self.website)
        open_url = urllib2.urlopen(req, self.data)
        if self.mode == "Multiple":
            time.sleep(0.001)
        #read HTML data
        HTMLData = open_url.read()
        print "OK"

if __name__ == "__main__":
    current_post = Post("http://forum.xda-developers.com/login.php",
                        "vb_login_username=test&vb_login_password&securitytoken=guest&do=login",
                        "Simple")
    #save the time before posting data
    origin_time = time.time()
    if current_post.mode == "Multiple":
        #multithreaded POST
        for i in range(0, 10):
            thread = threading.Thread(target=current_post.post)
            thread.start()
            thread.join()
        #calculate the time interval
        time_interval = time.time() - origin_time
        print time_interval
    if current_post.mode == "Simple":
        #simple POST
        for i in range(0, 10):
            current_post.post()
        #calculate the time interval
        time_interval = time.time() - origin_time
        print time_interval

As you can see, this is very simple code. First I set the mode to "Simple" and got a time interval of 50 seconds (maybe my connection is a little slow :( ). Then I set the mode to "Multiple" and got a time interval of 35 seconds, so multithreading really does increase the speed, but the result is not as good as I imagined. I want to get a much higher speed.

From debugging, I found that the program mostly blocks on the line open_url = urllib2.urlopen(req, self.data); this line takes a long time sending data to and receiving data from the specified website. Maybe I could get faster speed by adding time.sleep() calls and multithreading inside the urlopen function itself, but I cannot do that, since it is a library function.

Setting aside any rate limits the server might impose on how fast I can POST, what else can I do to get higher speed? Is there any other code I could change? Thanks a lot!

+6
4 answers

In many cases, Python threads do not improve execution speed very well... sometimes they even make it worse. For more information, see David Beazley's PyCon 2010 presentation on the Global Interpreter Lock (GIL). The presentation is very informative, and I highly recommend it to anyone working with threads...

You should use the multiprocessing module instead. I have included it as an option in your code (see the bottom of my answer).

Running this on one of my old machines (Python 2.6.6):

current_post.mode == "Process"  (multiprocessing)   --> 0.2609 seconds
current_post.mode == "Multiple" (threading)         --> 0.3947 seconds
current_post.mode == "Simple"   (serial execution)  --> 1.650 seconds

I agree with TokenMacGuy's comment, and the numbers above reflect moving the .join() calls into a separate loop. As you can see, Python multiprocessing is significantly faster than threading.


from multiprocessing import Process
import threading
import time
import urllib
import urllib2

class Post:
    def __init__(self, website, data, mode):
        self.website = website
        self.data = data
        #mode is "Simple" (simple POST), "Multiple" (multi-threaded POST)
        #or "Process" (multiprocessing POST)
        self.mode = mode

    def post(self):
        #post data
        req = urllib2.Request(self.website)
        open_url = urllib2.urlopen(req, self.data)
        if self.mode == "Multiple":
            time.sleep(0.001)
        #read HTML data
        HTMLData = open_url.read()
        print "OK"

if __name__ == "__main__":
    current_post = Post("http://forum.xda-developers.com/login.php",
                        "vb_login_username=test&vb_login_password&securitytoken=guest&do=login",
                        "Process")
    #save the time before posting data
    origin_time = time.time()
    if current_post.mode == "Multiple":
        #multithreaded POST
        threads = list()
        for i in range(0, 10):
            thread = threading.Thread(target=current_post.post)
            thread.start()
            threads.append(thread)
        for thread in threads:
            thread.join()
        #calculate the time interval
        time_interval = time.time() - origin_time
        print time_interval
    if current_post.mode == "Process":
        #multiprocessing POST
        processes = list()
        for i in range(0, 10):
            process = Process(target=current_post.post)
            process.start()
            processes.append(process)
        for process in processes:
            process.join()
        #calculate the time interval
        time_interval = time.time() - origin_time
        print time_interval
    if current_post.mode == "Simple":
        #simple POST
        for i in range(0, 10):
            current_post.post()
        #calculate the time interval
        time_interval = time.time() - origin_time
        print time_interval
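If you would rather not manage the Process objects by hand, multiprocessing.Pool can do the bookkeeping for you. A minimal sketch of that variant (note that in Python 2, Pool.map needs a picklable module-level function, so the POST logic is pulled out of the class here; the do_post name is just for illustration):

from multiprocessing import Pool
import urllib2

WEBSITE = "http://forum.xda-developers.com/login.php"
DATA = "vb_login_username=test&vb_login_password&securitytoken=guest&do=login"

def do_post(i):
    #module-level worker so that Pool can pickle it
    req = urllib2.Request(WEBSITE)
    open_url = urllib2.urlopen(req, DATA)
    open_url.read()
    print "OK"

if __name__ == "__main__":
    pool = Pool(processes=10)     #10 worker processes
    pool.map(do_post, range(10))  #distribute the 10 POSTs across the pool
    pool.close()
    pool.join()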
+5

The biggest thing you are doing wrong, and the one hurting your throughput the most, is the way you call thread.start() and thread.join():

for i in range(0, 10):
    thread = threading.Thread(target=current_post.post)
    thread.start()
    thread.join()

Each time through the loop, you create a thread, start it, and then wait for it to finish before moving on to the next thread. You aren't doing anything concurrently at all!

Instead, you should:

threads = []

# start all of the threads
for i in range(0, 10):
    thread = threading.Thread(target=current_post.post)
    thread.start()
    threads.append(thread)

# now wait for them all to finish
for thread in threads:
    thread.join()
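This start-everything-then-join-everything pattern is also exactly what a thread pool gives you for free. A minimal sketch using the standard multiprocessing.dummy module, which provides the same Pool API backed by threads instead of processes (the pool size of 10 simply matches the loop above):

from multiprocessing.dummy import Pool as ThreadPool  # thread-backed Pool

pool = ThreadPool(10)  # 10 worker threads
# run all 10 POSTs concurrently; map returns once every call has finished
pool.map(lambda i: current_post.post(), range(10))
pool.close()
pool.join()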
+7

A DNS lookup takes time, and there is nothing you can do about that. Large latencies are the reason to use multiple threads in the first place - multiple site lookups and GET/POST requests can then happen in parallel.

Drop the time.sleep() call - it does not help.

0

Keep in mind that the only case in which multithreading can "speed things up" in Python is when you have operations like this one that are heavily I/O bound. Otherwise, multithreading does not increase "speed", since it cannot run on more than one processor (and no, even if you have multiple cores, Python does not work that way). You should use multithreading when you want two things to happen concurrently, not when you want two things to run in parallel (i.e., as two processes executing separately).

Now, what you are doing will not actually speed up any single DNS lookup, but it will let you fire off several requests while awaiting the results of others. You do have to be careful how many you make at once, though, or you will just make the response times worse than they already are.
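One simple way to cap how many requests are in flight at once is a semaphore; a minimal sketch (the cap of 5 here is arbitrary, purely for illustration):

import threading

max_in_flight = threading.BoundedSemaphore(5)  # allow at most 5 concurrent POSTs

def limited_post():
    with max_in_flight:  # blocks while 5 POSTs are already running
        current_post.post()

threads = [threading.Thread(target=limited_post) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()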

Also, stop using urllib2 and use Requests instead: http://docs.python-requests.org
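With Requests, the whole POST from the question collapses to a few lines. A minimal sketch (the payload dict below simply mirrors the form fields from the question's query string):

import requests

payload = {
    "vb_login_username": "test",
    "vb_login_password": "",
    "securitytoken": "guest",
    "do": "login",
}
# requests encodes the form data and handles the connection for us
response = requests.post("http://forum.xda-developers.com/login.php", data=payload)
print response.status_code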

0

Source: https://habr.com/ru/post/913206/

