Always run a constant number of subprocesses in parallel

I want to use subprocesses so that 20 instances of my script run in parallel. Say I have a large list of 100,000 URLs; my program should make sure that 20 instances of my script are always working through this list. My first attempt looked like this:

    urllist = [url1, url2, url3, .. , url100000]
    i = 0
    while number_of_subprocesses < 20 and i < 100000:
        subprocess.Popen(['python', 'script.py', urllist[i]])
        i = i + 1

My script just writes something to a database or a text file. It does not produce any output and needs no input other than the URL.
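(For illustration only, a hypothetical script.py consistent with that description; the filename output.txt and the use of a text file rather than a database are assumptions, not part of the original post:)

    # Hypothetical script.py: takes one URL on the command line and appends
    # a line about it to a text file ('output.txt' is an assumed filename).
    import sys

    url = sys.argv[1]
    with open('output.txt', 'a') as f:
        f.write('processed %s\n' % url)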

My problem is that I could not find a way to get the number of active subprocesses. I am a beginner programmer, so every hint and suggestion is welcome. I am also wondering: once 20 subprocesses are running, how do I make the while loop check its condition again? I was thinking of putting another loop around it, something like:

    while i < 100000:
        while number_of_subprocesses < 20:
            subprocess.Popen(['python', 'script.py', urllist[i]])
            i = i + 1
        if number_of_subprocesses == 20:
            sleep()  # wait some time before checking again

Or is there a way to make the while loop keep checking the number of subprocesses on its own?
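(For reference, a minimal sketch of one way to get such a count, not taken from the original post: keep the Popen objects in a list and call poll() on each one; poll() returns None while a child is still running.)

    # Sketch under that assumption: procs holds the Popen objects already started.
    import subprocess

    procs = []   # append each subprocess.Popen(...) object here as it is started
    number_of_subprocesses = sum(1 for p in procs if p.poll() is None)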

I also considered using the multiprocessing module, but it is much more convenient for me to simply call script.py as a subprocess than to wrap it in a function for multiprocessing.

Maybe someone can help me and point me in the right direction. Thanks, Alo!

3 answers

Taking a different approach from the callback-based answer below, since it seems the callback cannot be passed to the subprocess as a parameter:

    import subprocess
    import time

    # urllist is assumed to be defined elsewhere as the full list of URLs.
    NextURLNo = 0
    MaxProcesses = 20
    MaxUrls = 100000  # Note: this would be better as len(urllist)
    Processes = []

    def StartNew():
        """ Start a new subprocess if there is work to do """
        global NextURLNo
        global Processes

        if NextURLNo < MaxUrls:
            proc = subprocess.Popen(['python', 'script.py', urllist[NextURLNo]])
            print("Started to Process %s" % urllist[NextURLNo])
            NextURLNo += 1
            Processes.append(proc)

    def CheckRunning():
        """ Check any running processes and start new ones if there are spare slots. """
        global Processes
        global NextURLNo

        for p in range(len(Processes) - 1, -1, -1):  # Check the processes in reverse order
            if Processes[p].poll() is not None:      # poll() returns None while the process is still running
                del Processes[p]                     # Remove from list - this is why we needed reverse order

        while (len(Processes) < MaxProcesses) and (NextURLNo < MaxUrls):  # More to do and some spare slots
            StartNew()

    if __name__ == "__main__":
        CheckRunning()                # This will start the max processes running
        while len(Processes) > 0:     # Something still going on
            time.sleep(0.1)           # You may wish to change the time for this
            CheckRunning()
        print("Done!")
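(Side note, not part of the answer: the snippet assumes urllist already exists. If the URLs live one per line in a file — the filename urls.txt is an assumption — it could be built like this, which also lets MaxUrls be len(urllist) as the comment suggests:)

    # Assumed input file 'urls.txt', one URL per line (not specified in the post).
    with open('urls.txt') as f:
        urllist = [line.strip() for line in f if line.strip()]

    MaxUrls = len(urllist)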

Just keep a count of how many you have started, and use a callback from each subprocess to start a new one whenever there are still entries left in the list of URLs to process.

E.g., assuming your subprocess calls the OnExit method passed to it when it completes:

    import subprocess
    import time

    NextURLNo = 0
    MaxProcesses = 20
    NoSubProcess = 0
    MaxUrls = 100000

    def StartNew():
        """ Start a new subprocess if there is work to do """
        global NextURLNo
        global NoSubProcess

        if NextURLNo < MaxUrls:
            # OnExit is passed along here, per the assumption stated above
            subprocess.Popen(['python', 'script.py', urllist[NextURLNo], OnExit])
            print("Started to Process %s" % urllist[NextURLNo])
            NextURLNo += 1
            NoSubProcess += 1

    def OnExit():
        global NoSubProcess
        NoSubProcess -= 1

    if __name__ == "__main__":
        for n in range(MaxProcesses):
            StartNew()
        while NoSubProcess > 0:
            time.sleep(1)
            if NextURLNo < MaxUrls:
                for n in range(NoSubProcess, MaxProcesses):
                    StartNew()
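(The snippet passes OnExit to Popen as if it were a command-line argument, which — as the other answer notes — does not actually invoke it. If callback-on-exit semantics are wanted, one alternative sketch, not part of this answer, is to watch each child from a small thread and call the callback once wait() returns:)

    import subprocess
    import threading

    def start_with_callback(args, on_exit):
        """Launch a child process and call on_exit() once it has finished."""
        proc = subprocess.Popen(args)

        def watcher():
            proc.wait()   # block in this helper thread until the child exits
            on_exit()

        threading.Thread(target=watcher, daemon=True).start()
        return proc

Note that the callback then runs on a different thread, so a shared counter like NoSubProcess would need a lock.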

To maintain a constant number of concurrent requests, you can use a thread pool:

    #!/usr/bin/env python
    from multiprocessing.dummy import Pool  # thread pool with the multiprocessing API

    def process_url(url):
        # ... handle a single url
        pass

    urllist = [url1, url2, url3, .. , url100000]

    for _ in Pool(20).imap_unordered(process_url, urllist):
        pass

To start processes instead of threads, remove .dummy from the import.
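(A minimal sketch of that process-based variant, with assumptions: process_url here is a stand-in that just fetches the page with urllib.request, and the URL list is a placeholder. The function must be defined at module level so it can be pickled, and the if __name__ == "__main__" guard is needed because multiprocessing re-imports the module on some platforms:)

    #!/usr/bin/env python
    from multiprocessing import Pool    # real processes instead of threads
    import urllib.request

    def process_url(url):
        # Stand-in body: fetch the page and ignore the content.
        with urllib.request.urlopen(url) as resp:
            resp.read()

    if __name__ == "__main__":
        urllist = ['http://example.com/%d' % n for n in range(100)]  # placeholder list
        with Pool(20) as pool:
            for _ in pool.imap_unordered(process_url, urllist):
                pass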


Source: https://habr.com/ru/post/1495921/

