Parallel processing from the command line on Linux (bash, python, ruby... whatever)

I have a list/queue of 200 commands that I need to run in a shell on a Linux server.

I want to have only 10 processes running at a time (drawn from the queue). Some processes will take a few seconds to complete, while others will take much longer.

When a process completes, I want the next command to be "popped" off the queue and executed.

Does anyone have any code to solve this problem?

Further clarification:

There are 200 pieces of work to be done, held in a queue. No more than 10 pieces of work should be in progress at once. When a thread finishes a piece of work, it should ask the queue for the next piece. If there is no more work in the queue, the thread should die. Once all the threads have died, all the work has been done.

The actual problem I'm trying to solve is using imapsync to synchronize 200 mailboxes from an old mail server to a new mail server. Some users have large mailboxes that take a long time to synchronize, while others have very small mailboxes that synchronize quickly.

+42
python ruby bash shell parallel-processing
Jan 21 '09 at 2:54
12 answers

I would suggest doing this with make and the make -j xx command.

Perhaps a makefile like this:

 all: usera userb userc ....

 usera:
 	imapsync usera

 userb:
 	imapsync userb
 ....

make -j 10 -f makefile
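
Since nobody wants to type 200 targets by hand, the makefile can be generated from the user list. A minimal sketch in bash, assuming a users.txt file with one username per line (the file names users.txt and sync.mk, and the bare imapsync call, are all assumptions); declaring the targets .PHONY keeps a stray file named after a user from making its target look up to date:

 #!/bin/bash
 # Build a makefile with one target per user, then run: make -j 10 -f sync.mk
 users=$(tr '\n' ' ' < users.txt)
 {
     printf '.PHONY: all %s\n' "$users"
     printf 'all: %s\n\n' "$users"
     while read -r user; do
         # Recipe lines must start with a tab
         printf '%s:\n\timapsync %s\n\n' "$user" "$user"
     done < users.txt
 } > sync.mk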

+38
Jan 21 '09 at 3:58

In the shell, xargs can be used to queue parallel command processing. For example, to always have 3 sleeps running in parallel, each sleeping for 1 second, executing 10 sleeps in total:

 echo {1..10} | xargs -d ' ' -n1 -P3 sh -c 'sleep 1s' _ 

It will sleep for only 4 seconds in total. If you have a list of names and want to pass the names to the commands being executed, again running 3 commands in parallel, do:

 cat names | xargs -n1 -P3 process_name 

This runs the commands process_name alice , process_name bob , and so on.
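
Applied to the question's imapsync job, the same pattern caps the number of simultaneous syncs at 10. A minimal sketch, assuming a userlist file with one mailbox name per line and a hypothetical wrapper script sync_one_user that takes a username as its only argument:

 # Run at most 10 syncs at once; sync_one_user is a hypothetical wrapper
 # script that takes a username and calls imapsync with the right options
 xargs -n1 -P10 sync_one_user < userlist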

+42
Jan 21 '09 at 3:53

GNU Parallel is made specifically for this.

 cat userlist | parallel imapsync 

One of the beauties of Parallel compared to other solutions is that it makes sure output is not mixed together. Running traceroute through Parallel works fine, for example:

 (echo foss.org.my; echo www.debian.org; echo www.freenetproject.org) | parallel traceroute 
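
To cap the number of simultaneous jobs at 10, as the question asks, parallel takes a -j option. A sketch building on the same userlist ({} is replaced by each input line; a real imapsync run would also need host and credential options):

 parallel -j 10 imapsync {} < userlist
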
+24
Jan 27 '10 at 17:00

For this kind of work there is PPSS: Parallel Processing Shell Script. Google for the name and you will find it; I will not link it here.

+13
Mar 10 '09 at 0:31

GNU make (and probably other implementations as well) has the -j argument, which governs how many jobs it will run at once. When a job completes, make will start another one.

+7
Jan 21 '09 at 3:31

Well, if the jobs are largely independent of each other, I would think in terms of:

 Initialize an array of jobs pending (queue, ...) - 200 entries
 Initialize an array of jobs running - empty
 while (jobs still pending and queue of jobs running still has space)
     take a job off the pending queue
     launch it in background
     if (queue of jobs running is full)
         wait for a job to finish
         remove it from the jobs running queue
 while (queue of jobs running is not empty)
     wait for a job to finish
     remove it from the jobs running queue

Note that the tail test in the main loop ensures that the running queue has space whenever the while loop comes around again, which prevents the loop from terminating prematurely. I think the logic is sound.

I can see how to do this in C fairly easily, and it would not be that hard in Perl either (and therefore not too hard in other scripting languages: Python, Ruby, Tcl, etc.). I'm not at all sure I would want to do it in the shell - the wait command in the shell waits for all children to complete, rather than for any one child to complete.
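
For what it's worth, later versions of bash (4.3+) added wait -n, which waits for any single background job rather than all of them, so this logic can be expressed in the shell after all. A minimal sketch, assuming a userlist file with one name per line and a bare imapsync call standing in for the real command:

 #!/bin/bash
 # Keep at most 10 jobs running; start a new one whenever a slot frees up
 max_jobs=10
 while read -r user; do
     while (( $(jobs -rp | wc -l) >= max_jobs )); do
         wait -n   # block until any one background job finishes
     done
     imapsync "$user" &   # placeholder; the real command needs more options
 done < userlist
 wait   # wait for the remaining jobs to finish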

+4
Jan 21 '09 at 3:34

In python you can try:

 import Queue, os, threading

 # synchronised queue
 queue = Queue.Queue(0)   # 0 means no maximum size

 # do stuff to initialise queue with strings
 # representing os commands
 queue.put('sleep 10')
 queue.put('echo Sleeping..')
 # etc
 # or use python to generate commands, e.g.
 # for username in ['joe', 'bob', 'fred']:
 #     queue.put('imapsync %s' % username)

 def go():
     while True:
         try:
             # False here means no blocking: raise exception if queue empty
             command = queue.get(False)
             # Run command. python also has subprocess module which is more
             # featureful but I am not very familiar with it.
             # os.system is easy :-)
             os.system(command)
         except Queue.Empty:
             return

 for i in range(10):   # change this to run more/fewer threads
     threading.Thread(target=go).start()

Not verified...

(Of course, Python itself is effectively single-threaded because of the GIL. You still get the benefit of multiple threads here, though, since the threads spend their time waiting on I/O.)

+3
Jan 21 '09 at 3:59

If you intend to use Python, I recommend using Twisted.

In particular, Twisted Runner.

+2
Jan 21 '09 at 2:59

https://savannah.gnu.org/projects/parallel (GNU parallel) and pssh can help.

+2
Aug 29 '11 at 22:30

The Python multiprocessing module seems well suited to your problem. It is a high-level package that provides thread-like parallelism using processes.

+1
Jan 21 '09 at 4:10

A simple function in zsh to parallelize jobs, with up to 4 subshells, using lock files in /tmp.

The only non-trivial part is the glob flags in the first test:

  • #q : enables filename globbing inside the test
  • [4] : returns only the 4th result
  • N : ignore errors when the result is empty

This should be easy to convert to POSIX shell, although it would be a bit more verbose.

Remember to escape quotation marks in the jobs with \" .

 #!/bin/zsh
 setopt extendedglob

 para() {
     lock=/tmp/para_$$_$((paracnt++))
     # sleep as long as the 4th lock file exists
     until [[ -z /tmp/para_$$_*(#q[4]N) ]] { sleep 0.1 }
     # Launch the job in a subshell
     ( touch $lock ; eval $* ; rm $lock ) &
     # Wait for subshell start and lock creation
     until [[ -f $lock ]] { sleep 0.001 }
 }

 para "print A0; sleep 1; print Z0"
 para "print A1; sleep 2; print Z1"
 para "print A2; sleep 3; print Z2"
 para "print A3; sleep 4; print Z3"
 para "print A4; sleep 3; print Z4"
 para "print A5; sleep 2; print Z5"

 # wait for all subshells to terminate
 wait
0
Sep 13 '17 at 16:36

Can you expand on what you mean by parallel? It sounds like you need to implement some kind of locking on the queue so that your entries are not picked twice, and so that each command is run only once.

Most queue systems cheat - they just write a giant to-do list, then pick, say, ten items, work on them, and then pick the next ten items. There is no parallelization.

If you provide more details, I am sure that we can help you.

-2
Jan 21 '09 at 3:08
