Suggestions for distributing python data / code over work nodes?

Question


I'm starting to venture into distributed code, and with all the options out there it's hard for me to figure out which solution fits my needs. Basically, I have a list of Python data that I need to process with a single function. This function has several nested loops, but it doesn't take too long (about a minute) per item. My problem is that the list is very large (3000+ items). I'm considering multiprocessing, but I think I want to experiment with multi-server processing, because ideally, as the data grows, I want to be able to add more servers during the job to speed things up.

I'm basically looking for something I can distribute this list of data through (and, although it isn't essential, it would be nice if I could distribute my code base through it as well).

So my question is: which package can I use to achieve this? My database is HBase, so I already have Hadoop running (I've never used Hadoop itself, only as the backing for the database). I looked at Celery and Twisted, but I'm confused about which would fit my needs.

Any suggestions?

+6

python twisted hadoop celery distributed

Lostsoul Feb 16 '12 at 20:54


2 answers

Check out RabbitMQ. Python bindings are available through pika. Start with a simple work queue, then try some RPC calls.

Experimenting with distributed computing in Python through an external engine like RabbitMQ may seem like extra trouble (there is a small learning curve to install and configure the broker), but you may find it even more useful later.

... and Celery can work hand in hand with RabbitMQ; check out Robert Pogorzelski's tutorial and "Simple distributed tasks with Celery and RabbitMQ".
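As a rough sketch of the work-queue idea with pika (not taken from the tutorials mentioned above), a producer can push each list item onto a durable queue and every server can run a worker that pulls one item at a time. The queue name `work_queue`, the JSON encoding, and the function names are assumptions made for illustration; the calls follow the pika 1.x API and need a RabbitMQ broker reachable at `host`.

```python
import json

QUEUE_NAME = "work_queue"  # hypothetical queue name, chosen for this sketch


def encode_item(item):
    """Serialize one work item to bytes for the message body."""
    return json.dumps(item).encode("utf-8")


def decode_item(body):
    """Inverse of encode_item."""
    return json.loads(body.decode("utf-8"))


def publish_items(items, host="localhost"):
    """Producer: put every item on a durable queue (requires a running broker)."""
    import pika  # imported lazily so the helpers above work without pika installed

    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE_NAME, durable=True)
    for item in items:
        channel.basic_publish(
            exchange="",
            routing_key=QUEUE_NAME,
            body=encode_item(item),
            properties=pika.BasicProperties(delivery_mode=2),  # persistent message
        )
    connection.close()


def run_worker(process, host="localhost"):
    """Consumer: run one of these per server; blocks and handles items one by one."""
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()
    channel.queue_declare(queue=QUEUE_NAME, durable=True)
    channel.basic_qos(prefetch_count=1)  # do not hand a busy worker more work

    def on_message(ch, method, properties, body):
        process(decode_item(body))
        ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only after processing

    channel.basic_consume(queue=QUEUE_NAME, on_message_callback=on_message)
    channel.start_consuming()
```

Starting `run_worker` on an additional machine while the job is running is what would let new servers join mid-job: the broker simply hands unprocessed items to whichever worker is free.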

+2

user237419 Feb 16 '12 at 21:05


jterrace · Accepted Answer · 2012-02-16T21:05:52+0000

I would highly recommend Celery. You can define a task that works on one element of your list:

    from celery.task import task

    @task
    def process(i):
        # do something with i
        i += 1
        # return a result
        return i

You can easily parallelize the list as follows:

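    results = []
    todo = [1, 2, 3, 4, 5]
    for arg in todo:
        # args must be a tuple, hence the trailing comma
        res = process.apply_async(args=(arg,))
        results.append(res)

    # block until every task has finished and collect the return values
    all_results = [res.get() for res in results]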

It scales easily by simply adding more celery workers.
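To try the submit-then-collect pattern above without a broker or workers, here is a rough local analogue using only the standard library. It is a stand-in, not Celery: `executor.submit` plays the role of `apply_async` and `future.result()` the role of `res.get()`. The `run_all` name and `max_workers` value are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def process(i):
    # same toy task as in the answer: do something with i, return a result
    return i + 1


def run_all(todo, max_workers=4):
    """Submit every item first, then collect results in submission order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(process, arg) for arg in todo]  # like apply_async
        return [f.result() for f in futures]  # like res.get()


if __name__ == "__main__":
    print(run_all([1, 2, 3, 4, 5]))
```

Threads only show the shape of the pattern. For CPU-bound work like the nested loops in the question, `ProcessPoolExecutor` offers the same interface on a single machine, while Celery workers are what extend it across servers.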

