Using the multiprocessing module for cluster computing

I am interested in running a Python program on a computer cluster. In the past, I have used Python MPI interfaces, but because of the difficulty of compiling and installing them, I would prefer solutions that use built-in modules, such as Python's multiprocessing.

What I really would like to do is just create an instance of multiprocessing.Pool that spans the entire cluster of computers and call Pool.map(...). Is it possible / easy to do something like this?

If this is not possible, I would at least like to launch Process instances on any of the nodes from a central script, with different parameters for each node.

+48
python parallel-processing multiprocessing
4 answers

If by cluster computing you mean distributed-memory systems (multiple nodes rather than SMPs), then Python's multiprocessing may not be the right choice. It can spawn multiple processes, but they will all remain confined to the same node.

You will need an infrastructure that launches processes on several nodes and provides a mechanism for communication between them (pretty much what MPI does).

See the Parallel Processing page on the Python wiki for a list of frameworks that will help with cluster computing.

From the list, pp, jug, pyro and celery look like reasonable options, although I cannot personally vouch for any of them, since I have no experience with them (I mainly use MPI).

If ease of installation / use is important, I would start by exploring jug. It is easy to install, supports common batch-queuing cluster systems, and looks well documented.

+41

I have used Pyro to do this quite successfully. If you enable mobile code, it will automatically send over the wire any required modules that the nodes do not yet have. Pretty elegant.

+13

I have had good luck using SCOOP as an alternative to multiprocessing. It works in single- or multi-computer mode, supports submitting jobs to clusters, and offers many other features such as nested maps, all with minimal code changes to your map() calls.

The source is available on GitHub. A quick example shows just how simple the implementation can be!

+1
May 25 '18 at 16:33

If you are willing to install an open source package, you should consider Ray, which of the Python cluster frameworks is probably the closest in feel to ordinary single-threaded Python. It lets you parallelize both functions (as tasks) and stateful classes (as actors), and it automatically handles data shipping and serialization as well as exception propagation. It also allows much of the flexibility of regular Python (actors can be passed around, tasks can invoke other tasks, there can be arbitrary data dependencies, etc.). Read more about this in the documentation.

As an example, here is how you would do your multiprocessing map example in Ray:

 import ray

 ray.init()

 @ray.remote
 def mapping_function(input):
     return input + 1

 results = ray.get([mapping_function.remote(i) for i in range(100)])

The API is slightly different from the API of Python's multiprocessing, but it should be easy to use.

You can install Ray with "pip install ray" and then run the above code on a single node, or it is also easy to set up a cluster; see the cloud support and cluster support documentation.

Disclaimer: I am one of the developers of Ray.

0
Feb 06 '19 at 22:30


