R foreach: from one machine to a cluster

The following (simplified) script works fine on the main node of a unix cluster (4 virtual cores).

```r
library(foreach)
library(doParallel)

nc = detectCores()
cl = makeCluster(nc)
registerDoParallel(cl)

foreach(i = 1:nrow(data_frame_1),
        .packages = c("package_1", "package_2"),
        .export   = c("variable_1", "variable_2")) %dopar% {
  row_temp = data_frame_1[i, ]
  function_1(argument_1 = row_temp, argument_2 = variable_1, argument_3 = variable_2)
}

stopCluster(cl)
```

I would like to use 16 nodes of a cluster (16 * 4 virtual cores in total).

I think all I need to do is change the parallel backend specified by makeCluster. But how do I do this? The documentation is not very clear.

Based on this rather old (2013) post, http://www.r-bloggers.com/the-wonders-of-foreach/, it seems that I should change the default cluster type (sock or MPI; which one will work on unix?).

EDIT

From this vignette by the foreach authors:

By default, doParallel uses multicore functionality on Unix-like systems and snow functionality on Windows. Note that the multicore functionality only runs tasks on a single computer, not a cluster of computers. However, you can use the snow functionality to execute on a cluster, using Unix-like operating systems, Windows, or even a combination.

What does "you can use the snow functionality" mean? How can I do it?

2 answers

The parallel package is a merger of the multicore and snow packages, and if you want to run on multiple nodes you must use the snow-derived part of parallel. In practice, that means calling makeCluster with the "type" argument set to "PSOCK", "SOCK", "MPI", or "NWS", since those are the only cluster types supported by the current version of parallel that can span multiple nodes. If you are using a cluster managed by knowledgeable HPC system administrators, you should use "MPI"; otherwise it may be easier to use "PSOCK" (or "SOCK" if you have a specific reason to use the snow package).
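A minimal runnable sketch of the "PSOCK" approach (using "localhost" so it works on one machine; on a real cluster the host vector would list your actual node names, which are site-specific):

```r
library(parallel)
library(doParallel)
library(foreach)

# On a real cluster this would be e.g. rep(c("node01", "node02", ...), each = 4);
# "localhost" is used here only so the sketch runs on a single machine.
hosts <- rep("localhost", 4)

cl <- makeCluster(hosts, type = "PSOCK")  # workers started via ssh (or directly for localhost)
registerDoParallel(cl)

res <- foreach(i = 1:8, .combine = c) %dopar% i^2

stopCluster(cl)
print(res)
```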

If you choose to create an "MPI" cluster, you should execute your R script via the mpirun command with the "-n 1" option, and set the first argument of makeCluster to the number of workers that should be spawned. (If you do not know what this means, you may not want to use this approach.)
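A sketch of that launch pattern (assuming Rmpi is installed; the script name and worker count are illustrative, and this only runs inside an MPI environment):

```r
# Launch with:  mpirun -n 1 Rscript my_mpi_script.R
# mpirun starts a single R process; makeCluster then spawns the MPI workers.
library(parallel)

cl <- makeCluster(63, type = "MPI")                # 63 workers + 1 master process
print(clusterEvalQ(cl, Sys.info()[["nodename"]]))  # which node each worker landed on
stopCluster(cl)
```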

If you choose to create a "PSOCK" or "SOCK" cluster, the first argument of makeCluster should be a vector of host names, and makeCluster will start the workers on those nodes via the "ssh" command when it executes. This means that you must have ssh daemons running on all of the specified nodes.

I wrote a lot more on this topic elsewhere, but hopefully this helps you get started.


Here's a partial answer that may send you in the right direction.

Based on this rather old (2013) post, http://www.r-bloggers.com/the-wonders-of-foreach/, it seems that I should change the default cluster type (sock or MPI; which one will work on unix?).

fork is a way of spawning background processes on a POSIX system. On a single node with n cores, you can fork n processes and work in parallel. This does not work across multiple machines, because they do not share memory: you need a way to move data between them.

MPI is a portable way to communicate between the nodes of a cluster; an MPI cluster can span multiple nodes.

What does "you can use the snow functionality" mean? How can I do it?

snow is a separate package. To create a 16-node MPI cluster with snow, do cl <- makeCluster(16, type = "MPI"), but you need to launch R in the right environment, as described in Steve Weston's answer and in his answer to a similar question here. (Once you have it running, you may also need to change your loop so that it uses the 4 cores on each node.)
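To get 16 nodes with 4 workers each via "PSOCK" instead, the host vector simply repeats each node name four times (the names below are hypothetical placeholders):

```r
library(parallel)

# Hypothetical host names -- substitute the real node names of your cluster.
# Repeating each name 4 times starts 4 workers per node: 16 * 4 = 64 in total.
hosts <- rep(sprintf("node%02d", 1:16), each = 4)
print(length(hosts))

# cl <- makeCluster(hosts, type = "PSOCK")  # requires passwordless ssh to each node
```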


Source: https://habr.com/ru/post/1247714/

