First of all, the passive RMA mechanism does not somehow magically poke into the memory of the remote process, since not many MPI transports have real RDMA capabilities, and even those that do (e.g. InfiniBand) require a lot of not-so-passive involvement of the target in order for passive RMA operations to happen. This is explained in the MPI standard, but in the very abstract form of public and private copies of the memory exposed through an RMA window.
Achieving working and portable passive RMA with MPI-2 involves several steps.
Step 1: Window allocation at the target process
For portability and performance reasons, the window memory should be allocated using MPI_ALLOC_MEM:
int size;
MPI_Comm_size(MPI_COMM_WORLD, &size);

// Allocate the window memory through MPI rather than malloc()
int *schedule;
MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &schedule);
for (int i = 0; i < size; i++)
    schedule[i] = 0;

// Expose the memory in a window with a displacement unit of sizeof(int)
MPI_Win win;
MPI_Win_create(schedule, size * sizeof(int), sizeof(int), MPI_INFO_NULL,
               MPI_COMM_WORLD, &win);
...
MPI_Win_free(&win);
MPI_Free_mem(schedule);
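As an aside, not part of the MPI-2 recipe here: if an MPI-3 library is available, MPI_WIN_ALLOCATE combines the allocation and the window creation in a single call and lets the implementation pick the most suitable memory itself. A minimal sketch:

// MPI-3 only: allocate the memory and create the window in one call
int *schedule;
MPI_Win win;
MPI_Win_allocate(size * sizeof(int), sizeof(int), MPI_INFO_NULL,
                 MPI_COMM_WORLD, &schedule, &win);
...
MPI_Win_free(&win);   // also releases the memory allocated by MPI_Win_allocate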
Step 2: Memory synchronization at the target
The MPI standard forbids concurrent conflicting accesses to the same location in a window (§11.3 of the MPI-2.2 specification):
It is erroneous to have concurrent conflicting accesses to the same memory location in a window; if a location is updated by a put or accumulate operation, then this location cannot be accessed by a load or another RMA operation until the updating operation has completed at the target.
Therefore, each access to schedule[] at the target has to be protected by a lock (shared, since it only reads the memory location):
while (!ready)
{
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
    ready = fnz(schedule, oldschedule, size);
    MPI_Win_unlock(0, win);
}
Another reason for locking the window at the target is to provide entries into the MPI library and thus to facilitate progression of the local part of the RMA operation. MPI provides portable RMA even on transports that do not support RDMA, e.g. TCP/IP or shared memory, and this requires a lot of active work at the target (called progression) in order to support “passive” RMA. Some libraries provide asynchronous progression threads that can progress the operation in the background, e.g. Open MPI when configured with --enable-opal-multi-threads (disabled by default), but relying on such behaviour results in non-portable programs. That is why the MPI standard allows the following relaxed semantics of the put operation (§11.7, p. 365):
6. An update by a put or accumulate call to a public window copy becomes visible in the private copy in process memory at latest when an ensuing call to MPI_WIN_WAIT, MPI_WIN_FENCE, or MPI_WIN_LOCK is executed on that window by the window owner.
If a put or accumulate access was synchronized with a lock, then the update of the public window copy is complete as soon as the updating process executed MPI_WIN_UNLOCK. On the other hand, the update of the private copy in the process memory may be delayed until the target process executes a synchronization call on that window (6). Thus, updates to process memory can always be delayed until the process executes a suitable synchronization call. Updates to a public window copy can also be delayed until the window owner executes a synchronization call, if fences or post-start-complete-wait synchronization is used. Only when lock synchronization is used does it become necessary to update the public window copy, even if the window owner does not execute any related synchronization call.
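In practical terms this means that, without a synchronization call by the window owner, the private copy might never be updated. A polling loop like the following sketch, reusing the fnz() helper from the complete example further below, may therefore spin forever:

// Erroneous: no lock/unlock around the accesses, hence the private copy of
// schedule[] in the master's memory might never reflect the workers' puts
int ready = 0;
while (!ready)
    ready = fnz(schedule, oldschedule, size);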
This behaviour is also illustrated by Example 11.12 in the same section of the standard (p. 367). And indeed, both Open MPI and Intel MPI stop updating the value of schedule[] if the lock/unlock calls in the master's code are commented out. The MPI standard further advises (§11.7, p. 366):
Advice to users. A user can write correct programs by following these rules:
...
lock: Updates to the window are protected by exclusive locks if they may conflict. Nonconflicting accesses (such as read-only accesses or accumulate accesses) are protected by shared locks, both for local accesses and for RMA accesses.
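Applied to the program at hand (my reading of the rule, not text from the standard): the workers' puts may conflict with the master's loads of the same locations, so the puts take exclusive locks, while the master's read-only polling takes a shared lock:

// Origin (worker): the put may conflict with the master's loads,
// therefore an exclusive lock is taken
MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
MPI_Win_unlock(0, win);

// Target (master): read-only access, a shared lock suffices
MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
ready = fnz(schedule, oldschedule, size);
MPI_Win_unlock(0, win);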
Step 3: Providing correct parameters to MPI_PUT at the origin

A call like:
MPI_Put(&schedule[myrank], 1, MPI_INT, 0, 0, 1, MPI_INT, win);
would transfer everything into the first element of the target window. The correct invocation, given that the window at the target was created with disp_unit == sizeof(int), is:
int one = 1;
MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
The local value of one is thus transferred rank * sizeof(int) bytes past the beginning of the window at the target. If disp_unit had been set to 1, the correct put would be:
MPI_Put(&one, 1, MPI_INT, 0, rank * sizeof(int), 1, MPI_INT, win);
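The latter variant corresponds to a window created at the target with a displacement unit of one byte, e.g.:

// disp_unit == 1: target displacements are given in bytes
MPI_Win_create(schedule, size * sizeof(int), 1, MPI_INFO_NULL,
               MPI_COMM_WORLD, &win);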
Step 4: Dealing with implementation specifics
The detailed program above works out of the box with Intel MPI. With Open MPI, special care must be taken. The library is built around a set of frameworks and implementing modules. The osc (one-sided communication) framework comes in two implementations: rdma and pt2pt. The default (in Open MPI 1.6.x and probably earlier) is rdma, and for some reason it does not progress RMA operations at the target side when MPI_WIN_(UN)LOCK calls are made, which leads to deadlock-like behaviour unless another communication call is made (MPI_BARRIER in your case). The pt2pt module, on the other hand, progresses all operations as expected. Therefore, with Open MPI the program has to be launched as follows in order to specifically select the pt2pt component:
$ mpiexec --mca osc pt2pt ...
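The same selection can also be made through the environment, using Open MPI's usual OMPI_MCA_ variable convention:

$ export OMPI_MCA_osc=pt2pt
$ mpiexec -n 6 rma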
A complete working C99 example follows:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

// Compares schedule and oldschedule and prints schedule if different
// Also displays the time in seconds since the first invocation
int fnz (int *schedule, int *oldschedule, int size)
{
    static double starttime = -1.0;
    int diff = 0;

    for (int i = 0; i < size; i++)
        diff |= (schedule[i] != oldschedule[i]);

    if (diff)
    {
        int res = 0;

        if (starttime < 0.0) starttime = MPI_Wtime();
        printf("[%6.3f] Schedule:", MPI_Wtime() - starttime);
        for (int i = 0; i < size; i++)
        {
            printf("\t%d", schedule[i]);
            res += schedule[i];
            oldschedule[i] = schedule[i];
        }
        printf("\n");

        return(res == size-1);
    }
    return 0;
}

int main (int argc, char **argv)
{
    MPI_Win win;
    int rank, size;

    MPI_Init(&argc, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
        int *oldschedule = malloc(size * sizeof(int));
        // Use MPI to allocate memory for the target window
        int *schedule;
        MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &schedule);

        for (int i = 0; i < size; i++)
        {
            schedule[i] = 0;
            oldschedule[i] = -1;
        }

        // Create a window. Set the displacement unit to sizeof(int) to simplify
        // the addressing at the originator processes
        MPI_Win_create(schedule, size * sizeof(int), sizeof(int), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);

        int ready = 0;
        while (!ready)
        {
            // Without the lock/unlock schedule stays forever filled with 0s
            MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
            ready = fnz(schedule, oldschedule, size);
            MPI_Win_unlock(0, win);
        }
        printf("All workers checked in using RMA\n");

        // Release the window
        MPI_Win_free(&win);
        // Free the allocated memory
        MPI_Free_mem(schedule);
        free(oldschedule);
        printf("Master done\n");
    }
    else
    {
        int one = 1;

        // Worker processes do not expose memory in the window
        MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        // Simulate some work based on the rank
        sleep(2*rank);

        // Register with the master
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
        MPI_Win_unlock(0, win);

        printf("Worker %d finished RMA\n", rank);

        // Release the window
        MPI_Win_free(&win);

        printf("Worker %d done\n", rank);
    }

    MPI_Finalize();
    return 0;
}
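Assuming the usual mpicc compiler wrapper and a source file named rma.c (the file name is only for illustration), the example can be built with:

$ mpicc -std=c99 -o rma rma.c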
Example output with 6 processes:
$ mpiexec --mca osc pt2pt -n 6 rma

[ 0.000] Schedule:   0   0   0   0   0   0
[ 1.995] Schedule:   0   1   0   0   0   0
Worker 1 finished RMA
[ 3.989] Schedule:   0   1   1   0   0   0
Worker 2 finished RMA
[ 5.988] Schedule:   0   1   1   1   0   0
Worker 3 finished RMA
[ 7.995] Schedule:   0   1   1   1   1   0
Worker 4 finished RMA
[ 9.988] Schedule:   0   1   1   1   1   1
All workers checked in using RMA
Worker 5 finished RMA
Worker 5 done
Worker 4 done
Worker 2 done
Worker 1 done
Worker 3 done
Master done