MPI with C: passive RMA synchronization

Since I still have not found an answer to my question and am on the verge of going crazy about the problem, I will just ask the question that torments my mind ;-)

I am working on parallelizing an elimination algorithm that I have already programmed. The target environment is a cluster.

In my parallel program, I distinguish between a master process (rank 0 in my case) and working slaves (every rank except 0). My idea is that the master keeps track of which slaves are available and sends them work accordingly. Therefore, and for some other reasons, I am trying to set up a workflow based on passive RMA with lock-unlock sequences. I use an integer array called schedule in which, for each rank, the corresponding position in the array holds either 0 for a busy process or 1 for an available process (so if schedule[1]=1, process 1 is available for work). When a process is done with its work, it puts a 1 into the array on the master, signaling its availability. The code I tried for this is as follows:

    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win); // an exclusive lock is acquired on process 0
    printf("Process %d:\t exclusive lock on process 0 started\n", myrank);
    MPI_Put(&schedule[myrank], 1, MPI_INT, 0, 0, 1, MPI_INT, win); // the myrank entry of schedule is put into process 0
    printf("Process %d:\t put operation called\n", myrank);
    MPI_Win_unlock(0, win); // the window is unlocked

This worked perfectly, at least as long as the master process was synchronized with a barrier until the end of the locking, because then the master's output happened after the put operation.

As a next step, I tried to let the master check regularly whether any slaves are available or not. So I created a while loop that repeats until every process has signaled its availability (I repeat, this is a program that teaches me the principles; I know the implementation still does not do what I want). The loop, in its basic version, just prints out my array schedule and then checks with the function fnz whether there are any working slaves other than the master:

    while (j != 1) {
        printf("Process %d:\t following schedule evaluated:\n", myrank);
        for (i = 0; i < size; i++)
            printf("%d\t", schedule[i]); // print the schedule
        printf("\n");
        j = fnz(schedule);
    }

And then the concept blew up. After inverting the process and getting the required information with a get by the master from the slaves, instead of putting it with a put by the slaves onto the master, I found out that my main problem is acquiring the lock: the unlock command does not succeed, because in the case of a put the lock is not granted at all, and in the case of a get the lock is only granted once the slave process is done with its work and is waiting in the barrier. In my opinion there must be a serious error in my thinking. It cannot be the concept of passive RMA that the lock is only granted when the target process is in a barrier synchronizing the whole communicator. Then I could just go along with standard Send/Recv operations. What I want to achieve is that process 0 works all the time at delegating work and, via RMA on the slaves, is able to identify to whom it can delegate next. Can someone help me and explain how I can get a break in process 0 so that the other processes can obtain locks?

Thank you in advance!

UPDATE: I am not sure whether you have ever worked with a lock, and I just want to stress that I am not able to get an updated copy of the remote memory window. If I get the availability from the slaves, the lock is only granted when the slaves are waiting in the barrier. So what I got to work is that process 0 performs lock-get-unlock while processes 1 and 2 simulate work, such that process 2 is busy noticeably longer than process 1. What I expect as a result is that process 0 prints the schedule (0,1,0), because process 0 is not polled at all, process 1 is done with its work, and process 2 is still working. In the next step, when process 2 is done, I expect the output (0,1,1), since both slaves are then ready for new work. What I get instead is that the slaves only grant the lock to process 0 when they are waiting in the barrier, so the first and only output I get at all is the last one I expect, showing me that the lock was granted for each process only once it was done with its work. So if someone can tell me when the lock can be granted by the target process, rather than second-guessing my understanding of passive RMA, I would be very grateful.
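
For reference, a minimal sketch of the master-side lock-get-unlock polling described above (a reconstruction under assumptions, not the original code; it assumes each slave rank exposes a single int availability flag at offset 0 of its own window, and that size, schedule and win are set up accordingly):

    // Master polls each worker's exposed availability flag via passive RMA.
    // Assumes every rank r > 0 created a window exposing one int at offset 0.
    for (int r = 1; r < size; r++) {
        MPI_Win_lock(MPI_LOCK_SHARED, r, 0, win);                 // lock the window at rank r
        MPI_Get(&schedule[r], 1, MPI_INT, r, 0, 1, MPI_INT, win); // fetch its flag
        MPI_Win_unlock(r, win);                                   // complete the get
    }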

2 answers

First of all, the passive RMA mechanism does not somehow magically poke into the memory of the remote process, since not many MPI transports have real RDMA capabilities, and even those that do (for example, InfiniBand) require a great deal of not-so-passive involvement of the target to ensure that passive RMA operations can happen. This is explained in the MPI standard, but in the very abstract form of public and private copies of the memory exposed through an RMA window.

Achieving a working and portable passive RMA with MPI-2 involves several steps.

Step 1: Window allocation at the target

To ensure portability and performance, the memory for the window should be allocated with MPI_ALLOC_MEM:

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *schedule;
    MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &schedule);

    for (int i = 0; i < size; i++)
        schedule[i] = 0;

    MPI_Win win;
    MPI_Win_create(schedule, size * sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    ...

    MPI_Win_free(&win);
    MPI_Free_mem(schedule);
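
As an aside not found in the original answer: if an MPI-3 library is available, allocation and window creation can be combined into a single call, which also lets the library pick memory that is optimal for RMA. A minimal sketch:

    int *schedule;
    MPI_Win win;
    // MPI-3: allocate the memory and expose it in a window in one call
    MPI_Win_allocate(size * sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &schedule, &win);
    ...
    MPI_Win_free(&win); // also frees the memory allocated by MPI_Win_allocate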

Step 2: Memory synchronization at the target

The MPI standard forbids concurrent access to the same location in a window (§11.3 of the MPI-2.2 specification):

It is erroneous to have concurrent conflicting accesses to the same memory location in a window; if a location is updated by a put or accumulate operation, then this location cannot be accessed by a load or another RMA operation until the updating operation has completed at the target.

Therefore, each access to schedule[] at the target has to be protected by a lock (shared, since it only reads the memory location):

    while (!ready) {
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        ready = fnz(schedule, oldschedule, size);
        MPI_Win_unlock(0, win);
    }

Another reason for locking the window at the target is to provide entries into the MPI library and thus facilitate progression of the local part of the RMA operation. MPI provides portable RMA even over transports without RDMA capabilities, e.g. TCP/IP or shared memory, and this requires a lot of active work (called progression) at the target in order to support "passive" RMA. Some libraries provide asynchronous progression threads that can progress the operation in the background, e.g. Open MPI when configured with --enable-opal-multi-threads (disabled by default), but relying on such behaviour results in non-portable programs. That is why the MPI standard allows the following relaxed semantics of the put operation (§11.7, p. 365):
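
For illustration only (the flag is the one named above; everything else about a real build, such as the install prefix and other options, is installation-specific), enabling that asynchronous progression at Open MPI build time would look roughly like:

    $ ./configure --enable-opal-multi-threads [other options...]
    $ make && make install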

6. An update by a put or accumulate call to a public window copy becomes visible in the private copy in process memory at latest when a subsequent call to MPI_WIN_WAIT, MPI_WIN_FENCE, or MPI_WIN_LOCK is executed on that window by the window owner.

If an update by a put or accumulate call to a public window copy is synchronized with a lock, then the update of the public window copy is complete as soon as the updating process executed MPI_WIN_UNLOCK. On the other hand, the update of the private copy in the process memory may be delayed until the target process executes a synchronization call on that window (6). Thus, updates to process memory can always be delayed until the process executes a suitable synchronization call. Updates to a public window copy can also be delayed until the window owner executes a synchronization call, if fences or post-start-complete-wait synchronization is used. Only when lock synchronization is used does it become necessary to update the public window copy, even if the window owner does not execute any related synchronization call.

This is also illustrated in Example 11.12 in the same section of the standard (p. 367). And indeed, both Open MPI and Intel MPI do not update the value of schedule[] if the lock/unlock calls in the code of the master are commented out. The MPI standard further advises (§11.7, p. 366):

Advice to users. A user can write correct programs by following these rules:

...

lock: Updates to the window are protected by exclusive locks if they may conflict. Nonconflicting accesses (such as read-only accesses or accumulate accesses) are protected by shared locks, both for local accesses and for RMA accesses.

Step 3: Supplying the correct MPI_Put parameters at the origin

MPI_Put(&schedule[myrank],1,MPI_INT,0,0,1,MPI_INT,win); would transfer everything into the first element of the target window. The correct call, provided that the window at the target was created with disp_unit == sizeof(int), is:

    int one = 1;
    MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);

The local value of one is thus transferred into the location rank * sizeof(int) bytes past the beginning of the window at the target. If disp_unit were set to 1, the correct put would be:

    MPI_Put(&one, 1, MPI_INT, 0, rank * sizeof(int), 1, MPI_INT, win);
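
A related option not mentioned in the original answer: MPI_Accumulate with the MPI_REPLACE operation behaves like a put, but accumulate operations come with element-wise atomicity guarantees, which is why a shared lock suffices even if several origins were to update the same element. A sketch under the same disp_unit == sizeof(int) assumption:

    int one = 1;
    // Shared lock suffices: accumulate operations are atomic per element
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
    MPI_Accumulate(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, MPI_REPLACE, win);
    MPI_Win_unlock(0, win);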

Step 4: Dealing with implementation specifics

The detailed program above works out of the box with Intel MPI. With Open MPI, special care must be taken. The library is built around a set of frameworks and implementing modules. The osc (one-sided communication) framework comes in two implementations, rdma and pt2pt. The default (in Open MPI 1.6.x and possibly earlier) is rdma, and for some reason it does not progress RMA operations at the target side when MPI_WIN_(UN)LOCK calls are made, which leads to deadlock-like behaviour unless another communication call is made (MPI_BARRIER in your case). The pt2pt module, on the other hand, progresses all operations as expected. Therefore, with Open MPI one has to start the program as follows in order to specifically select the pt2pt component:

 $ mpiexec --mca osc pt2pt ... 

A complete working C99 example follows:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <mpi.h>

    // Compares schedule and oldschedule and prints schedule if different
    // Also displays the time in seconds since the first invocation
    int fnz (int *schedule, int *oldschedule, int size)
    {
        static double starttime = -1.0;
        int diff = 0;

        for (int i = 0; i < size; i++)
            diff |= (schedule[i] != oldschedule[i]);

        if (diff)
        {
            int res = 0;

            if (starttime < 0.0) starttime = MPI_Wtime();
            printf("[%6.3f] Schedule:", MPI_Wtime() - starttime);
            for (int i = 0; i < size; i++)
            {
                printf("\t%d", schedule[i]);
                res += schedule[i];
                oldschedule[i] = schedule[i];
            }
            printf("\n");

            return(res == size-1);
        }
        return 0;
    }

    int main (int argc, char **argv)
    {
        MPI_Win win;
        int rank, size;

        MPI_Init(&argc, &argv);

        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
        {
            int *oldschedule = malloc(size * sizeof(int));
            // Use MPI to allocate memory for the target window
            int *schedule;
            MPI_Alloc_mem(size * sizeof(int), MPI_INFO_NULL, &schedule);

            for (int i = 0; i < size; i++)
            {
                schedule[i] = 0;
                oldschedule[i] = -1;
            }

            // Create a window. Set the displacement unit to sizeof(int) to
            // simplify the addressing at the originator processes
            MPI_Win_create(schedule, size * sizeof(int), sizeof(int),
                           MPI_INFO_NULL, MPI_COMM_WORLD, &win);

            int ready = 0;
            while (!ready)
            {
                // Without the lock/unlock schedule stays forever filled with 0s
                MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
                ready = fnz(schedule, oldschedule, size);
                MPI_Win_unlock(0, win);
            }
            printf("All workers checked in using RMA\n");

            // Release the window
            MPI_Win_free(&win);
            // Free the allocated memory
            MPI_Free_mem(schedule);
            free(oldschedule);

            printf("Master done\n");
        }
        else
        {
            int one = 1;

            // Worker processes do not expose memory in the window
            MPI_Win_create(NULL, 0, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

            // Simulate some work based on the rank
            sleep(2*rank);

            // Register with the master
            MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
            MPI_Put(&one, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
            MPI_Win_unlock(0, win);

            printf("Worker %d finished RMA\n", rank);

            // Release the window
            MPI_Win_free(&win);

            printf("Worker %d done\n", rank);
        }

        MPI_Finalize();
        return 0;
    }
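
Assuming a typical MPI installation where the compiler wrapper is called mpicc (the executable name rma is just an example matching the run command below), the code can be built with something like:

    $ mpicc -std=c99 -o rma rma.c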

Example output with 6 processes:

    $ mpiexec --mca osc pt2pt -n 6 rma
    [ 0.000] Schedule:  0   0   0   0   0   0
    [ 1.995] Schedule:  0   1   0   0   0   0
    Worker 1 finished RMA
    [ 3.989] Schedule:  0   1   1   0   0   0
    Worker 2 finished RMA
    [ 5.988] Schedule:  0   1   1   1   0   0
    Worker 3 finished RMA
    [ 7.995] Schedule:  0   1   1   1   1   0
    Worker 4 finished RMA
    [ 9.988] Schedule:  0   1   1   1   1   1
    All workers checked in using RMA
    Worker 5 finished RMA
    Worker 5 done
    Worker 4 done
    Worker 2 done
    Worker 1 done
    Worker 3 done
    Master done

The answer from Hristo Iliev works fine if I use newer versions of the Open MPI library.

However, on the cluster we are currently using this is not possible, and with the older versions there was deadlock behaviour on the final unlock calls, as described by Hristo. Adding the option --mca osc pt2pt did eliminate the deadlock in a sense, but the MPI_Win_unlock calls still did not complete until the process owning the accessed variable performed its own lock/unlock of the window. This is not very useful when you have jobs with very different completion times.

Therefore, from a pragmatic point of view, and although strictly speaking it leaves the topic of passive RMA synchronization (for which I apologize), I would like to point out a workaround that makes use of external files, for those who are stuck with older versions of the Open MPI library, so that they do not have to waste as much time as I did:

Basically, you create an external file containing the information about which (slave) process does which job, instead of an internal array. This way, you do not even need a master process dedicated solely to the bookkeeping of the slaves: it can also do the work. In any case, every process can look up in this file which job is to be done next and possibly determine that everything is done.

The important point is that this information file is not accessed by several processes at the same time, as this could cause work to be duplicated, or worse. The equivalent of locking and unlocking a window in MPI is most easily imitated here by using a lock file: this file is created by the process currently accessing the information file. The other processes have to wait for the current process to finish, checking with a slight time delay whether the lock file still exists.
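
The original answer gives no code for this, so the following is only a minimal sketch of such a lock-file protocol under my own assumptions (POSIX-only, hypothetical file name, atomic creation via O_CREAT | O_EXCL; on some cluster file systems, notably older NFS, O_EXCL may not be reliable):

    #include <fcntl.h>
    #include <unistd.h>
    #include <errno.h>

    // Acquire the lock by atomically creating the lock file.
    // open() with O_CREAT | O_EXCL fails with EEXIST if the file already
    // exists, so a successful creation acts as a test-and-set.
    static void acquire_lock(const char *lockfile)
    {
        int fd;
        while ((fd = open(lockfile, O_CREAT | O_EXCL | O_WRONLY, 0600)) == -1) {
            if (errno != EEXIST)
                return;        // real error: give up rather than spin forever
            usleep(100000);    // lock held by another process: retry after 100 ms
        }
        close(fd);
    }

    // Releasing the lock is simply removing the file.
    static void release_lock(const char *lockfile)
    {
        unlink(lockfile);
    }

A process would then call acquire_lock("schedule.lock") before reading or updating the information file and release_lock("schedule.lock") afterwards (the file name is, again, just an example).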

Full details can be found here.

