MPI Alltoallv or better individual Send/Recv? (Performance)

I have a number of processes (on the order of 100 to 1000), and each of them needs to send some data to a few (for example, about 10) of the other processes. (Usually, but not always, if A sends to B, then B also sends to A.) Each process knows how much data it will receive from which process.

So I could simply use MPI_Alltoallv, with many or most of the message lengths being zero. However, I have heard that for performance it would be better to use several MPI_Send and MPI_Recv calls rather than the global MPI_Alltoallv. What I don't understand: if a series of send and receive calls is more efficient than a single Alltoallv call, why isn't Alltoallv simply implemented as that series of sends and receives?
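For concreteness, here is a minimal sketch of the single-call approach described above, assuming the per-peer counts are already known (the function name and the use of MPI_INT are illustrative, not from the original post):

```c
/* Sketch: one MPI_Alltoallv call where most per-peer counts are zero.
 * sendcounts[p] / recvcounts[p] are assumed known in advance, as the
 * question states. */
#include <mpi.h>
#include <stdlib.h>

void sparse_exchange(int *sendcounts, int *recvcounts,
                     int *sendbuf, int *recvbuf, MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    int *sdispls = malloc(nprocs * sizeof(int));
    int *rdispls = malloc(nprocs * sizeof(int));

    /* Prefix sums turn the counts into displacements into the flat buffers. */
    sdispls[0] = 0;
    rdispls[0] = 0;
    for (int p = 1; p < nprocs; ++p) {
        sdispls[p] = sdispls[p - 1] + sendcounts[p - 1];
        rdispls[p] = rdispls[p - 1] + recvcounts[p - 1];
    }

    /* One collective call; the zero-length entries carry no payload,
     * though they may not be entirely free, as the answer below notes. */
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                  recvbuf, recvcounts, rdispls, MPI_INT, comm);

    free(sdispls);
    free(rdispls);
}
```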

Also, it would be much more convenient for me (and others?) to use just one global call. Besides, with several Send and Recv calls I might have to worry about not running into a deadlock (fixable with some odd-even ordering strategy, or something more complex? or by using buffered send/recv?).
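One common way to sidestep the deadlock worry is to use nonblocking point-to-point calls and a single wait; a minimal sketch, assuming the list of peer ranks and the message sizes are known (the function name and buffer layout are illustrative):

```c
/* Sketch: exchange with a small set of known peers using nonblocking calls,
 * so no odd-even ordering or buffered sends are needed to avoid deadlock. */
#include <mpi.h>
#include <stdlib.h>

void exchange_with_peers(int npeers, const int *peers,
                         int **sendbufs, const int *sendcounts,
                         int **recvbufs, const int *recvcounts,
                         MPI_Comm comm)
{
    MPI_Request *reqs = malloc(2 * npeers * sizeof(MPI_Request));

    /* Post all receives, then all sends; MPI_Waitall completes them in any
     * order, so the calls cannot block one another. */
    for (int i = 0; i < npeers; ++i)
        MPI_Irecv(recvbufs[i], recvcounts[i], MPI_INT, peers[i],
                  0, comm, &reqs[i]);

    for (int i = 0; i < npeers; ++i)
        MPI_Isend(sendbufs[i], sendcounts[i], MPI_INT, peers[i],
                  0, comm, &reqs[npeers + i]);

    MPI_Waitall(2 * npeers, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}
```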

Would you agree that MPI_Alltoallv is necessarily slower than, say, 10 MPI_Send and MPI_Recv calls; and if so, why and by how much?

1 answer

Usually the standard advice with collectives is the opposite: use a collective operation whenever possible instead of coding your own. The more information the MPI library has about the communication pattern, the more opportunities it has to optimize internally.

Unless special hardware support is available, collective calls are in fact implemented internally in terms of sends and receives. But the actual communication pattern is probably not just a plain series of sends and receives. For example, using a tree to broadcast a piece of data may be faster than having the same rank send it to a bunch of receivers. A lot of work goes into optimizing collective communications, and it is difficult to do better yourself.

Having said that, MPI_Alltoallv is somewhat different. It can be difficult to optimize for every irregular communication scenario at the MPI level, so it is conceivable that some custom communication code can do better. For example, an implementation of MPI_Alltoallv might be synchronizing: it could require that all processes "check in", even if they only have to send a zero-length message. Such an implementation is unlikely, I think, but here is one in the wild.

So the real answer is "it depends." If the library implementation of MPI_Alltoallv is a bad match for the task, custom communication code will win. But before going down that route, check whether the MPI-3 neighbor collectives are a good fit for your problem.
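A brief sketch of that neighbor-collective route, under the assumption that the communication is symmetric (if A sends to B, B sends to A) so the same peer list serves as both sources and destinations; the function name and its arguments are illustrative:

```c
/* Sketch: describe the ~10 peers per rank as a distributed graph topology,
 * then exchange only with those neighbors via MPI_Neighbor_alltoallv (MPI-3). */
#include <mpi.h>

void neighbor_exchange(int npeers, const int *peers,
                       const int *sendbuf, const int *sendcounts, const int *sdispls,
                       int *recvbuf, const int *recvcounts, const int *rdispls,
                       MPI_Comm comm)
{
    MPI_Comm graph_comm;

    /* Symmetric pattern: the same peers appear as sources and destinations. */
    MPI_Dist_graph_create_adjacent(comm,
                                   npeers, peers, MPI_UNWEIGHTED,  /* sources */
                                   npeers, peers, MPI_UNWEIGHTED,  /* destinations */
                                   MPI_INFO_NULL, 0 /* no reorder */,
                                   &graph_comm);

    /* Counts and displacements are indexed by neighbor (0..npeers-1), not by
     * rank in the full communicator, so the arrays stay small. */
    MPI_Neighbor_alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                           recvbuf, recvcounts, rdispls, MPI_INT,
                           graph_comm);

    MPI_Comm_free(&graph_comm);
}
```

This keeps the convenience of a single collective call while telling the library exactly which pairs of ranks actually communicate.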


Source: https://habr.com/ru/post/1447650/

