To do this, you need to go through at least one additional step in MPI.
The problem is that in the most general of the gather/scatter routines, MPI_Scatterv and MPI_Gatherv, you can pass a "vector" (the v) of counts and displacements, rather than just the single count of Scatter and Gather, but the datatype is assumed to be the same for every block. There's no way around that here; the memory layout of each block is different, so each has to be described by a different type. If there were only one kind of difference between the blocks (say, some had a different number of columns, or some had a different number of rows), then using different counts would be enough. But with both columns and rows differing, counts alone won't do it; you really need to be able to specify different types.
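To see the constraint concretely, here is (roughly, as declared in MPI-3) the MPI_Scatterv prototype: the counts and displacements are per-rank arrays, but there is only a single send datatype for all the blocks:

int MPI_Scatterv(const void *sendbuf, const int sendcounts[], const int displs[],
                 MPI_Datatype sendtype,               /* one type for every block */
                 void *recvbuf, int recvcount, MPI_Datatype recvtype,
                 int root, MPI_Comm comm);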
So what you really want is the often-discussed but never-implemented MPI_Scatterw (where w would mean vv, i.e., both the counts and the types are vectors). But such a thing doesn't exist. The closest you can get is the much more general MPI_Alltoallw call, which allows completely general all-to-all sending and receiving of data; as the spec says, "The MPI_ALLTOALLW function generalizes several MPI functions by carefully selecting the input arguments. For example, by making all but one process have sendcounts(i) = 0, this achieves an MPI_SCATTERW function."
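For comparison, MPI_Alltoallw takes per-rank arrays of datatypes as well as counts, and its displacements are given in bytes (which matters below); roughly, its MPI-3 prototype is:

int MPI_Alltoallw(const void *sendbuf, const int sendcounts[], const int sdispls[],
                  const MPI_Datatype sendtypes[],     /* one datatype per destination rank */
                  void *recvbuf, const int recvcounts[], const int rdispls[],
                  const MPI_Datatype recvtypes[],     /* one datatype per source rank */
                  MPI_Comm comm);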
So you can do this with MPI_Alltoallw by having every process other than the one that originally has all the data (we'll assume rank 0 here) set all of its send counts to zero. All tasks will also set all of their receive counts to zero except for the first, which holds the amount of data they will receive from rank 0.
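Concretely, the count/displacement/type arrays used in the snippet below might be initialized like this (a minimal sketch, assuming the size and localsizes variables that appear in the surrounding code):

/* sketch of the argument setup for MPI_Alltoallw; size is the number of ranks,
 * localsizes[2] holds the row/column counts of this rank's local block */
int sendcounts[size], senddispls[size];
MPI_Datatype sendtypes[size];
int recvcounts[size], recvdispls[size];
MPI_Datatype recvtypes[size];

for (int proc=0; proc<size; proc++) {
    /* by default, nobody sends or receives anything */
    sendcounts[proc] = 0;  senddispls[proc] = 0;  sendtypes[proc] = MPI_CHAR;
    recvcounts[proc] = 0;  recvdispls[proc] = 0;  recvtypes[proc] = MPI_CHAR;
}

/* everyone receives exactly one block's worth of contiguous chars, from rank 0 */
recvcounts[0] = localsizes[0]*localsizes[1];
recvdispls[0] = 0;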
On the sending process, rank 0, you first need to define four different types (for the 4 different sizes of subarrays); then all the send counts are 1, and the only remaining part is working out the send displacements (which, unlike in scatterv, are here in units of bytes, because there is no single type you could use as a unit):
if (rank == 0) {
    /* the 4 block shapes differ only by +0/+1 rows and +0/+1 columns */
    MPI_Datatype blocktypes[4];
    int subsizes[2];
    int starts[2] = {0,0};
    for (int i=0; i<2; i++) {
        subsizes[0] = blocksize+i;
        for (int j=0; j<2; j++) {
            subsizes[1] = blocksize+j;
            MPI_Type_create_subarray(2, globalsizes, subsizes, starts, MPI_ORDER_C, MPI_CHAR, &blocktypes[2*i+j]);
            MPI_Type_commit(&blocktypes[2*i+j]);
        }
    }

    /* now figure out the displacement and type of each processor's data */
    for (int proc=0; proc<size; proc++) {
        int row, col;
        rowcol(proc, blocks, &row, &col);

        sendcounts[proc] = 1;
        senddispls[proc] = (row*blocksize*globalsizes[1] + col*blocksize)*sizeof(char);

        int idx = typeIdx(row, col, blocks);
        sendtypes[proc] = blocktypes[idx];
    }
}

MPI_Alltoallw(globalptr, sendcounts, senddispls, sendtypes,
              &(localdata[0][0]), recvcounts, recvdispls, recvtypes,
              MPI_COMM_WORLD);
And it will work.
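The snippets here also lean on a few small helpers (rowcol, typeIdx, isLastRow, isLastCol) that are defined in the full listing at the end; roughly, they look something like this sketch:

/* sketch of the helpers assumed above; the full listing at the end of this
 * answer contains the actual definitions */
void rowcol(const int proc, const int blocks[2], int *row, int *col) {
    *row = proc / blocks[1];     /* which block row this rank owns */
    *col = proc % blocks[1];     /* which block column this rank owns */
}

int isLastRow(const int row, const int blocks[2]) { return row == blocks[0]-1; }
int isLastCol(const int col, const int blocks[2]) { return col == blocks[1]-1; }

/* which of the 4 subarray types applies: the last block row and/or the last
 * block column get the extra row/column */
int typeIdx(const int row, const int col, const int blocks[2]) {
    return 2*isLastRow(row, blocks) + isLastCol(col, blocks);
}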
But the downside is that the Alltoallw function is so completely general that it is hard for implementations to do much in the way of optimization; so I'd be surprised if this performed as well as a scatter of equally sized blocks.
So another approach is to do something like two stages of communication.
The simplest such approach follows from noticing that you can almost get all the data where it needs to go with a single MPI_Scatterv() call: in your example, if we operate in units of a single column vector, with columns = 1 and rows = 3 (the number of rows in most of the blocks of the domain), you can scatter almost all of the global data to the other processors. Each processor gets 3 or 4 of these vectors, which distributes all of the data except the very last row of the global array, which can be handled by a simple second scatterv. That looks like this:
/* One vector type describes a column of a normal-sized block in the global
 * array, the other a column in the local array; both are resized to an extent
 * of 1 char so counts and displacements can be given in units of columns. */
MPI_Datatype vec, localvec;
MPI_Type_vector(blocksize, 1, localsizes[1], MPI_CHAR, &localvec);
MPI_Type_create_resized(localvec, 0, sizeof(char), &localvec);
MPI_Type_commit(&localvec);

MPI_Type_vector(blocksize, 1, globalsizes[1], MPI_CHAR, &vec);
MPI_Type_create_resized(vec, 0, sizeof(char), &vec);
MPI_Type_commit(&vec);

/* scatter the column vectors */
int sendcounts[ size ];
int senddispls[ size ];
if (rank == 0) {
    for (int proc=0; proc<size; proc++) {
        int row, col;
        rowcol(proc, blocks, &row, &col);

        sendcounts[proc] = isLastCol(col, blocks) ? blocksize+1 : blocksize;
        senddispls[proc] = (row*blocksize*globalsizes[1] + col*blocksize);
    }
}

int recvcounts = localsizes[1];
MPI_Scatterv(globalptr, sendcounts, senddispls, vec,
             &(localdata[0][0]), recvcounts, localvec, 0, MPI_COMM_WORLD);

MPI_Type_free(&localvec);
if (rank == 0) MPI_Type_free(&vec);

/* now do one more scatter, sending just the last row of global data
 * to the processors in the last block row */
if (rank == 0) {
    for (int proc=0; proc<size; proc++) {
        int row, col;
        rowcol(proc, blocks, &row, &col);
        sendcounts[proc] = 0;
        senddispls[proc] = 0;

        if ( isLastRow(row,blocks) ) {
            sendcounts[proc] = blocksize;
            senddispls[proc] = (globalsizes[0]-1)*globalsizes[1]+col*blocksize;
            if ( isLastCol(col,blocks) )
                sendcounts[proc] += 1;
        }
    }
}

recvcounts = 0;
if ( isLastRow(myrow, blocks) ) {
    recvcounts = blocksize;
    if ( isLastCol(mycol, blocks) )
        recvcounts++;
}
MPI_Scatterv(globalptr, sendcounts, senddispls, MPI_CHAR,
             &(localdata[blocksize][0]), recvcounts, MPI_CHAR, 0, MPI_COMM_WORLD);
So far so good. But it's a shame that most of the processors are sitting around doing nothing during that final "cleanup" scatter.
So a nicer approach is to scatter all the rows in a first phase, and then scatter that data among the columns in a second phase. Here we create new communicators, with each processor belonging to two of them: one representing the other processors in the same block row, and one for those in the same block column. In the first step, the origin processor distributes all the rows of the global array to the other processors in the same column communicator, which can be done in a single scatterv. Then those processors, using a single scatterv and the same column datatype as in the previous example, scatter the columns to each processor in the same block row. The result is two fairly simple scattervs distributing all the data:
/* create communicators containing the processors in the same block row or column */
MPI_Comm colComm, rowComm;
MPI_Comm_split(MPI_COMM_WORLD, myrow, rank, &rowComm);
MPI_Comm_split(MPI_COMM_WORLD, mycol, rank, &colComm);

/* first, scatter the global array by rows to the processors in column 0 */
if (mycol == 0) {
    int sendcounts[ blocks[0] ];
    int senddispls[ blocks[0] ];
    senddispls[0] = 0;

    for (int row=0; row<blocks[0]; row++) {
        /* each block row is blocksize rows of globalsizes[1] chars... */
        sendcounts[row] = blocksize*globalsizes[1];
        if (row > 0)
            senddispls[row] = senddispls[row-1] + sendcounts[row-1];
    }
    /* ...except the last block row, which gets one row more */
    sendcounts[blocks[0]-1] += globalsizes[1];

    /* allocate my row data and perform the scatter of rows */
    rowdata = allocchar2darray( sendcounts[myrow], globalsizes[1] );
    MPI_Scatterv(globalptr, sendcounts, senddispls, MPI_CHAR,
                 &(rowdata[0][0]), sendcounts[myrow], MPI_CHAR, 0, colComm);
}

/* now, within each block row, scatter the columns, using the same
 * column-vector types as in the previous example */
int locnrows = blocksize;
if ( isLastRow(myrow, blocks) )
    locnrows++;

MPI_Datatype vec, localvec;
MPI_Type_vector(locnrows, 1, globalsizes[1], MPI_CHAR, &vec);
MPI_Type_create_resized(vec, 0, sizeof(char), &vec);
MPI_Type_commit(&vec);

MPI_Type_vector(locnrows, 1, localsizes[1], MPI_CHAR, &localvec);
MPI_Type_create_resized(localvec, 0, sizeof(char), &localvec);
MPI_Type_commit(&localvec);

int sendcounts[ blocks[1] ];
int senddispls[ blocks[1] ];
if (mycol == 0) {
    for (int col=0; col<blocks[1]; col++) {
        sendcounts[col] = isLastCol(col, blocks) ? blocksize+1 : blocksize;
        senddispls[col] = col*blocksize;
    }
}
char *rowptr = (mycol == 0) ? &(rowdata[0][0]) : NULL;

MPI_Scatterv(rowptr, sendcounts, senddispls, vec,
             &(localdata[0][0]), sendcounts[mycol], localvec, 0, rowComm);
which is simpler and should be a relatively good balance between performance and robustness.
All three of these methods work:
bash-3.2$ mpirun -np 6 ./allmethods alltoall
Global array:
abcdefg hijklmn opqrstu vwxyzab cdefghi jklmnop qrstuvw xyzabcd efghijk lmnopqr
Method - alltoall
Rank 0: abc hij opq
Rank 1: defg klmn rstu
Rank 2: vwx cde jkl
Rank 3: yzab fghi mnop
Rank 4: qrs xyz efg lmn
Rank 5: tuvw abcd hijk opqr

bash-3.2$ mpirun -np 6 ./allmethods twophasevecs
Global array:
abcdefg hijklmn opqrstu vwxyzab cdefghi jklmnop qrstuvw xyzabcd efghijk lmnopqr
Method - two phase, vectors, then cleanup
Rank 0: abc hij opq
Rank 1: defg klmn rstu
Rank 2: vwx cde jkl
Rank 3: yzab fghi mnop
Rank 4: qrs xyz efg lmn
Rank 5: tuvw abcd hijk opqr

bash-3.2$ mpirun -np 6 ./allmethods twophaserowcol
Global array:
abcdefg hijklmn opqrstu vwxyzab cdefghi jklmnop qrstuvw xyzabcd efghijk lmnopqr
Method - two phase - row, cols
Rank 0: abc hij opq
Rank 1: defg klmn rstu
Rank 2: vwx cde jkl
Rank 3: yzab fghi mnop
Rank 4: qrs xyz efg lmn
Rank 5: tuvw abcd hijk opqr
Following is code that implements all of these methods; you can set the block sizes to sizes more typical of your problem and run on a realistic number of processors to get some sense of which will be best for your application.
#include <stdio.h>