Let me start by saying that this isn't something you generally want to do at all - scatter and gather huge chunks of data from some "master" process. Normally you want each task chugging away at its own piece of the puzzle, and you should aim for no one processor ever needing a "global view" of all the data; as soon as you require that, you limit scalability and the problem size. If you're doing this for I/O - one process reads the data, then scatters it, then gathers it back for writing - you'll eventually want to look into MPI-IO.
That said, MPI has very nice ways of pulling arbitrary data out of memory, and of scattering/gathering it to and from a set of processors. Unfortunately, that requires a fair number of MPI concepts - MPI types, extents, and collective operations. A lot of the basic ideas are discussed in the answer to this question - MPI_Type_create_subarray and MPI_Gather .
Update. In the cold light of day, this is a lot of code and not a lot of explanation, so let me expand a little.
Consider a 1d integer global array that task 0 holds and that you want to distribute to a number of MPI tasks, so that each gets a piece in its local array. Say you have 4 tasks, and the global array is [01234567] . You could have task 0 send four messages (including one to itself) to distribute the data, and when it's time to re-assemble, receive four messages to bundle it back together; but that obviously gets very time-consuming with a large number of processes. There are optimized routines for these sorts of operations - scatter/gather operations. So in this case you'd do something like this:
int global[8];        /* only task 0 has meaningful data here */
int local[2];         /* everyone has this */
const int root = 0;   /* the processor with the initial global data */

if (rank == root) {
    for (int i=0; i<8; i++) global[i] = i;
}

MPI_Scatter(global, 2, MPI_INT,    /* send everyone 2 ints from global */
            local,  2, MPI_INT,    /* each proc receives 2 ints into local */
            root, MPI_COMM_WORLD);
After this, the data on each processor looks like
task 0:  local:[01]  global: [01234567]
task 1:  local:[23]  global: [garbage-]
task 2:  local:[45]  global: [garbage-]
task 3:  local:[67]  global: [garbage-]
That is, the scatter operation takes the global array and sends contiguous 2-integer chunks to all the processors.
To re-assemble the array, we use the MPI_Gather() operation, which works exactly the same way but in reverse:
for (int i=0; i<2; i++)
    local[i] = local[i] + rank;

MPI_Gather(local,  2, MPI_INT,    /* everyone sends 2 ints from local */
           global, 2, MPI_INT,    /* root receives 2 ints from each proc into global */
           root, MPI_COMM_WORLD);
and now the data looks like
task 0:  local:[01]  global: [0134679a]
task 1:  local:[34]  global: [garbage-]
task 2:  local:[67]  global: [garbage-]
task 3:  local:[9a]  global: [garbage-]
Gather brings all the data back, and here "a" is 10 because I didn't think my formatting through carefully enough when starting this example.
What happens if the number of data points doesn't evenly divide the number of processes, and we need to send different numbers of items to each process? Then you need a generalized version of scatter, MPI_Scatterv() , which lets you specify the counts for each processor, and displacements - where in the global array each piece of data starts. So let's say you have an array of characters [abcdefghi] with 9 characters, and you're going to assign two characters to every process except the last one, which gets three. Then you'd need
char global[9];                  /* only task 0 has this */
char local[3] = {'-','-','-'};   /* everyone has this */
int mynum;                       /* how many items I receive */
const int root = 0;              /* the processor with the initial global data */

if (rank == root) {
    for (int i=0; i<9; i++) global[i] = 'a'+i;
}

int counts[4] = {2,2,2,3};   /* how many items each proc gets */
mynum = counts[rank];
int displs[4] = {0,2,4,6};   /* where each proc's data starts in the global array */

MPI_Scatterv(global, counts, displs, MPI_CHAR,   /* proc i gets counts[i] chars from displs[i] */
             local, mynum, MPI_CHAR,             /* I receive mynum chars into local */
             root, MPI_COMM_WORLD);
Now the data looks like
task 0:  local:[ab-]  global: [abcdefghi]
task 1:  local:[cd-]  global: [garbage--]
task 2:  local:[ef-]  global: [garbage--]
task 3:  local:[ghi]  global: [garbage--]
You've now used Scatterv to distribute irregular amounts of data. The displacement in each case is two*rank (measured in characters; the displacement is in units of the type being sent for a scatter or received for a gather - it's not generally in bytes or anything like that) from the start of the array, and the counts are {2,2,2,3}. If it had been the first processor that we wanted to have 3 characters, we would have set counts = {3,2,2,2} and the displacements would have been {0,3,5,7}. Gatherv again works exactly the same way but in reverse; the counts and displs arrays would remain the same.
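For illustration, here is a minimal sketch of that reverse step, reusing the variables from the Scatterv example above (this particular snippet isn't part of the original example):

/* Sketch: gather the local pieces back into the global array on root,
   reusing the same counts[] and displs[] as in the Scatterv call. */
MPI_Gatherv(local, mynum, MPI_CHAR,     /* I send mynum chars from local */
            global, counts, displs,     /* root receives counts[i] chars at displs[i] */
            MPI_CHAR, root, MPI_COMM_WORLD);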
Now, 2d is a little trickier. If we want to send 2d sub-blocks of a 2d array, the data we're sending is no longer contiguous. If we're sending (say) 3x3 sub-blocks of a 6x6 array to 4 processors, the data we're sending has holes in it:
2D Array

   ---------
   |000|111|
   |000|111|
   |000|111|
   |---+---|
   |222|333|
   |222|333|
   |222|333|
   ---------

Actual layout in memory

   [000111000111000111222333222333222333]
(Note that all high-performance computing comes down to understanding the layout of data in memory.)
If we want to send the data marked "1" to task 1, we need to skip three values, send three values, skip three values, send three values, skip three values, send three values. A second complication is where the subregions start and stop; note that region "1" doesn't start where region "0" stops; after the last element of region "0", the next location in memory is only partway through region "1".
First, consider the first problem - how to pull out just the data we want to send. We could always copy all of the "0" region data out into another, contiguous array and send that; if we planned it out carefully enough, we could even do it in such a way that we could call MPI_Scatter on the result. But we'd rather not have to transpose our entire main data structure that way.
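Just to make concrete what that copy-out approach would look like, here is a minimal sketch of it (this is the approach we're trying to avoid; the buffer name is mine, and it assumes global is declared as int global[6][6]):

/* Sketch of the pack-by-hand alternative: copy the 3x3 "0" region of the
   6x6 global array into a contiguous buffer before sending it. */
int sendbuf[3*3];
for (int row=0; row<3; row++)
    for (int col=0; col<3; col++)
        sendbuf[row*3 + col] = global[row][col];   /* region "0" starts at [0][0] */
/* sendbuf is now contiguous, so a plain MPI_Send (or a carefully arranged
   MPI_Scatter) would work on it. */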
So far, all the MPI data types we've used have been simple ones - MPI_INT specifies (say) 4 bytes in a row. However, MPI lets you create your own data types that describe arbitrarily complex data layouts in memory. And this case - rectangular subregions of an array - is common enough that there's a specific call for it. For the two-dimensional case we're describing above,
MPI_Datatype newtype;
int sizes[2]    = {6,6};   /* size of the global array */
int subsizes[2] = {3,3};   /* size of the sub-region */
int starts[2]   = {0,0};   /* where the sub-region begins: region "0" starts at [0][0] */

MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_INT, &newtype);
MPI_Type_commit(&newtype);
This creates a type that picks out just the "0" region of the global array; we could now send just that piece of data to another processor
MPI_Send(&(global[0][0]), 1, newtype, dest, tag, MPI_COMM_WORLD);
and the receiving process could receive it into a local array. Note that the receiving process, if it's only receiving into a 3x3 array, cannot describe what it's receiving as the newtype type; that no longer describes the memory layout. Instead, it just receives a block of 3*3 = 9 integers:
MPI_Recv(&(local[0][0]), 3*3, MPI_INT, 0, tag, MPI_COMM_WORLD);
Note that we could do this for the other sub-regions too, either by creating a different type (with a different starts array) for the other blocks, or simply by sending starting from the first element of the particular block:
MPI_Send(&(global[0][3]), 1, newtype, dest, tag, MPI_COMM_WORLD);   /* region "1" */
MPI_Send(&(global[3][0]), 1, newtype, dest, tag, MPI_COMM_WORLD);   /* region "2" */
MPI_Send(&(global[3][3]), 1, newtype, dest, tag, MPI_COMM_WORLD);   /* region "3" */
Finally, note that we require global and local to be contiguous chunks of memory here; that is, &(global[0][0]) and &(local[0][0]) (or, equivalently, *global and *local) must point to contiguous 6*6 and 3*3 chunks of memory, which isn't guaranteed by the usual way of allocating dynamic multi-dimensional arrays. How to do this is shown below.
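For reference, a minimal sketch of one common way to get such a contiguous dynamic allocation (the helper name alloc2d is mine; the full sample code at the end does this in its own way):

#include <stdlib.h>

/* Sketch: allocate an n-by-m int array as one contiguous block, plus a
   row-pointer array so it can still be indexed as array[i][j]. */
int **alloc2d(int n, int m) {
    int  *data  = malloc(n * m * sizeof(int));   /* one contiguous chunk */
    int **array = malloc(n * sizeof(int *));
    for (int i=0; i<n; i++)
        array[i] = &(data[i*m]);                 /* row i points into the chunk */
    return array;
}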
Now that we understand how to specify subregions, there's only one more thing to discuss before using scatter/gather operations, and that's the "extent" of these types. We couldn't just use MPI_Scatter() (or even Scatterv) with these types yet, because these types have an extent of 16 integers; that is, where they end is 16 integers after where they start - and where they end doesn't line up nicely with where the next block begins, so we can't just use scatter - it would pick the wrong place to start sending data to the next processor.
Of course, we could use MPI_Scatterv() and specify the displacements ourselves, and that's what we'll do - except the displacements are in units of the send-type's extent, and that doesn't help us either; the blocks start at offsets of (0,3,18,21) integers from the start of the global array, and the fact that a block ends 16 integers from where it starts doesn't let us express those offsets as integer multiples of the extent at all.
To deal with this, MPI lets you set the extent of the type for the purposes of these calculations. It doesn't truncate the type; it's just used for figuring out where the next element starts given where the last one ended. For types like these with holes in them, it's frequently handy to set the extent to something smaller than the distance in memory to the actual end of the type.
We can set the extent to be anything convenient to us. We could just make the extent 1 integer, and then set the displacements in units of integers. In this case, though, I like to set the extent to be 3 integers - the size of a sub-row - that way, block "1" starts immediately after block "0", and block "3" starts immediately after block "2". Unfortunately, it doesn't work quite as nicely when jumping from block "2" to block "3", but that can't be helped.
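To make the arithmetic concrete, here's a small worked sketch (not part of the original example; the variable names are mine):

/* The upper-left corners of the four 3x3 blocks sit at integer offsets
   0, 3, 18, and 21 from &(global[0][0]) in the 6x6 array.  Measured in
   units of the resized extent (3 ints), those become {0, 1, 6, 7}. */
int block_starts[4] = {0, 3, 18, 21};   /* offsets in ints */
int extent_ints     = 3;                /* the extent we'll resize the type to */
int block_displs[4];
for (int i=0; i<4; i++)
    block_displs[i] = block_starts[i] / extent_ints;   /* -> {0, 1, 6, 7} */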
So to scatter the sub-blocks in this case, we'd do the following:
MPI_Datatype type, resizedtype;
int sizes[2]    = {6,6};   /* size of the global array */
int subsizes[2] = {3,3};   /* size of the sub-region */
int starts[2]   = {0,0};   /* where the sub-region begins */

MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_INT, &type);
MPI_Type_create_resized(type, 0, 3*sizeof(int), &resizedtype);   /* set extent to one sub-row */
MPI_Type_commit(&resizedtype);
Here we've created the same block type as before, but we've resized it; we haven't changed where the type "starts" (the 0), but we've changed where it "ends" (3 ints). We didn't mention this before, but MPI_Type_commit is required in order to use the type; you only need to commit the final type you actually use, though, not any intermediate steps. You use MPI_Type_free to free the type when you're done.
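For instance, once all the communication with resizedtype below is finished, the cleanup might look like this (just a sketch):

MPI_Type_free(&resizedtype);   /* the committed type we actually used */
MPI_Type_free(&type);          /* the intermediate subarray type */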
So now, finally, we can scatterv the blocks: the data manipulation above is a little complicated, but once it's done, the scatterv looks just like before:
int counts[4] = {1,1,1,1};   /* each proc gets one block */
int displs[4] = {0,1,6,7};   /* where each proc's block starts, in units of the block extent */

MPI_Scatterv(global, counts, displs, resizedtype,   /* proc i gets counts[i] blocks from displs[i] */
             local, 3*3, MPI_INT,                   /* I receive 3*3 ints into local */
             root, MPI_COMM_WORLD);
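Gathering the blocks back works the same way in reverse; here's a minimal sketch reusing the same counts, displs, and resizedtype (the full sample code below does essentially this):

MPI_Gatherv(local, 3*3, MPI_INT,       /* I send 3*3 ints from local */
            global, counts, displs,    /* root receives 1 block from each proc at displs[i] */
            resizedtype, root, MPI_COMM_WORLD);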
And now we're done, after a little tour of scatter, gather, and MPI derived types.
Following is some example code that shows both the gather and the scatter operation, with character arrays. Running the program:
$ mpirun -n 4 ./gathervarray
Global array is:
0123456789
3456789012
6789012345
9012345678
2345678901
5678901234
8901234567
1234567890
4567890123
7890123456
Local process on rank 0 is:
|01234|
|34567|
|67890|
|90123|
|23456|
Local process on rank 1 is:
|56789|
|89012|
|12345|
|45678|
|78901|
Local process on rank 2 is:
|56789|
|89012|
|12345|
|45678|
|78901|
Local process on rank 3 is:
|01234|
|34567|
|67890|
|90123|
|23456|
Processed grid:
AAAAABBBBB
AAAAABBBBB
AAAAABBBBB
AAAAABBBBB
AAAAABBBBB
CCCCCDDDDD
CCCCCDDDDD
CCCCCDDDDD
CCCCCDDDDD
CCCCCDDDDD
and the code follows.
#include <stdio.h>