Let me start by saying that this isn't something you generally want to do at all - scatter and gather huge chunks of data from some "master" process. Normally you want each task chugging away at its own piece of the puzzle, and you should aim for no one processor ever needing a "global view" of all the data; as soon as you require that, you limit scalability and the problem size. If you're doing this for I/O - one process reads the data, then scatters it, then gathers it back for writing - you'll eventually want to look into MPI-IO.
That said, MPI has very nice ways of pulling arbitrary data out of memory, and of scattering/gathering it to and from a set of processors. Unfortunately, that requires a fair number of MPI concepts - MPI types, extents, and collective operations. A lot of the basic ideas are discussed in the answer to this question - MPI_Type_create_subarray and MPI_Gather .
Update. In the cold light of day, this is a lot of code and not a lot of explanation, so let me expand a little.
Consider a 1d integer global array that task 0 holds and that you want to distribute to a number of MPI tasks, so that each gets a piece in its local array. Say you have 4 tasks, and the global array is [01234567] . You could have task 0 send four messages (including one to itself) to distribute the data, and when it's time to re-assemble, receive four messages to bundle it back together; but that obviously gets very time-consuming with a large number of processes. There are optimized routines for these sorts of operations - scatter/gather operations. So in this case you'd do something like this:
int global[8];        /* only task 0 has meaningful data here */
int local[2];         /* everyone has this */
const int root = 0;   /* the processor with the initial global data */

if (rank == root) {
    for (int i=0; i<8; i++) global[i] = i;
}

MPI_Scatter(global, 2, MPI_INT,    /* send everyone 2 ints from global */
            local,  2, MPI_INT,    /* each proc receives 2 ints into local */
            root, MPI_COMM_WORLD);
After this, the data on each processor looks like
task 0:  local:[01]  global: [01234567]
task 1:  local:[23]  global: [garbage-]
task 2:  local:[45]  global: [garbage-]
task 3:  local:[67]  global: [garbage-]
That is, the scatter operation takes the global array and sends contiguous 2-integer chunks to all the processors.
To re-assemble the array, we use the MPI_Gather() operation, which works exactly the same way but in reverse:
for (int i=0; i<2; i++)
    local[i] = local[i] + rank;

MPI_Gather(local,  2, MPI_INT,    /* everyone sends 2 ints from local */
           global, 2, MPI_INT,    /* root receives 2 ints from each proc into global */
           root, MPI_COMM_WORLD);
and now the data looks like
task 0:  local:[01]  global: [0134679a]
task 1:  local:[34]  global: [garbage-]
task 2:  local:[67]  global: [garbage-]
task 3:  local:[9a]  global: [garbage-]
Gather brings all the data back, and here "a" is 10 because I didn't think my formatting through carefully enough when starting this example.
What happens if the number of data points doesn't evenly divide the number of processes, and we need to send different numbers of items to each process? Then you need a generalized version of scatter, MPI_Scatterv() , which lets you specify the counts for each processor, and displacements - where in the global array each piece of data starts. So let's say you have an array of characters [abcdefghi] with 9 characters, and you're going to assign two characters to every process except the last one, which gets three. Then you'd need
char global[9];                  /* only task 0 has this */
char local[3] = {'-','-','-'};   /* everyone has this */
int mynum;                       /* how many items I receive */
const int root = 0;              /* the processor with the initial global data */

if (rank == root) {
    for (int i=0; i<9; i++) global[i] = 'a'+i;
}

int counts[4] = {2,2,2,3};   /* how many items each proc gets */
mynum = counts[rank];
int displs[4] = {0,2,4,6};   /* where each proc's data starts in the global array */

MPI_Scatterv(global, counts, displs, MPI_CHAR,   /* proc i gets counts[i] chars from displs[i] */
             local, mynum, MPI_CHAR,             /* I receive mynum chars into local */
             root, MPI_COMM_WORLD);
Now the data looks like
task 0:  local:[ab-]  global: [abcdefghi]
task 1:  local:[cd-]  global: [garbage--]
task 2:  local:[ef-]  global: [garbage--]
task 3:  local:[ghi]  global: [garbage--]
You've now used Scatterv to distribute irregular amounts of data. The displacement in each case is two*rank (measured in characters; the displacement is in units of the type being sent for a scatter or received for a gather - it's not generally in bytes or anything like that) from the start of the array, and the counts are {2,2,2,3}. If it had been the first processor that we wanted to have 3 characters, we would have set counts = {3,2,2,2} and the displacements would have been {0,3,5,7}. Gatherv again works exactly the same way but in reverse; the counts and displs arrays would remain the same.
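For illustration, here is a minimal sketch of that reverse step, reusing the variables from the Scatterv example above (this particular snippet isn't part of the original example):

/* Sketch: gather the local pieces back into the global array on root,
   reusing the same counts[] and displs[] as in the Scatterv call. */
MPI_Gatherv(local, mynum, MPI_CHAR,     /* I send mynum chars from local */
            global, counts, displs,     /* root receives counts[i] chars at displs[i] */
            MPI_CHAR, root, MPI_COMM_WORLD);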
Now, 2d is a little trickier. If we want to send 2d sub-blocks of a 2d array, the data we're sending is no longer contiguous. If we're sending (say) 3x3 sub-blocks of a 6x6 array to 4 processors, the data we're sending has holes in it:
2D Array

   ---------
   |000|111|
   |000|111|
   |000|111|
   |---+---|
   |222|333|
   |222|333|
   |222|333|
   ---------

Actual layout in memory

   [000111000111000111222333222333222333]
(Note that all high-performance computing comes down to understanding the layout of data in memory.)
If we want to send the data marked "1" to task 1, we need to skip three values, send three values, skip three values, send three values, skip three values, send three values. A second complication is where the subregions start and stop; note that region "1" doesn't start where region "0" stops; after the last element of region "0", the next location in memory is only partway through region "1".
First, consider the first problem - how to pull out just the data we want to send. We could always copy all of the "0" region data out into another, contiguous array and send that; if we planned it out carefully enough, we could even do it in such a way that we could call MPI_Scatter on the result. But we'd rather not have to transpose our entire main data structure that way.
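Just to make concrete what that copy-out approach would look like, here is a minimal sketch of it (this is the approach we're trying to avoid; the buffer name is mine, and it assumes global is declared as int global[6][6]):

/* Sketch of the pack-by-hand alternative: copy the 3x3 "0" region of the
   6x6 global array into a contiguous buffer before sending it. */
int sendbuf[3*3];
for (int row=0; row<3; row++)
    for (int col=0; col<3; col++)
        sendbuf[row*3 + col] = global[row][col];   /* region "0" starts at [0][0] */
/* sendbuf is now contiguous, so a plain MPI_Send (or a carefully arranged
   MPI_Scatter) would work on it. */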
So far, all the MPI data types we've used have been simple ones - MPI_INT specifies (say) 4 bytes in a row. However, MPI lets you create your own data types that describe arbitrarily complex data layouts in memory. And this case - rectangular subregions of an array - is common enough that there's a specific call for it. For the two-dimensional case we're describing above,
MPI_Datatype newtype;
int sizes[2]    = {6,6};   /* size of the global array */
int subsizes[2] = {3,3};   /* size of the sub-region */
int starts[2]   = {0,0};   /* where the sub-region begins: region "0" starts at [0][0] */

MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_INT, &newtype);
MPI_Type_commit(&newtype);
This creates a type that picks out just the "0" region of the global array; we could now send just that piece of data to another processor
MPI_Send(&(global[0][0]), 1, newtype, dest, tag, MPI_COMM_WORLD);
and the receiving process could receive it into a local array. Note that the receiving process, if it's only receiving into a 3x3 array, cannot describe what it's receiving as the newtype type; that no longer describes the memory layout. Instead, it just receives a block of 3*3 = 9 integers:
MPI_Recv(&(local[0][0]), 3*3, MPI_INT, 0, tag, MPI_COMM_WORLD);
Note that we could do this for the other sub-regions too, either by creating a different type (with a different starts array) for the other blocks, or simply by sending starting from the first element of the particular block:
MPI_Send(&(global[0][3]), 1, newtype, dest, tag, MPI_COMM_WORLD);   /* region "1" */
MPI_Send(&(global[3][0]), 1, newtype, dest, tag, MPI_COMM_WORLD);   /* region "2" */
MPI_Send(&(global[3][3]), 1, newtype, dest, tag, MPI_COMM_WORLD);   /* region "3" */
Finally, note that we require global and local to be contiguous chunks of memory here; that is, &(global[0][0]) and &(local[0][0]) (or, equivalently, *global and *local) must point to contiguous 6*6 and 3*3 chunks of memory, which isn't guaranteed by the usual way of allocating dynamic multi-dimensional arrays. How to do this is shown below.
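For reference, a minimal sketch of one common way to get such a contiguous dynamic allocation (the helper name alloc2d is mine; the full sample code at the end does this in its own way):

#include <stdlib.h>

/* Sketch: allocate an n-by-m int array as one contiguous block, plus a
   row-pointer array so it can still be indexed as array[i][j]. */
int **alloc2d(int n, int m) {
    int  *data  = malloc(n * m * sizeof(int));   /* one contiguous chunk */
    int **array = malloc(n * sizeof(int *));
    for (int i=0; i<n; i++)
        array[i] = &(data[i*m]);                 /* row i points into the chunk */
    return array;
}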
Now that we understand how to specify subregions, there's only one more thing to discuss before using scatter/gather operations, and that's the "extent" of these types. We couldn't just use MPI_Scatter() (or even Scatterv) with these types yet, because these types have an extent of 16 integers; that is, where they end is 16 integers after where they start - and where they end doesn't line up nicely with where the next block begins, so we can't just use scatter - it would pick the wrong place to start sending data to the next processor.
Of course, we could use MPI_Scatterv() and specify the displacements ourselves, and that's what we'll do - except the displacements are in units of the send-type's extent, and that doesn't help us either; the blocks start at offsets of (0,3,18,21) integers from the start of the global array, and the fact that a block ends 16 integers from where it starts doesn't let us express those offsets as integer multiples of the extent at all.
To deal with this, MPI lets you set the extent of the type for the purposes of these calculations. It doesn't truncate the type; it's just used for figuring out where the next element starts given where the last one ended. For types like these with holes in them, it's frequently handy to set the extent to something smaller than the distance in memory to the actual end of the type.
We can set the extent to be anything convenient to us. We could just make the extent 1 integer, and then set the displacements in units of integers. In this case, though, I like to set the extent to be 3 integers - the size of a sub-row - that way, block "1" starts immediately after block "0", and block "3" starts immediately after block "2". Unfortunately, it doesn't work quite as nicely when jumping from block "2" to block "3", but that can't be helped.
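To make the arithmetic concrete, here's a small worked sketch (not part of the original example; the variable names are mine):

/* The upper-left corners of the four 3x3 blocks sit at integer offsets
   0, 3, 18, and 21 from &(global[0][0]) in the 6x6 array.  Measured in
   units of the resized extent (3 ints), those become {0, 1, 6, 7}. */
int block_starts[4] = {0, 3, 18, 21};   /* offsets in ints */
int extent_ints     = 3;                /* the extent we'll resize the type to */
int block_displs[4];
for (int i=0; i<4; i++)
    block_displs[i] = block_starts[i] / extent_ints;   /* -> {0, 1, 6, 7} */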
So to scatter the sub-blocks in this case, we'd do the following:
MPI_Datatype type, resizedtype;
int sizes[2]    = {6,6};   /* size of the global array */
int subsizes[2] = {3,3};   /* size of the sub-region */
int starts[2]   = {0,0};   /* where the sub-region begins */

MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_INT, &type);
MPI_Type_create_resized(type, 0, 3*sizeof(int), &resizedtype);   /* set extent to one sub-row */
MPI_Type_commit(&resizedtype);
Here we've created the same block type as before, but we've resized it; we haven't changed where the type "starts" (the 0), but we've changed where it "ends" (3 ints). We didn't mention this before, but MPI_Type_commit is required in order to use the type; you only need to commit the final type you actually use, though, not any intermediate steps. You use MPI_Type_free to free the type when you're done.
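For instance, once all the communication with resizedtype below is finished, the cleanup might look like this (just a sketch):

MPI_Type_free(&resizedtype);   /* the committed type we actually used */
MPI_Type_free(&type);          /* the intermediate subarray type */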
So now, finally, we can scatterv the blocks: the data manipulation above is a little complicated, but once it's done, the scatterv looks just like before:
int counts[4] = {1,1,1,1};   /* each proc gets one block */
int displs[4] = {0,1,6,7};   /* where each proc's block starts, in units of the block extent */

MPI_Scatterv(global, counts, displs, resizedtype,   /* proc i gets counts[i] blocks from displs[i] */
             local, 3*3, MPI_INT,                   /* I receive 3*3 ints into local */
             root, MPI_COMM_WORLD);
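Gathering the blocks back works the same way in reverse; here's a minimal sketch reusing the same counts, displs, and resizedtype (the full sample code below does essentially this):

MPI_Gatherv(local, 3*3, MPI_INT,       /* I send 3*3 ints from local */
            global, counts, displs,    /* root receives 1 block from each proc at displs[i] */
            resizedtype, root, MPI_COMM_WORLD);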
And now we're done, after a little tour of scatter, gather, and MPI derived types.
Following is some example code that shows both the gather and the scatter operation, with character arrays. Running the program:
$ mpirun -n 4 ./gathervarray
Global array is:
0123456789
3456789012
6789012345
9012345678
2345678901
5678901234
8901234567
1234567890
4567890123
7890123456
Local process on rank 0 is:
|01234|
|34567|
|67890|
|90123|
|23456|
Local process on rank 1 is:
|56789|
|89012|
|12345|
|45678|
|78901|
Local process on rank 2 is:
|56789|
|89012|
|12345|
|45678|
|78901|
Local process on rank 3 is:
|01234|
|34567|
|67890|
|90123|
|23456|
Processed grid:
AAAAABBBBB
AAAAABBBBB
AAAAABBBBB
AAAAABBBBB
AAAAABBBBB
CCCCCDDDDD
CCCCCDDDDD
CCCCCDDDDD
CCCCCDDDDD
CCCCCDDDDD
and the code follows.
#include <stdio.h>