I am trying to run CUDA + MPI code to simulate a lattice Boltzmann fluid and have run into difficulties with the MPI_Send and MPI_Recv functions. I have confirmed that my MPI is CUDA-aware with some simple device-buffer send/receive test code, so I can send and receive arrays between GPU device memory on different ranks without going through the CPU/host.
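For reference, the kind of device-buffer test I mean was along these lines (a minimal sketch rather than my actual test code, assuming a CUDA-aware Open MPI build and at least one GPU visible to each rank):

// Minimal sketch of a device-to-device MPI check (not my actual test code).
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 16;
    double *d_buf;                                  // device buffer handed straight to MPI
    cudaMalloc((void**)&d_buf, N*sizeof(double));

    if(rank==0)
    {
        cudaMemset(d_buf, 0, N*sizeof(double));     // fill with something trivial
        MPI_Send(d_buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    }
    else if(rank==1)
    {
        MPI_Status status;
        MPI_Recv(d_buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received the device buffer OK\n");
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}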
My code is for a three-dimensional lattice that is split up along the z direction across different nodes, with halos passed between the nodes so that the fluid can flow across these divisions. The halos are on the GPUs. Below is simplified code that compiles and gives the same error as my main code. Here, a GPU halo on the rank 0 node is MPI_Send()'d to the rank 1 node, which MPI_Recv()s it. My problem seems very simple at the moment: I cannot get the MPI_Send and MPI_Recv calls to work! The code never reaches the "//CODE DOES NOT REACH HERE." lines, which leads me to conclude that the MPI_etc() calls are not working.
My code basically looks like this (most of the code has been removed, but what remains is still sufficient to compile and give the same error):
#include <mpi.h>
#include <iostream>   //Needed for cout/endl below.
using namespace std;

//In declarations:
const int DIM_X = 30;
const int DIM_Y = 50;
const int Q = 19;
const int NumberDevices = 1;
const int NumberNodes = 2;

__host__ int SendRecvID(int UpDown, int rank, int Cookie)
{
    int a = (UpDown*NumberNodes*NumberDevices) + (rank*NumberDevices) + Cookie;
    return a;
} //Use as downwards memTrnsfr==0, upwards==1

int main(int argc, char *argv[])
{
    //MPI functions (copied from online tutorial somewhere)
    int numprocessors, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocessors);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    /* ...code for splitting other arrays removed... */

    size_t size_Halo_z = Q*DIM_X*DIM_Y*sizeof(double); //Size variable used in cudaMalloc and cudaMemcpy.
    int NumDataPts_f_halo = DIM_X*DIM_Y*Q;             //Number of data points used in MPI_Send/Recv calls.
    MPI_Status status;                                 //Used in MPI_Recv.

    //Creating arrays for GPU data below, using arrays of pointers:
    double *Device_HaloUp_Take[NumberDevices];   //Arrays on the GPU which will be the Halos.
    double *Device_HaloDown_Take[NumberDevices]; //Arrays on the GPU which will be the Halos.
    double *Device_HaloUp_Give[NumberDevices];   //Arrays on the GPU which will be the Halos.
    double *Device_HaloDown_Give[NumberDevices]; //Arrays on the GPU which will be the Halos.

    for(int dev_i=0; dev_i<NumberDevices; dev_i++) //Initialising the GPU arrays:
    {
        cudaSetDevice(dev_i);
        cudaMalloc( (void**)&Device_HaloUp_Take[dev_i],   size_Halo_z);
        cudaMalloc( (void**)&Device_HaloDown_Take[dev_i], size_Halo_z);
        cudaMalloc( (void**)&Device_HaloUp_Give[dev_i],   size_Halo_z);
        cudaMalloc( (void**)&Device_HaloDown_Give[dev_i], size_Halo_z);
    }

    int Cookie=0;            //Counter used to count the devices below.
    for(int n=1; n<=100; n++) //Each loop iteration is one timestep.
    {
        /* Run computation on GPUs */

        cudaThreadSynchronize();

        if(rank==0) //Rank 0 node makes the first MPI_Send().
        {
            for(Cookie=0; Cookie<NumberDevices; Cookie++)
            {
                if(NumberDevices==1) //For single GPU codes (which for now is what I am stuck on):
                {
                    cout << endl << "Testing X " << rank << endl;
                    MPI_Send(Device_HaloUp_Take[Cookie], NumDataPts_f_halo, MPI_DOUBLE, (rank+1), SendRecvID(1,rank,Cookie), MPI_COMM_WORLD);
                    cout << endl << "Testing Y " << rank << endl; //CODE DOES NOT REACH HERE.
                    MPI_Recv(Device_HaloUp_Give[Cookie], NumDataPts_f_halo, MPI_DOUBLE, (rank+1), SendRecvID(0,rank+1,0), MPI_COMM_WORLD, &status);
                    /*etc*/
                }
            }
        }
        else if(rank==(NumberNodes-1))
        {
            for(Cookie=0; Cookie<NumberDevices; Cookie++)
            {
                if(NumberDevices==1)
                {
                    cout << endl << "Testing A " << rank << endl;
                    MPI_Recv(Device_HaloDown_Give[Cookie], NumDataPts_f_halo, MPI_DOUBLE, (rank-1), SendRecvID(1,rank-1,NumberDevices-1), MPI_COMM_WORLD, &status);
                    cout << endl << "Testing B " << rank << endl; //CODE DOES NOT REACH HERE.
                    MPI_Send(Device_HaloUp_Take[Cookie], NumDataPts_f_halo, MPI_DOUBLE, 0, SendRecvID(1,rank,Cookie), MPI_COMM_WORLD);
                    /*etc*/
                }
            }
        }
    }

    /* Then some code to carry out rest of lattice boltzmann method. */

    MPI_Finalize();
    return 0;
}
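(Not shown above, but for completeness: a minimal CUDA error-check helper I could wrap the cudaMalloc/cudaThreadSynchronize calls in would be something like the following; the macro name is mine, not from my code.)

// Sketch of an error-check wrapper for the CUDA runtime calls (macro name is made up).
#include <cstdio>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if(err_ != cudaSuccess)                                       \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
    } while(0)

// Usage, e.g.: CUDA_CHECK( cudaMalloc((void**)&Device_HaloUp_Take[dev_i], size_Halo_z) );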
Since I have 2 nodes (the NumberNodes==2 variable in the code), I have one rank==0 and the other rank==1==NumberNodes-1. The rank 0 code goes into the if(rank==0) branch, where it outputs "Testing X 0" but never outputs "Testing Y 0", because it crashes beforehand inside the MPI_Send() call. The Cookie variable at this point is 0, since there is only one GPU/device, so the SendRecvID() function is called with (1,0,0). The first parameter of MPI_Send() is a pointer, since Device_Halo_etc is an array of pointers, and the destination rank is (rank+1) = 1.
Similarly, the rank 1 code goes into the if(rank==NumberNodes-1) branch, where it outputs "Testing A 1" but not "Testing B 1", as the code stops before completing the MPI_Recv() call. As far as I can tell, the MPI_Recv() parameters are correct: the source (rank-1) = 0 is right, the number of data points sent and received is the same, and the tag is the same.
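Just to spell out the tag arithmetic for that first transfer (a standalone worked check, not code from my program; SendRecvID() is copied from the code above): with NumberNodes==2 and NumberDevices==1, both sides compute the same tag.

// Standalone check that the send and receive tags of the first transfer match.
#include <cstdio>

const int NumberDevices = 1;
const int NumberNodes = 2;

int SendRecvID(int UpDown, int rank, int Cookie)
{ return (UpDown*NumberNodes*NumberDevices) + (rank*NumberDevices) + Cookie; }

int main()
{
    printf("rank 0 send tag: %d\n", SendRecvID(1, 0, 0));                 // 1*2*1 + 0*1 + 0 = 2
    printf("rank 1 recv tag: %d\n", SendRecvID(1, 1-1, NumberDevices-1)); // also 2
    return 0;
}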
What I have tried so far: making sure the send and the receive use the same tag (even though SendRecvID() is given (1,0,0) in both calls anyway) by hard-coding it to 999 or so, but that made no difference. I also tried varying the Device_Halo_etc arguments (adding/removing the & and the [Cookie] index) in both MPI calls, in case I had messed up the pointers, but again no difference. The only way I have got anything working so far is to change the Device_Halo_etc arguments in the MPI_Send/Recv() calls to arbitrary arrays on the host, to check that they transfer; this lets it get past the first MPI call (and, of course, get stuck on the next one), but even that only works when I change the count in Send/Recv to 1 (instead of NumDataPts_f_halo == 14250). And, of course, passing host arrays around is not what I am after.
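For clarity, the fully host-staged version of the transfer (which is what I am trying to avoid) would look roughly like this inside the rank 0 branch; this is a sketch, and Host_HaloUp_Take is a hypothetical staging buffer, not something in my code:

// Sketch of a host-staged send (what I want to avoid). Host_HaloUp_Take is hypothetical;
// the other names are from the code above. Needs <cstdlib> for malloc.
double *Host_HaloUp_Take = (double*)malloc(size_Halo_z);   // host staging buffer

cudaMemcpy(Host_HaloUp_Take, Device_HaloUp_Take[Cookie], size_Halo_z, cudaMemcpyDeviceToHost);
MPI_Send(Host_HaloUp_Take, NumDataPts_f_halo, MPI_DOUBLE, (rank+1), SendRecvID(1,rank,Cookie), MPI_COMM_WORLD);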
I compile the code with nvcc plus some extra include/link flags (I am not too sure how these work; I copied the approach from somewhere online, and given that the simpler device-to-device MPI test works, I do not see a problem with them):
nvcc TestingMPI.cu -o run_Test -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi -L/usr/lib/openmpi/lib -lmpi_cxx -lmpi -ldl
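(I believe an equivalent way to pick up the Open MPI include and library paths is to let nvcc use the Open MPI C++ wrapper as its host compiler, e.g.:

nvcc TestingMPI.cu -o run_Test -ccbin mpicxx

though I have been using the explicit flags above.)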
and run it with:
mpirun -np 2 run_Test
This results in an error that usually looks like this:
Testing A 1

Testing X 0
[Anastasia:16671] *** Process received signal ***
[Anastasia:16671] Signal: Segmentation fault (11)
[Anastasia:16671] Signal code: Invalid permissions (2)
[Anastasia:16671] Failing at address: 0x700140000
[Anastasia:16671] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7f20327774a0]
[Anastasia:16671] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x147fe5) [0x7f2032888fe5]
[Anastasia:16671] [ 2] /usr/lib/libmpi.so.1(opal_convertor_pack+0x14d) [0x7f20331303bd]
[Anastasia:16671] [ 3] /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so(+0x20c8) [0x7f202cad20c8]
[Anastasia:16671] [ 4] /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x100f0) [0x7f202d9430f0]
[Anastasia:16671] [ 5] /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x772b) [0x7f202d93a72b]
[Anastasia:16671] [ 6] /usr/lib/libmpi.so.1(MPI_Send+0x17b) [0x7f20330bc57b]
[Anastasia:16671] [ 7] run_Test() [0x400ff7]
[Anastasia:16671] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f203276276d]
[Anastasia:16671] [ 9] run_Test() [0x400ce9]
[Anastasia:16671] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 16671 on node Anastasia exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
I run the code on my laptop (Anastasia), a Lenovo Y500 with two GT650m NVIDIA graphics cards, running Linux Ubuntu 12.04 LTS, in case that matters. nvcc --version gives "release 5.0, V0.2.1221" and mpirun --version gives "mpirun (Open MPI) 1.5.4".