Simple MPI_Send and Recv give a segmentation fault (11) and invalid permissions (2) error with CUDA

I am trying to run CUDA MPI code for a lattice Boltzmann simulation and have run into difficulties with the MPI_Send and MPI_Recv functions. I had confirmed that I have CUDA-aware MPI using some simple device-buffer send/receive test code, so that I can send and receive arrays between GPU device memories without going through the CPU/host.
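For reference, the kind of test I mean is roughly the following (a minimal sketch from memory rather than my exact test code; it assumes exactly two ranks and a CUDA-aware MPI build):

 // Minimal sketch of a device-buffer MPI test (assumes 2 ranks and a CUDA-aware MPI).
 #include <mpi.h>
 #include <iostream>

 int main(int argc, char *argv[])
 {
     MPI_Init(&argc, &argv);
     int rank;
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);

     const int N = 16;
     double *d_buf;                                  // buffer in GPU device memory
     cudaMalloc((void**)&d_buf, N*sizeof(double));

     if(rank==0)
     {
         double h_init[N];
         for(int i=0; i<N; i++) h_init[i] = i;
         cudaMemcpy(d_buf, h_init, N*sizeof(double), cudaMemcpyHostToDevice);
         // Pass the device pointer straight to MPI -- only legal with a CUDA-aware MPI.
         MPI_Send(d_buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
     }
     else if(rank==1)
     {
         MPI_Recv(d_buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
         double h_out[N];
         cudaMemcpy(h_out, d_buf, N*sizeof(double), cudaMemcpyDeviceToHost);
         std::cout << "rank 1 received, last element = " << h_out[N-1] << std::endl;
     }

     cudaFree(d_buf);
     MPI_Finalize();
     return 0;
 }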

My code is for a three-dimensional lattice that is split along the z direction between different nodes, with halos passed between the nodes so that fluid can flow across these divisions. The halos live on the GPUs. Below is a simplified code that compiles and gives the same error as my main code. Here the GPU halo on the rank 0 node is MPI_Send()-ed to the rank 1 node, which MPI_Recv()s it. My problem seems very simple at the moment: I cannot get the MPI_Send and MPI_Recv calls to work! The code never reaches the "//CODE DOES NOT REACH HERE." lines, which leads me to conclude that the MPI_etc() calls are not working.

My code basically looks like this (most of the code has been removed, but what remains is still sufficient to compile and produce the same error):

 #include <mpi.h>
 #include <iostream>   // added so that cout/endl are declared
 using namespace std;

 //In declarations:
 const int DIM_X = 30;
 const int DIM_Y = 50;
 const int Q = 19;
 const int NumberDevices = 1;
 const int NumberNodes = 2;

 __host__ int SendRecvID(int UpDown, int rank, int Cookie)
 {
     int a = (UpDown*NumberNodes*NumberDevices) + (rank*NumberDevices) + Cookie;
     return a;
 } //Use as downwards memTrnsfr==0, upwards==1

 int main(int argc, char *argv[])
 {
     //MPI functions (copied from online tutorial somewhere)
     int numprocessors, rank, namelen;
     char processor_name[MPI_MAX_PROCESSOR_NAME];
     MPI_Init(&argc, &argv);
     MPI_Comm_size(MPI_COMM_WORLD, &numprocessors);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Get_processor_name(processor_name, &namelen);

     /* ...code for splitting other arrays removed... */

     size_t size_Halo_z = Q*DIM_X*DIM_Y*sizeof(double); //Size variable used in cudaMalloc and cudaMemcpy.
     int NumDataPts_f_halo = DIM_X*DIM_Y*Q;             //Number of data points used in MPI_Send/Recv calls.
     MPI_Status status;                                 //Used in MPI_Recv.

     //Creating arrays for GPU data below, using arrays of pointers:
     double *Device_HaloUp_Take[NumberDevices];   //Arrays on the GPU which will be the Halos.
     double *Device_HaloDown_Take[NumberDevices]; //Arrays on the GPU which will be the Halos.
     double *Device_HaloUp_Give[NumberDevices];   //Arrays on the GPU which will be the Halos.
     double *Device_HaloDown_Give[NumberDevices]; //Arrays on the GPU which will be the Halos.

     for(int dev_i=0; dev_i<NumberDevices; dev_i++) //Initialising the GPU arrays:
     {
         cudaSetDevice(dev_i);
         cudaMalloc( (void**)&Device_HaloUp_Take[dev_i],   size_Halo_z);
         cudaMalloc( (void**)&Device_HaloDown_Take[dev_i], size_Halo_z);
         cudaMalloc( (void**)&Device_HaloUp_Give[dev_i],   size_Halo_z);
         cudaMalloc( (void**)&Device_HaloDown_Give[dev_i], size_Halo_z);
     }

     int Cookie=0;           //Counter used to count the devices below.
     for(int n=1;n<=100;n++) //Each loop iteration is one timestep.
     {
         /* Run computation on GPUs */

         cudaThreadSynchronize();

         if(rank==0) //Rank 0 node makes the first MPI_Send().
         {
             for(Cookie=0; Cookie<NumberDevices; Cookie++)
             {
                 if(NumberDevices==1) //For single GPU codes (which for now is what I am stuck on):
                 {
                     cout << endl << "Testing X " << rank << endl;
                     MPI_Send(Device_HaloUp_Take[Cookie], NumDataPts_f_halo, MPI_DOUBLE, (rank+1),
                              SendRecvID(1,rank,Cookie), MPI_COMM_WORLD);
                     cout << endl << "Testing Y " << rank << endl; //CODE DOES NOT REACH HERE.
                     MPI_Recv(Device_HaloUp_Give[Cookie], NumDataPts_f_halo, MPI_DOUBLE, (rank+1),
                              SendRecvID(0,rank+1,0), MPI_COMM_WORLD, &status);
                     /*etc */
                 }
             }
         }
         else if(rank==(NumberNodes-1))
         {
             for(Cookie=0; Cookie<NumberDevices; Cookie++)
             {
                 if(NumberDevices==1)
                 {
                     cout << endl << "Testing A " << rank << endl;
                     MPI_Recv(Device_HaloDown_Give[Cookie], NumDataPts_f_halo, MPI_DOUBLE, (rank-1),
                              SendRecvID(1,rank-1,NumberDevices-1), MPI_COMM_WORLD, &status);
                     cout << endl << "Testing B " << rank << endl; //CODE DOES NOT REACH HERE.
                     MPI_Send(Device_HaloUp_Take[Cookie], NumDataPts_f_halo, MPI_DOUBLE, 0,
                              SendRecvID(1,rank,Cookie), MPI_COMM_WORLD);
                     /*etc*/
                 }
             }
         }
     }

     /* Then some code to carry out rest of lattice boltzmann method. */

     MPI_Finalize();
 }

Since I have 2 nodes (the NumberNodes==2 variable in the code), one is rank==0 and the other is rank==1==NumberNodes-1. The rank 0 code enters the if(rank==0) branch, where it outputs "Testing X 0" but never outputs "Testing Y 0", because it crashes beforehand inside the MPI_Send() call. The Cookie variable at this point is 0, since there is only one GPU/device, so SendRecvID() is evaluated for "(1,0,0)". The first parameter of MPI_Send is a pointer, since Device_Halo_etc is an array of pointers, and the destination the data is sent to is (rank+1)=1.

Similarly, the rank 1 code enters the if(rank==NumberNodes-1) branch, where it outputs "Testing A 1" but not "Testing B 1", because the code crashes before the MPI_Recv call completes. As far as I can tell, the MPI_Recv parameters are correct: (rank-1)=0 is the correct source, the number of data points sent and received matches, and the tag is the same.

What I have tried so far: making sure both sides use the same tag (they already do, since SendRecvID() evaluates to the same value for (1,0,0) in both cases) by hardcoding it to 999 or so, but that made no difference. I also tried changing how the Device_Halo_etc argument is passed in both MPI calls, just in case I had messed up the pointers, but again no difference. The only way I have got it working so far is to swap the Device_Halo_etc parameters in the MPI_Send/Recv() calls for arbitrary arrays on the host, to check whether they get transferred; that lets the program get past the first MPI call (and then, of course, get stuck on the next one), but even this only works if I change the count in Send/Recv to 1 (instead of NumDataPts_f_halo==14250). And of course, moving host arrays around is not what I am interested in.

I compile the code with nvcc and some additional linker flags (I am not too sure how these work; I copied the method from somewhere on the Internet, but given that a simpler device-to-device MPI send/receive test worked this way, I don't see a problem with it):

 nvcc TestingMPI.cu -o run_Test -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi -L/usr/lib/openmpi/lib -lmpi_cxx -lmpi -ldl 

and run it with:

 mpirun -np 2 run_Test 

This results in an error that usually looks like this:

 Testing A 1

 Testing X 0
 [Anastasia:16671] *** Process received signal ***
 [Anastasia:16671] Signal: Segmentation fault (11)
 [Anastasia:16671] Signal code: Invalid permissions (2)
 [Anastasia:16671] Failing at address: 0x700140000
 [Anastasia:16671] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7f20327774a0]
 [Anastasia:16671] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x147fe5) [0x7f2032888fe5]
 [Anastasia:16671] [ 2] /usr/lib/libmpi.so.1(opal_convertor_pack+0x14d) [0x7f20331303bd]
 [Anastasia:16671] [ 3] /usr/lib/openmpi/lib/openmpi/mca_btl_sm.so(+0x20c8) [0x7f202cad20c8]
 [Anastasia:16671] [ 4] /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x100f0) [0x7f202d9430f0]
 [Anastasia:16671] [ 5] /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x772b) [0x7f202d93a72b]
 [Anastasia:16671] [ 6] /usr/lib/libmpi.so.1(MPI_Send+0x17b) [0x7f20330bc57b]
 [Anastasia:16671] [ 7] run_Test() [0x400ff7]
 [Anastasia:16671] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f203276276d]
 [Anastasia:16671] [ 9] run_Test() [0x400ce9]
 [Anastasia:16671] *** End of error message ***
 --------------------------------------------------------------------------
 mpirun noticed that process rank 0 with PID 16671 on node Anastasia exited on signal 11 (Segmentation fault).
 --------------------------------------------------------------------------

I run the code on my laptop (Anastasia), a Lenovo Y500 with two NVIDIA GT650m graphics cards, running Ubuntu Linux 12.04 LTS, in case that matters. nvcc --version gives "release 5.0, V0.2.1221" and mpirun --version gives "mpirun (Open MPI) 1.5.4".

1 answer

Thanks to Anycorn for the help with the code!

In case anyone with a similar problem is interested: my mistake here was in how I had determined whether I could access CUDA device memory from MPI calls. In fact I could not MPI_Send/Recv() the GPU memory directly, which is why I got the "Invalid permissions" errors. If you have a similar problem, I suggest you test a simple piece of code that sends device memory using the MPI_Send/Recv() functions, as suggested by Anycorn in the comments on the question above.
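If that test shows your MPI cannot handle device pointers (as mine could not), the usual workaround is to stage the halo through a host buffer on either side of the MPI call. A rough sketch of how the rank 0 send in the loop above could be adapted (the host buffer name here is illustrative, not from my code):

 // Illustrative host-staging workaround for a non-CUDA-aware MPI.
 double *Host_HaloUp_Take = (double*)malloc(size_Halo_z);   // host staging buffer (name is mine)

 // Copy the halo off the GPU, then hand MPI a host pointer it can actually read:
 cudaMemcpy(Host_HaloUp_Take, Device_HaloUp_Take[Cookie], size_Halo_z, cudaMemcpyDeviceToHost);
 MPI_Send(Host_HaloUp_Take, NumDataPts_f_halo, MPI_DOUBLE, (rank+1), SendRecvID(1,rank,Cookie), MPI_COMM_WORLD);

 // The receiving rank does the reverse: MPI_Recv into a host buffer,
 // then cudaMemcpy it onto the device halo array with cudaMemcpyHostToDevice.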

Also make sure you do not accidentally send the address of the pointer variable that holds the device address, instead of the device pointer itself (MPI_Send/Recv() take a pointer to the data buffer as their first argument). I had been sending that pointer variable between the nodes, and since the pointer itself lives in host/CPU memory, the calls appeared to work fine. The result was that node 1 handed node 0 a pointer to a pointer; when I printed the data I thought I had gathered from node 1, I was actually reading data through the newly received pointer, which pointed to an array that, through sloppy coding, I had initialised identically on both nodes (an if(node==1) guard around the initialisation line would have saved me there). So I got the "correct" result and thought everything was fine.
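In other words, the mistake looks roughly like this (a contrived sketch, not my actual lines):

 double *d_halo;                                 // the pointer variable itself lives in host memory
 cudaMalloc((void**)&d_halo, size_Halo_z);

 // What I was effectively doing: sending the pointer variable. The buffer handed
 // to MPI is host memory, so the call "succeeds", but only an address travels
 // across, not the halo data.
 MPI_Send(&d_halo, sizeof(double*), MPI_BYTE, 1, 999, MPI_COMM_WORLD);

 // What was intended: sending the buffer the pointer refers to. This is device
 // memory, so it needs either a CUDA-aware MPI or staging through a host buffer.
 MPI_Send(d_halo, NumDataPts_f_halo, MPI_DOUBLE, 1, 999, MPI_COMM_WORLD);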

Thanks again to Anycorn!



