OpenACC red-black Gauss-Seidel slower than CPU

I added OpenACC directives to my red-black Gauss-Seidel solver for the Laplace equation (a simple heated-plate problem), but the GPU-accelerated code is no faster than the CPU, even for large problems.

I also wrote a CUDA version, and it is much faster than both (for 512x512, around 2 seconds compared to 25 for both the CPU and OpenACC versions).

Can anyone think of a reason for this discrepancy? I understand that CUDA offers the greatest potential speed, but OpenACC ought to beat the CPU for larger problems (like the Jacobi solver for the same problem demonstrated here).

Here is the relevant code (full working source here):

#pragma acc data copyin(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size]) \
                 copy(temp_red[0:size_temp], temp_black[0:size_temp])
// red-black Gauss-Seidel with SOR iteration loop
for (iter = 1; iter <= it_max; ++iter) {

    Real norm_L2 = 0.0;

    // update red cells
    #pragma omp parallel for shared(aP, aW, aE, aS, aN, temp_black, temp_red) \
                             reduction(+:norm_L2)
    #pragma acc kernels present(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], \
                                b[0:size], temp_red[0:size_temp], temp_black[0:size_temp])
    #pragma acc loop independent gang vector(4)
    for (int col = 1; col < NUM + 1; ++col) {
        #pragma acc loop independent gang vector(64)
        for (int row = 1; row < (NUM / 2) + 1; ++row) {

            int ind_red = col * ((NUM / 2) + 2) + row;           // local (red) index
            int ind = 2 * row - (col % 2) - 1 + NUM * (col - 1); // global index

            #pragma acc cache(aP[ind], b[ind], aW[ind], aE[ind], aS[ind], aN[ind])

            Real res = b[ind] + (aW[ind] * temp_black[row + (col - 1) * ((NUM / 2) + 2)]
                              +  aE[ind] * temp_black[row + (col + 1) * ((NUM / 2) + 2)]
                              +  aS[ind] * temp_black[row - (col % 2) + col * ((NUM / 2) + 2)]
                              +  aN[ind] * temp_black[row + ((col + 1) % 2) + col * ((NUM / 2) + 2)]);

            Real temp_old = temp_red[ind_red];
            temp_red[ind_red] = temp_old * (1.0 - omega) + omega * (res / aP[ind]);

            // calculate residual
            res = temp_red[ind_red] - temp_old;
            norm_L2 += (res * res);

        } // end for row
    } // end for col

    // update black cells
    #pragma omp parallel for shared(aP, aW, aE, aS, aN, temp_black, temp_red) \
                             reduction(+:norm_L2)
    #pragma acc kernels present(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], \
                                b[0:size], temp_red[0:size_temp], temp_black[0:size_temp])
    #pragma acc loop independent gang vector(4)
    for (int col = 1; col < NUM + 1; ++col) {
        #pragma acc loop independent gang vector(64)
        for (int row = 1; row < (NUM / 2) + 1; ++row) {

            int ind_black = col * ((NUM / 2) + 2) + row;               // local (black) index
            int ind = 2 * row - ((col + 1) % 2) - 1 + NUM * (col - 1); // global index

            #pragma acc cache(aP[ind], b[ind], aW[ind], aE[ind], aS[ind], aN[ind])

            Real res = b[ind] + (aW[ind] * temp_red[row + (col - 1) * ((NUM / 2) + 2)]
                              +  aE[ind] * temp_red[row + (col + 1) * ((NUM / 2) + 2)]
                              +  aS[ind] * temp_red[row - ((col + 1) % 2) + col * ((NUM / 2) + 2)]
                              +  aN[ind] * temp_red[row + (col % 2) + col * ((NUM / 2) + 2)]);

            Real temp_old = temp_black[ind_black];
            temp_black[ind_black] = temp_old * (1.0 - omega) + omega * (res / aP[ind]);

            // calculate residual
            res = temp_black[ind_black] - temp_old;
            norm_L2 += (res * res);

        } // end for row
    } // end for col

    // calculate residual
    norm_L2 = sqrt(norm_L2 / ((Real)size));

    if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, norm_L2);

    // if tolerance has been reached, end SOR iterations
    if (norm_L2 < tol) {
        break;
    }
}
2 answers

Well, I found a partial solution that reduces the time significantly for smaller problems.

If I insert the lines:

acc_init(acc_device_nvidia);
acc_set_device_num(0, acc_device_nvidia);

before I start my timer, in order to activate and set up the GPU, the time for the 512x512 problem drops to 9.8 seconds, and down to 42 for 1024x1024. Increasing the problem size further shows how fast even OpenACC can be compared to running on four CPU cores.

With this change, the OpenACC code is on the order of 2x slower than the CUDA code, with the gap closing to only slightly slower (~1.2x) as the problem size grows larger and larger.
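A minimal sketch of that placement, assuming a clock()-based timer (only the two openacc.h calls are from the actual code; everything else here is a stand-in):

#include <openacc.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    // Initialize the OpenACC runtime and select the first NVIDIA GPU
    // *before* timing starts, so the one-time device setup cost is not
    // charged to the solver loop itself.
    acc_init(acc_device_nvidia);
    acc_set_device_num(0, acc_device_nvidia);

    clock_t start = clock();

    // ... red-black Gauss-Seidel SOR iterations go here ...

    clock_t end = clock();
    printf("Total time: %f s\n", (double)(end - start) / CLOCKS_PER_SEC);

    return 0;
}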


I downloaded the full code, compiled it, and ran it. The execution never stopped, and the line

if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, norm_L2);

kept printing:

  100, nan
  200, nan
  ....

I changed all variables of type Real to type float (presumably a one-line typedef change; see the sketch after these results), and the result was:

  100, 0.000654
  200, 0.000370
  ..., ....
 8800, 0.000002
 8900, 0.000002
 9000, 0.000001
 9100, 0.000001
 9200, 0.000001
 9300, 0.000001
 9400, 0.000001
 9500, 0.000001
 9600, 0.000001
 9700, 0.000001

CPU
Iterations: 9796
Total time: 5.594017 s

With NUM = 1024, the result was:

Iterations: 27271
Total time: 25.949905 s
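For reference, the Real-to-float change presumably comes down to flipping a precision typedef like the one below; the DOUBLE macro name is my assumption, not necessarily what the original source uses:

// Hypothetical precision switch (macro name assumed, not taken from
// the original source): building Real as double produced the "nan"
// residuals above on this setup, while building it as float converged.
#ifdef DOUBLE
typedef double Real;
#else
typedef float Real;
#endif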

