OpenACC red-black Gauss-Seidel slower than CPU

I added OpenACC directives to my red-black Gauss-Seidel solver for the Laplace equation (a simple heated-plate problem), but the GPU-accelerated code is no faster than the CPU, even for large problems.

I also wrote a CUDA version, and it is much faster than both (for 512x512, around 2 seconds compared to 25 for both the CPU and OpenACC versions).

Can anyone think of a reason for this discrepancy? I understand that CUDA offers the greatest potential speed, but OpenACC ought to beat the CPU for larger problems (like the Jacobi solver for the same problem demonstrated here).

Here is the relevant code (full working source here):

#pragma acc data copyin(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size]) \
                 copy(temp_red[0:size_temp], temp_black[0:size_temp])
// red-black Gauss-Seidel with SOR iteration loop
for (iter = 1; iter <= it_max; ++iter) {

    Real norm_L2 = 0.0;

    // update red cells
    #pragma omp parallel for shared(aP, aW, aE, aS, aN, temp_black, temp_red) \
                             reduction(+:norm_L2)
    #pragma acc kernels present(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], \
                                b[0:size], temp_red[0:size_temp], temp_black[0:size_temp])
    #pragma acc loop independent gang vector(4)
    for (int col = 1; col < NUM + 1; ++col) {
        #pragma acc loop independent gang vector(64)
        for (int row = 1; row < (NUM / 2) + 1; ++row) {

            int ind_red = col * ((NUM / 2) + 2) + row;           // local (red) index
            int ind = 2 * row - (col % 2) - 1 + NUM * (col - 1); // global index

            #pragma acc cache(aP[ind], b[ind], aW[ind], aE[ind], aS[ind], aN[ind])

            Real res = b[ind] + (aW[ind] * temp_black[row + (col - 1) * ((NUM / 2) + 2)]
                              +  aE[ind] * temp_black[row + (col + 1) * ((NUM / 2) + 2)]
                              +  aS[ind] * temp_black[row - (col % 2) + col * ((NUM / 2) + 2)]
                              +  aN[ind] * temp_black[row + ((col + 1) % 2) + col * ((NUM / 2) + 2)]);

            Real temp_old = temp_red[ind_red];
            temp_red[ind_red] = temp_old * (1.0 - omega) + omega * (res / aP[ind]);

            // calculate residual
            res = temp_red[ind_red] - temp_old;
            norm_L2 += (res * res);

        } // end for row
    } // end for col

    // update black cells
    #pragma omp parallel for shared(aP, aW, aE, aS, aN, temp_black, temp_red) \
                             reduction(+:norm_L2)
    #pragma acc kernels present(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], \
                                b[0:size], temp_red[0:size_temp], temp_black[0:size_temp])
    #pragma acc loop independent gang vector(4)
    for (int col = 1; col < NUM + 1; ++col) {
        #pragma acc loop independent gang vector(64)
        for (int row = 1; row < (NUM / 2) + 1; ++row) {

            int ind_black = col * ((NUM / 2) + 2) + row;               // local (black) index
            int ind = 2 * row - ((col + 1) % 2) - 1 + NUM * (col - 1); // global index

            #pragma acc cache(aP[ind], b[ind], aW[ind], aE[ind], aS[ind], aN[ind])

            Real res = b[ind] + (aW[ind] * temp_red[row + (col - 1) * ((NUM / 2) + 2)]
                              +  aE[ind] * temp_red[row + (col + 1) * ((NUM / 2) + 2)]
                              +  aS[ind] * temp_red[row - ((col + 1) % 2) + col * ((NUM / 2) + 2)]
                              +  aN[ind] * temp_red[row + (col % 2) + col * ((NUM / 2) + 2)]);

            Real temp_old = temp_black[ind_black];
            temp_black[ind_black] = temp_old * (1.0 - omega) + omega * (res / aP[ind]);

            // calculate residual
            res = temp_black[ind_black] - temp_old;
            norm_L2 += (res * res);

        } // end for row
    } // end for col

    // calculate residual
    norm_L2 = sqrt(norm_L2 / ((Real)size));

    if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, norm_L2);

    // if tolerance has been reached, end SOR iterations
    if (norm_L2 < tol) {
        break;
    }
}
2 answers

Well, I found a partial solution that reduces the time significantly for smaller problems.

If I insert the lines:

acc_init(acc_device_nvidia);
acc_set_device_num(0, acc_device_nvidia);

before I start my timer, in order to activate and set up the GPU, the time for the 512x512 problem drops to 9.8 seconds, and down to 42 for 1024x1024. Increasing the problem size further shows how fast even OpenACC can be compared to running on four CPU cores.

With this change, the OpenACC code is on the order of 2x slower than the CUDA code, with the gap closing to only slightly slower (~1.2x) as the problem size grows larger and larger.
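A minimal sketch of that placement, assuming a clock()-based timer (only the two openacc.h calls are from the actual code; everything else here is a stand-in):

#include <openacc.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    // Initialize the OpenACC runtime and select the first NVIDIA GPU
    // *before* timing starts, so the one-time device setup cost is not
    // charged to the solver loop itself.
    acc_init(acc_device_nvidia);
    acc_set_device_num(0, acc_device_nvidia);

    clock_t start = clock();

    // ... red-black Gauss-Seidel SOR iterations go here ...

    clock_t end = clock();
    printf("Total time: %f s\n", (double)(end - start) / CLOCKS_PER_SEC);

    return 0;
}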


I downloaded the full code, compiled it, and ran it. The execution never stopped, and the line

if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, norm_L2);

kept printing:

  100, nan
  200, nan
  ....

I changed all variables of type Real to type float (presumably a one-line typedef change; see the sketch after these results), and the result was:

  100, 0.000654
  200, 0.000370
  ..., ....
 8800, 0.000002
 8900, 0.000002
 9000, 0.000001
 9100, 0.000001
 9200, 0.000001
 9300, 0.000001
 9400, 0.000001
 9500, 0.000001
 9600, 0.000001
 9700, 0.000001

CPU
Iterations: 9796
Total time: 5.594017 s

With NUM = 1024, the result was:

Iterations: 27271
Total time: 25.949905 s
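For reference, the Real-to-float change presumably comes down to flipping a precision typedef like the one below; the DOUBLE macro name is my assumption, not necessarily what the original source uses:

// Hypothetical precision switch (macro name assumed, not taken from
// the original source): building Real as double produced the "nan"
// residuals above on this setup, while building it as float converged.
#ifdef DOUBLE
typedef double Real;
#else
typedef float Real;
#endif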

