Nested loops, parallelizing the inner loop, reusing threads

Disclaimer: the following is just a made-up example for understanding the problem quickly. If you are looking for a real-world analogue, think of dynamic programming.

Problem: We have an n * m matrix, and for every row we want to copy the elements of the previous row, as in the following code:

    for (i = 1; i < n; i++)
        for (j = 0; j < m; j++)
            x[i][j] = x[i-1][j];

Approach: The iterations of the outer loop have to be performed in order, so they are executed sequentially; only the inner loop can be parallelized. We would also like to minimize the overhead of creating and destroying threads, i.e. create the thread team only once, but this seems like an impossible task in OpenMP:

    #pragma omp parallel private(j)
    {
        for (i = 1; i < n; i++) {
            #pragma omp for schedule(dynamic)
            for (j = 0; j < m; j++)
                x[i][j] = x[i-1][j];
        }
    }

When we use the ordered option in the outer loop, the code is executed sequentially, so there is no performance gain. I am looking for a solution for the scenario above, even if it requires some workaround.
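For reference, here is a sketch of the ordered variant mentioned above (my reconstruction, assuming the same x, n and m as in the problem statement). Because the whole loop body sits inside the ordered region, row i cannot start before row i-1 has finished, so execution is effectively serial:

    #pragma omp parallel for ordered private(j)
    for (i = 1; i < n; i++) {
        /* ordered regions run in iteration order, one row at a time */
        #pragma omp ordered
        {
            for (j = 0; j < m; j++)
                x[i][j] = x[i-1][j];
        }
    }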

Below is my actual code. It is actually slower than the sequential version:

    /* load input */
    for (i = 1; i <= n; i++)
        scanf("%d %d", &in[i][W], &in[i][V]);

    /* init: row 0 of the DP table */
    for (i = 0; i <= wc; i++)
        a[0][i] = 0;

    /* compute: a 0/1 knapsack DP over two rolling rows; j is the
       current row, jn the previous one. j and jn are listed as
       private as well, so each thread computes its own copy instead
       of racing on shared variables. */
    #pragma omp parallel private(i, j, jn, w)
    {
        for (i = 1; i <= n; ++i)   /* about 1 000 000 iterations */
        {
            j = i % 2;
            jn = j == 1 ? 0 : 1;

            #pragma omp for
            for (w = 0; w <= in[i][W]; w++)        /* about 1 000 iterations */
                a[j][w] = a[jn][w];

            #pragma omp for
            for (w = in[i][W] + 1; w <= wc; w++)   /* about 350 000 iterations */
                a[j][w] = max(a[jn][w], in[i][V] + a[jn][w - in[i][W]]);
        }
    }

As for the measurement, I am using something like this:

    double t;
    t = omp_get_wtime();
    // ... the code being measured ...
    t = omp_get_wtime() - t;
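For reference, a minimal self-contained version of that measurement (the summation loop is a placeholder of my own, not the original code):

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        const long N = 100000000L;
        double s = 0.0;

        double t = omp_get_wtime();           /* wall-clock start */

        #pragma omp parallel for reduction(+:s)
        for (long i = 0; i < N; i++)
            s += i * 0.5;

        t = omp_get_wtime() - t;              /* elapsed wall-clock seconds */
        printf("sum = %f, elapsed = %f s\n", s, t);
        return 0;
    }

Compile with gcc -O3 -fopenmp. Note that omp_get_wtime() measures wall-clock time, which is what matters for parallel speedup; clock() would sum CPU time across all threads.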
1 answer

To summarize parallelizing this particular case with OpenMP: it is not worth it.

Why? The operations in the inner loops are simple. The code was compiled with -O3, so the max() call was inlined. The overhead of the implicit barrier at the end of each worksharing loop is most likely high enough to cancel out the performance gain, and the overall overhead is high enough to make the parallel code run even slower than the serial code. I also found that there is no real performance gain in this design:

    #pragma omp parallel private(i, j)
    {
        for (i = 1; i < n; i++) {
            #pragma omp for
            for (j = 0; j < m; j++)
                x[i][j] = x[i-1][j];
        }
    }

because its performance is similar to that of this version:

    for (i = 1; i < n; i++) {
        #pragma omp parallel for private(j)
        for (j = 0; j < m; j++)
            x[i][j] = x[i-1][j];
    }

thanks to the built-in thread reuse in GCC's libgomp, according to this article: http://bisqwit.iki.fi/story/howto/openmp/
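That reuse is easy to observe with a sketch like the following (my own check, not from the article): print the OS thread id in two consecutive parallel regions; with libgomp the same ids typically show up in both, i.e. the workers are parked between regions rather than recreated.

    #include <stdio.h>
    #include <pthread.h>
    #include <omp.h>

    int main(void)
    {
        /* first region: the worker threads get created here */
        #pragma omp parallel
        printf("region 1: omp thread %d -> pthread %lu\n",
               omp_get_thread_num(), (unsigned long)pthread_self());

        /* second region: with libgomp the same pthread ids usually
           reappear (printing pthread_t as an integer is not strictly
           portable, but is fine for this quick check on Linux) */
        #pragma omp parallel
        printf("region 2: omp thread %d -> pthread %lu\n",
               omp_get_thread_num(), (unsigned long)pthread_self());

        return 0;
    }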

Since the outer loop cannot be parallelized (short of the ordered option, which serializes it), there seems to be no way to significantly improve the performance of the program in question using OpenMP. If someone feels that I did something wrong and that it is possible, I will be happy to see and test the solution.
