Controlling the number of threads in parallel loops and reducing overhead

In my Fortran 95 code, I have a series of nested DO loops that take a considerable amount of time to complete, so I want to parallelize them with OpenMP (compiling with gfortran -fopenmp).

There is one main DO loop running 1000 times.

Inside this loop, there is a sub-DO loop that runs 100 times.

Several other DO loops are nested inside it; their iteration count grows with each iteration of the main DO loop (1 iteration on the first pass, up to 1000 on the last).

Example:

    DO a = 1, 1000
      DO b = 1, 100
        DO c = 1, d
          ! some calculations
        END DO
        DO c = 1, d
          ! some calculations
        END DO
        DO c = 1, d
          ! some calculations
        END DO
      END DO
      d = d + 1
    END DO

Some of the nested DO loops must run sequentially because they carry a dependency between iterations (each iteration uses a value computed by the previous one), so they cannot easily be parallelized.
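To illustrate the kind of dependency meant here, a hypothetical placeholder for the real calculations (x and f are made-up arrays; the recurrence pattern is the point):

    DO c = 2, d
      ! each iteration reads the previous result, so the
      ! iterations cannot be distributed across threads as-is
      x(c) = x(c-1) + f(c)
    END DO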

I can easily run the loops without any dependencies in parallel, as shown below:

    d = 1
    DO a = 1, 1000
      DO b = 1, 100
        DO c = 1, d
          ! some calculations with dependencies
        END DO
    !$OMP PARALLEL
    !$OMP DO
        DO c = 1, d
          ! some calculations without dependencies
        END DO
    !$OMP END DO
    !$OMP END PARALLEL
        DO c = 1, d
          ! some calculations with dependencies
        END DO
      END DO
      d = d + 1
    END DO

However, I understand there is significant overhead in opening and closing the parallel region, and this happens a huge number of times inside the loops. The code ended up much slower than the original sequential version.

I then realized it would make sense to open and close the parallel region once, outside the main loop (so the overhead is paid only once), and set the number of threads to 1 or 4 to control whether a section runs sequentially or in parallel, as shown below:

    d = 1
    CALL omp_set_num_threads(1)
    !$OMP PARALLEL
    DO a = 1, 1000
      DO b = 1, 100
        DO c = 1, d
          ! some calculations with dependencies
        END DO
        CALL omp_set_num_threads(4)
    !$OMP DO
        DO c = 1, d
          ! some calculations without dependencies
        END DO
    !$OMP END DO
        CALL omp_set_num_threads(1)
        DO c = 1, d
          ! some calculations with dependencies
        END DO
      END DO
      d = d + 1
    END DO
    !$OMP END PARALLEL

However, when I run this, I do not get the speed-up I expected from the parallel code. I expected the first few iterations to be slower to account for the overhead, but after that the parallel version should overtake the serial one — and it doesn't. I compared how long each iteration of the main DO loop took, for DO a = 1, 50; the results are below:

    Iteration     Serial    Parallel
            1     3.8125      4.0781
            2     5.5781      5.9843
            3     7.4375      7.9218
            4     9.2656      9.7500
          ...
           48    89.0625     94.9531
           49    91.0937     97.3281
           50    92.6406     99.6093
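(For reference, per-iteration timings like these can be collected with omp_get_wtime; a minimal sketch — the print format and the 50-iteration cut-off are just illustrative:)

    USE omp_lib
    DOUBLE PRECISION :: t_start

    t_start = omp_get_wtime()
    DO a = 1, 1000
      ! ... loop body as above ...
      ! cumulative elapsed time after each of the first 50 iterations
      IF (a <= 50) PRINT '(I6,F12.4)', a, omp_get_wtime() - t_start
    END DO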

My first thought was that I had somehow set the number of threads incorrectly.

Questions:

  • Is there something clearly wrong with how I structured the parallel code?
  • Is there a better way to implement what I have done / want to do?
2 answers

There is indeed something clearly wrong: you removed the parallelism from your code. Before creating the outermost parallel region, you set its size to a single thread. Therefore only one thread is created to execute any code inside that region. Calling omp_set_num_threads(4) afterwards does not change this: it merely says that the next parallel directive encountered will spawn 4 threads (unless explicitly requested otherwise). But no new parallel directive is nested inside the current one — you only have a do directive, which applies to the current enclosing parallel region of one single thread.
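A minimal self-contained program (the program name and prints are illustrative) makes this visible — the team size is fixed when the region starts, and omp_set_num_threads inside it only affects subsequent parallel directives:

    PROGRAM team_size_demo
      USE omp_lib
      IMPLICIT NONE

      CALL omp_set_num_threads(1)
    !$OMP PARALLEL
    !$OMP SINGLE
      ! prints 1: the region was created with a single thread
      PRINT *, 'threads in region:', omp_get_num_threads()
    !$OMP END SINGLE
      CALL omp_set_num_threads(4)
    !$OMP SINGLE
      ! still prints 1: the new setting only applies to the
      ! NEXT parallel directive, not the currently active region
      PRINT *, 'threads in region:', omp_get_num_threads()
    !$OMP END SINGLE
    !$OMP END PARALLEL
    END PROGRAM team_size_demo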

There are two ways to solve your problem:

  • Keeping your code as it was: although you formally fork and join the threads each time you enter and exit the parallel region, the OpenMP standard does not require that the threads actually be created and destroyed. In fact, it even encourages keeping the threads alive to reduce the overhead of the parallel directive, which is what most OpenMP runtime libraries do. So the penalty of this simple approach is not too high.

  • Using your second approach, hoisting the parallel directive outside of the outer loop, but creating as many threads as the work sharing needs (4 here, judging by your code). Then you enclose everything that must run sequentially within the parallel region in a single directive. This ensures there is no unwanted interaction with the extra threads (implicit barrier and flush upon exit), while avoiding parallelism where you do not want it.

This second version would look like this:

    d = 1
    !$omp parallel num_threads( 4 ) private( a, b, c ) firstprivate( d )
    do a = 1, 1000
      do b = 1, 100
    !$omp single
        do c = 1, d
          ! some calculations with dependencies
        end do
    !$omp end single
    !$omp do
        do c = 1, d
          ! some calculations without dependencies
        end do
    !$omp end do
    !$omp single
        do c = 1, d
          ! some calculations with dependencies
        end do
    !$omp end single
      end do
      d = d + 1
    end do
    !$omp end parallel

Whether this version is actually faster than the naive one is something you will have to test.

One final note: since there are large sequential parts in your code, don't expect too much speed-up anyway. Amdahl's law is unforgiving.
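As a back-of-the-envelope check (p and N below are generic symbols, not measured values), Amdahl's law bounds the speed-up at

    S(N) = 1 / ((1 - p) + p / N)

where p is the fraction of the runtime that is parallelizable and N is the number of threads. If only half the work is parallel (p = 0.5), then even with N = 4 threads the best possible speed-up is 1 / (0.5 + 0.125) = 1.6.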

  • Nothing is clearly wrong, but if the sequential loops take a long time, your speed-up will be limited. You may need to redesign your algorithms for parallel execution.
  • Instead of setting the number of threads inside the loop, use the !$omp master / !$omp end master directives to restrict execution to a single thread. Add an !$omp barrier if the block may only run once all other threads have finished their work — see the sketch after this list.
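A sketch of that suggestion applied to the loop body (same placeholder loops as in the question; the barrier placement is one reasonable choice, not the only one):

    !$omp barrier           ! wait for all threads before the sequential part
    !$omp master
        DO c = 1, d
          ! some calculations with dependencies (master thread only)
        END DO
    !$omp end master
    !$omp barrier           ! make the master's results visible before continuing
    !$omp do
        DO c = 1, d
          ! some calculations without dependencies
        END DO
    !$omp end do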

Source: https://habr.com/ru/post/1261676/

