Reducing OpenMP fork/join overhead by splitting #pragma omp parallel and #pragma omp for

I am reading the book An Introduction to Parallel Programming by Peter S. Pacheco. In Section 5.6.2 he gives an interesting discussion of reducing fork/join overhead. Consider the odd-even transposition sort algorithm:

    for (phase = 0; phase < n; phase++) {
        if (phase % 2 == 0) {
            #pragma omp parallel for default(none) shared(n) private(i)
            for (i = 1; i < n; i += 2) {
                // meat
            }
        } else {
            #pragma omp parallel for default(none) shared(n) private(i)
            for (i = 1; i < n - 1; i += 2) {
                // meat
            }
        }
    }

The author claims that the code above has nontrivial fork/join overhead, because the threads are forked and joined in every iteration of the outer loop. He therefore offers the following version:

    #pragma omp parallel default(none) shared(n) private(i, phase)
    for (phase = 0; phase < n; phase++) {
        if (phase % 2 == 0) {
            #pragma omp for
            for (i = 1; i < n; i += 2) {
                // meat
            }
        } else {
            #pragma omp for
            for (i = 1; i < n - 1; i += 2) {
                // meat
            }
        }
    }
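For concreteness, here is a complete, compilable sketch of the second version. The filling-in is mine, not the book's exact code: I replace the // meat with the usual compare-and-swap of odd-even transposition sort on an int array a, which I also add to the shared clause.

    /* Sketch of the second version with the "meat" filled in as the
       compare-and-swap of odd-even transposition sort (my own addition).
       Compile with e.g. gcc -fopenmp. */
    #include <omp.h>
    #include <stdio.h>

    void odd_even_sort(int a[], int n) {
        int i, phase;
        #pragma omp parallel default(none) shared(a, n) private(i, phase)
        for (phase = 0; phase < n; phase++) {
            if (phase % 2 == 0) {
                #pragma omp for
                for (i = 1; i < n; i += 2) {
                    if (a[i - 1] > a[i]) {
                        int tmp = a[i - 1];
                        a[i - 1] = a[i];
                        a[i] = tmp;
                    }
                }
            } else {
                #pragma omp for
                for (i = 1; i < n - 1; i += 2) {
                    if (a[i] > a[i + 1]) {
                        int tmp = a[i];
                        a[i] = a[i + 1];
                        a[i + 1] = tmp;
                    }
                }
            }
        }
    }

    int main(void) {
        int a[] = {9, 3, 7, 1, 8, 2, 6, 0, 5, 4};
        odd_even_sort(a, 10);
        for (int i = 0; i < 10; i++)
            printf("%d ", a[i]);
        printf("\n");
        return 0;
    }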

According to the author, the second version forks the threads once, before the outer loop starts, and reuses them in every iteration, which gives better performance.

However, I am suspicious of the correctness of the second version. As I understand it, #pragma omp parallel starts a team of threads and lets each thread execute the following structured block in parallel. In this case the structured block is the entire outer loop for(phase=0, ...). Shouldn't that mean the whole outer loop is executed four times by four threads? That is, if n=10, then 40 outer iterations would be performed across the 4 threads. What is wrong with my understanding? And how does omp parallel (without for) combine with the for loop that follows it, as in the code above?

1 answer

The second version is correct.

According to the OpenMP specification, #pragma omp parallel for is just a shortcut for #pragma omp parallel followed by #pragma omp for, as in:

    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < N; ++i) {
            /* loop body */
        }
    }

If there is any code in the parallel region before or after the worksharing loop, it is executed by every thread in the region independently (unless restricted by other OpenMP directives). But #pragma omp for is a worksharing construct: the loop that follows this directive is shared by all threads in the region. That is, it is executed as a single loop whose iterations are somehow divided among the threads. So if the parallel region above is executed by four threads, the loop is still executed only once, not four times.
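As a quick illustration (a minimal sketch of my own, not from your code): the printf below, which sits outside the worksharing construct, runs once per thread, while the loop's iterations are divided among the threads.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        #pragma omp parallel
        {
            /* Executed by every thread in the region: with 4 threads,
               this line prints 4 times. */
            printf("hello from thread %d\n", omp_get_thread_num());

            /* Worksharing: the 8 iterations are divided among the threads,
               so each value of i is printed exactly once in total. */
            #pragma omp for
            for (int i = 0; i < 8; ++i)
                printf("i = %d done by thread %d\n", i, omp_get_thread_num());
        }
        return 0;
    }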

Going back to the example in your question: the phase loop is executed by each thread separately, but at every phase iteration #pragma omp for marks the beginning of a shared (worksharing) loop. When n = 10, each thread enters a shared loop 10 times and executes its part of it; so there are not 40 executions of the inner loops, but only 10.
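If you want to convince yourself, here is a sketch of my own in which the loop bodies are replaced by an atomic counter (not in your code). It counts the total number of inner-loop iterations; for n = 10 it prints 45, exactly as a sequential run would, no matter how many threads execute the region:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int n = 10, count = 0;
        int i, phase;
        #pragma omp parallel default(none) shared(n, count) private(i, phase)
        for (phase = 0; phase < n; phase++) {
            if (phase % 2 == 0) {
                #pragma omp for
                for (i = 1; i < n; i += 2) {
                    #pragma omp atomic
                    count++;          /* stands in for the even-phase "meat" */
                }
            } else {
                #pragma omp for
                for (i = 1; i < n - 1; i += 2) {
                    #pragma omp atomic
                    count++;          /* stands in for the odd-phase "meat" */
                }
            }
        }
        /* 5 even phases * 5 iterations + 5 odd phases * 4 iterations = 45,
           independent of the number of threads. */
        printf("inner-loop body executed %d times\n", count);
        return 0;
    }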

Note that there is an implicit barrier at the end of #pragma omp for; this means that a thread that has finished its part of the shared loop will not continue until all the other threads have finished their parts too. So execution is synchronized across the threads. This is necessary for correctness in most cases; in your example, for instance, it guarantees that the threads always work on the same phase. However, if subsequent shared loops within the region are safe to execute simultaneously, the nowait clause can be used to remove the implicit barrier and let threads proceed to the rest of the parallel region immediately.
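For completeness, a small sketch of the nowait clause (my own example, not applicable to your odd-even sort, where the phases must stay synchronized): two independent loops in one parallel region, where threads do not have to wait at the end of the first loop.

    #include <omp.h>

    /* x and y are independent arrays of length n, so a thread that has
       finished its chunk of the first loop may safely start on the second. */
    void fill(double *x, double *y, int n) {
        #pragma omp parallel
        {
            #pragma omp for nowait   /* no barrier after this loop */
            for (int i = 0; i < n; ++i)
                x[i] = 1.0;

            #pragma omp for          /* implicit barrier at the end */
            for (int i = 0; i < n; ++i)
                y[i] = 2.0;
        }
    }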

Note also that this handling of worksharing directives is quite specific to OpenMP; in other parallel programming frameworks, the logic you used in the question might well be the correct one.

Finally, smart OpenMP implementations do not join the threads after a parallel region completes; instead, the threads may busy-wait for a while and then sleep until another parallel region starts. This is done precisely to keep the overhead at the beginning and end of parallel regions low. So, although the optimization suggested in the book may still eliminate some overhead, for some algorithms its impact on run time can be negligible. The algorithm in question is probably one of them: in the first implementation, the parallel regions follow one another quickly inside the sequential loop, so the OpenMP worker threads are likely to still be active when a region begins and to start it quickly, avoiding fork/join overhead. So do not be surprised if in practice you see no performance difference from the described optimization.


Source: https://habr.com/ru/post/978796/

