The second version is correct.
According to the OpenMP specification, #pragma omp parallel for is just a shortcut for #pragma omp parallel followed by #pragma omp for, as in:
```cpp
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < N; ++i)
    {
    }
}
```
If there is any code in the parallel region before or after the loop, it will be executed independently by each thread in the region (unless restricted by other OpenMP directives). But #pragma omp for is a work-sharing construct: the loop that follows this directive is shared by all threads in the region. That is, it is executed as a single loop whose iterations are divided among the threads in some way. Thus, if the parallel region above is executed by four threads, the loop is still executed only once, not four times.
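A quick way to see the difference is to print from both places. This is only an illustrative sketch (not code from your question); compile it with OpenMP enabled, e.g. -fopenmp:

```cpp
#include <cstdio>
#include <omp.h>

int main() {
    const int N = 8;
    #pragma omp parallel
    {
        // Runs once per thread: with 4 threads this line prints 4 times.
        printf("hello from thread %d\n", omp_get_thread_num());

        // Work-sharing: the N iterations are divided among the threads,
        // so the loop body runs N times in total, not N times per thread.
        #pragma omp for
        for (int i = 0; i < N; ++i) {
            printf("iteration %d done by thread %d\n", i, omp_get_thread_num());
        }
    }
    return 0;
}
```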
Back to the example in your question: the phase loop is executed by each thread separately, but #pragma omp for in each phase iteration marks the start of a shared loop. With n = 10, each thread enters that shared loop 10 times and executes its own part of it; so there are not 40 executions of the inner loop, only 10.
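A hypothetical reconstruction of that pattern (not the exact code from your question) might look like this:

```cpp
#include <vector>

// Sketch of the pattern under discussion: the phase loop runs in every
// thread, but each inner loop is one shared loop split across the threads.
void phases(std::vector<int>& a) {
    const int n = static_cast<int>(a.size());
    #pragma omp parallel num_threads(4)
    {
        for (int phase = 0; phase < n; ++phase) {   // executed by every thread
            // One shared loop per phase: its n iterations are divided among
            // the threads, so the inner loop runs n times in total.
            #pragma omp for
            for (int i = 0; i < n; ++i) {
                a[i] += phase;                      // placeholder work
            }
            // Implicit barrier here keeps all threads on the same phase.
        }
    }
}
```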
Note that there is an implicit barrier at the end of #pragma omp for; this means that a thread that has completed its part of the shared loop will not continue until all the other threads have finished their parts as well. So execution is synchronized across the threads. This is necessary for correctness in most cases; in your example, for instance, it guarantees that the threads always work on the same phase. However, if consecutive shared loops within the region are safe to execute simultaneously, the nowait clause can be used to remove the implicit barrier and allow threads to move on to the rest of the parallel region immediately.
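For example, here is a sketch with two independent shared loops where nowait is safe (the arrays x and y are made up for illustration and assumed not to overlap):

```cpp
// A thread that finishes its share of the first loop may start on the
// second loop immediately instead of waiting at the implicit barrier,
// because the two loops write to disjoint arrays.
void fill(float* x, float* y, int n) {
    #pragma omp parallel
    {
        #pragma omp for nowait
        for (int i = 0; i < n; ++i)
            x[i] = 1.0f;

        #pragma omp for          // the barrier at the end of this loop remains
        for (int i = 0; i < n; ++i)
            y[i] = 2.0f;
    }
}
```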
Note also that this treatment of work-sharing directives is quite specific to OpenMP. In other parallel programming frameworks, the logic you used in the question may well be correct.
Finally, smart OpenMP implementations do not join the worker threads after a parallel region completes; instead, the threads may busy-wait for a while and then sleep until the next parallel region begins. This is done precisely to keep the overhead at the beginning and end of parallel regions low. So, although the optimization proposed in the book may still remove some overhead, for some algorithms its effect on runtime can be negligible. The algorithm in question is likely one of them: in the first implementation, the parallel regions follow each other quickly inside a sequential loop, so the OpenMP worker threads are most likely still active when each region starts and begin work quickly, avoiding the fork/join overhead. Therefore, do not be surprised if in practice you see no performance difference from the described optimization.
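For comparison, that first variant would look roughly like this (again a sketch, not the book's exact code); each phase opens and closes its own parallel region, which is where the nominal fork/join cost comes from:

```cpp
#include <vector>

// Rough sketch of the first variant: a new parallel region is opened for
// every phase, so the nominal fork/join cost is paid n times. With a
// runtime that keeps its worker threads alive between regions, the measured
// difference from the second variant may be small.
void phases_first_variant(std::vector<int>& a) {
    const int n = static_cast<int>(a.size());
    for (int phase = 0; phase < n; ++phase) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            a[i] += phase;                  // placeholder work
        }
    }
}
```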