In my Fortran 95 code, I have a series of nested DO loops that take a considerable amount of time to complete, so I wanted to parallelize them with OpenMP (compiling and linking with gfortran -fopenmp).
There is one main DO loop running 1000 times.
Inside this loop, there is a sub-DO loop that runs 100 times.
Several other DO loops are nested inside it. Their iteration count d increases by one with each iteration of the main DO loop (1 on the first pass, up to 1000 on the last).
Example:
    DO a = 1, 1000
      DO b = 1, 100
        DO c = 1, d
          some calculations
        END DO
        DO c = 1, d
          some calculations
        END DO
        DO c = 1, d
          some calculations
        END DO
      END DO
      d = d + 1
    END DO
Some of the nested DO loops must run sequentially because they contain loop-carried dependencies (each iteration uses a value computed in the previous iteration), so they cannot easily be parallelized.
I can easily make the loops without dependencies run in parallel, as shown below:
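To make the distinction concrete, here is a minimal sketch of the two kinds of loop. The arrays x and y, the size n, and the arithmetic are made up for illustration and are not taken from the real code.

```fortran
program dependency_example
   implicit none
   integer, parameter :: n = 5   ! hypothetical size, for illustration only
   real :: x(n), y(n)
   integer :: c

   x(1) = 1.0

   ! Loop-carried dependency: x(c) needs x(c-1) from the previous
   ! iteration, so the iterations have to run in order.
   do c = 2, n
      x(c) = 2.0 * x(c-1)
   end do

   ! No dependency between iterations: each y(c) only uses x(c),
   ! so the iterations could run in any order or in parallel.
   do c = 1, n
      y(c) = x(c) + 1.0
   end do

   print *, x
   print *, y
end program dependency_example
```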
    d = 1
    DO a = 1, 1000
      DO b = 1, 100
        DO c = 1, d
          some calculations with dependencies
        END DO
        !$OMP PARALLEL
        !$OMP DO
        DO c = 1, d
          some calculations without dependencies
        END DO
        !$OMP END DO
        !$OMP END PARALLEL
        DO c = 1, d
          some calculations with dependencies
        END DO
      END DO
      d = d + 1
    END DO
However, I understand that there is significant overhead in opening and closing a parallel region, and here that happens on every pass through the inner loops. The code runs much more slowly than it did sequentially.
I then thought it would make sense to open and close the parallel region outside the main loop (so the overhead is paid only once) and to set the number of threads to 1 or to a larger value (4 in the code below) to control whether each section runs sequentially or in parallel:
    d = 1
    CALL omp_set_num_threads(1)
    !$OMP PARALLEL
    DO a = 1, 1000
      DO b = 1, 100
        DO c = 1, d
          some calculations with dependencies
        END DO
        CALL omp_set_num_threads(4)
        !$OMP DO
        DO c = 1, d
          some calculations without dependencies
        END DO
        !$OMP END DO
        CALL omp_set_num_threads(1)
        DO c = 1, d
          some calculations with dependencies
        END DO
      END DO
      d = d + 1
    END DO
    !$OMP END PARALLEL
However, when I run this, I do not get the speedup I expected from parallel code. I expected the first few iterations to be slower because of the overhead, but after a while I expected the parallel code to run faster than the serial code, which it does not. I compared how long each iteration of the main DO loop took for DO a = 1, 50. The results are below:
    Iteration   Serial    Parallel
    1            3.8125    4.0781
    2            5.5781    5.9843
    3            7.4375    7.9218
    4            9.2656    9.7500
    ...
    48          89.0625   94.9531
    49          91.0937   97.3281
    50          92.6406   99.6093
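For comparisons like the one above, it matters whether the timer reports wall-clock time or CPU time, because CPU time can accumulate over all threads. The following is only a sketch of wall-clock timing per outer iteration using omp_get_wtime from the OpenMP runtime library. The workload, the loop bounds, and the variable names (total, t_start, t_end) are placeholders, not the real calculations.

```fortran
program time_iterations
   use omp_lib
   implicit none
   integer :: a, c, d
   double precision :: t_start, t_end, total

   d = 1
   do a = 1, 5
      t_start = omp_get_wtime()      ! wall-clock time in seconds

      total = 0.0d0
      ! Placeholder work without dependencies between iterations
      !$omp parallel do reduction(+:total)
      do c = 1, d * 100000
         total = total + sqrt(dble(c))
      end do
      !$omp end parallel do

      t_end = omp_get_wtime()
      print '(a,i0,a,f10.6,a,es12.4)', 'iteration ', a, ': ', &
            t_end - t_start, ' s, total = ', total
      d = d + 1
   end do
end program time_iterations
```

This would be built the same way as the main code, with gfortran -fopenmp.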
My first thought is that I somehow did not correctly set the number of threads.
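One way to check this would be to print the team size from inside the parallel region. The sketch below mirrors the structure above (omp_set_num_threads(1) before the region, omp_set_num_threads(4) inside it) and is an assumption about how to test it, not part of the real code. The loop bound of 8 and the variable team_size are made up. If the team size is fixed when the region is entered, the call inside the region should not change it.

```fortran
program check_threads
   use omp_lib
   implicit none
   integer :: c, team_size

   team_size = 0
   call omp_set_num_threads(1)

   !$omp parallel shared(team_size) private(c)

   ! Same pattern as in the question: request more threads
   ! after the parallel region has already started.
   call omp_set_num_threads(4)

   ! Record how many threads are actually in the current team.
   !$omp single
   team_size = omp_get_num_threads()
   !$omp end single

   ! Show which thread executes each iteration of the work-shared loop.
   !$omp do
   do c = 1, 8
      print '(a,i0,a,i0)', 'c = ', c, ' on thread ', omp_get_thread_num()
   end do
   !$omp end do

   !$omp end parallel

   print '(a,i0)', 'team size inside the region: ', team_size
end program check_threads
```

If every iteration reports the same thread number, the work-shared loop is not actually being split across threads.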
Questions:
- Is there something clearly wrong with how I structured the parallel code?
- Is there a better way to implement what I have done / want to do?