Do not nest the parallel directives too deeply. As a rule, it is enough to pick one good point for parallelization and get by with a single directive.
Some comments and possibly the root of your problem:
    #pragma omp parallel default(shared)  // Here you open several threads ...
    {
        #pragma omp for
        for (int nY = nYTop; nY <= nYBottom; nY++)
        {
            #pragma omp parallel shared(nY, nYBottom)  // Same here ...
            {
                #pragma omp for
                for (int nX = nXLeft; nX <= nXRight; nX++)
                {
Conceptually, you open many threads, and in each of them you open many threads again inside the for loop. So every thread working on the outer loop opens a further set of threads for the inner loop.
In pattern-matching terms this is (thread (thread)*)+ when it should only be thread+.
Use just one parallel region. Do not make the work too fine-grained: parallelize over the outer loop, so that each thread runs for as long as possible:
    #pragma omp parallel for
    for (int nY = nYTop; nY <= nYBottom; nY++)
    {
        for (int nX = nXLeft; nX <= nXRight; nX++)
        {
        }
    }
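To make the single-directive version concrete, here is a minimal sketch with a loop body filled in. The function name brightenRegion, the row-major image layout, and the "add 1 to every pixel" work are assumptions made up for illustration; they are not from your code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: adds 1 to every pixel in the rectangle
// [nXLeft, nXRight] x [nYTop, nYBottom] of a row-major image.
void brightenRegion(std::vector<int>& image, int width,
                    int nXLeft, int nXRight, int nYTop, int nYBottom)
{
    assert(width > 0);

    // One parallel region, one directive: the rows are split among the threads.
    #pragma omp parallel for
    for (int nY = nYTop; nY <= nYBottom; nY++)
    {
        // The inner loop runs sequentially inside whichever thread owns row nY.
        for (int nX = nXLeft; nX <= nXRight; nX++)
        {
            // Each (nX, nY) maps to a distinct element, so no two threads
            // write to the same pixel.
            image[static_cast<std::size_t>(nY) * width + nX] += 1;
        }
    }
}
```

Built without OpenMP support the pragma is simply ignored and the loops run serially, which is a handy way to check that the parallel version gives the same result.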
Avoid sharing data and cache lines between threads (another reason why the threads should not work on too fine-grained pieces of your data).
Once that runs stably and shows a good speed-up, you can tune it with different scheduling algorithms, as listed on your OpenMP reference card.
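As a rough illustration of tuning the schedule, the sketch below uses a made-up workload (countTriangle, not from your code) in which later rows cost more than earlier ones, so a dynamic schedule may balance the threads better than the default. The chunk size of 16 is an arbitrary assumption; measure before settling on anything.

```cpp
#include <cassert>

// Hypothetical workload: row nY costs nY + 1 inner iterations, so the
// rows are uneven and an even static split would leave some threads idle.
long countTriangle(int nRows)
{
    assert(nRows >= 0);
    long total = 0;

    // schedule(dynamic, 16): threads fetch 16 rows at a time as they finish.
    // reduction(+ : total): each thread sums privately, results are combined
    // at the end, so no thread overwrites another thread's partial sum.
    #pragma omp parallel for schedule(dynamic, 16) reduction(+ : total)
    for (int nY = 0; nY < nRows; nY++)
    {
        for (int nX = 0; nX <= nY; nX++)
        {
            total += 1;
        }
    }
    return total;
}
```

Other standard choices are schedule(static), schedule(guided), and schedule(runtime), the last of which reads the policy from the OMP_SCHEDULE environment variable so you can experiment without recompiling.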
And put your variable declarations where you really need them. Do not overwrite what sibling threads read.
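A small sketch of what that means in practice: a variable declared inside the loop body is private to each thread automatically, whereas one declared before the parallel region would be shared and overwritten by every thread. The function multiplicationTable and its contents are invented for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: stores nX * nY for each cell of a width x height grid.
std::vector<int> multiplicationTable(int width, int height)
{
    assert(width > 0 && height > 0);
    std::vector<int> table(static_cast<std::size_t>(width) * height);

    #pragma omp parallel for
    for (int nY = 0; nY < height; nY++)
    {
        for (int nX = 0; nX < width; nX++)
        {
            // Declared inside the loop body, so every thread has its own copy.
            // Declared before the pragma, it would be shared and racy.
            const int product = nX * nY;
            table[static_cast<std::size_t>(nY) * width + nX] = product;
        }
    }
    return table;
}
```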