Proper use of nested #pragma omp for directives

The following code ran like a charm before OpenMP parallelization was applied. After adding the directives, however, it ends up in what looks like an infinite loop! I am sure this is the result of my incorrect use of the OpenMP directives. Could you show me the right way? Thank you very much.

    #pragma omp parallel for
    for (int nY = nYTop; nY <= nYBottom; nY++)
    {
        for (int nX = nXLeft; nX <= nXRight; nX++)
        {
            // Use look-up table for performance
            dLon = theApp.m_LonLatLUT.LonGrid()[nY][nX] + m_FavoriteSVISSRParams.m_dNadirLon;
            dLat = theApp.m_LonLatLUT.LatGrid()[nY][nX];

            // If you don't want to use longitude/latitude look-up table, uncomment the following line
            //NOMGeoLocate.XYToGEO(dLon, dLat, nX, nY);

            if (dLon > 180 || dLat > 180)
            {
                continue;
            }
            if (Navigation.GeoToXY(dX, dY, dLon, dLat, 0) > 0)
            {
                continue;
            }

            // Skip void data scanline
            dY = dY - nScanlineOffset;

            // Compute coefficients as well as its four neighboring points' values
            nX1 = int(dX);
            nX2 = nX1 + 1;
            nY1 = int(dY);
            nY2 = nY1 + 1;
            dCx = dX - nX1;
            dCy = dY - nY1;
            dP1 = pIRChannelData->operator [](nY1)[nX1];
            dP2 = pIRChannelData->operator [](nY1)[nX2];
            dP3 = pIRChannelData->operator [](nY2)[nX1];
            dP4 = pIRChannelData->operator [](nY2)[nX2];

            // Bilinear interpolation
            usNomDataBlock[nY][nX] = (unsigned short)BilinearInterpolation(dCx, dCy, dP1, dP2, dP3, dP4);
        }
    }
+4
4 answers

Do not nest it that deep. As a rule, it is enough to determine one good point for parallelization and get by with a single directive.

Some comments and possibly the root of your problem:

    #pragma omp parallel default(shared) // Here you open several threads ...
    {
        #pragma omp for
        for (int nY = nYTop; nY <= nYBottom; nY++)
        {
            #pragma omp parallel shared(nY, nYBottom) // Same here ...
            {
                #pragma omp for
                for (int nX = nXLeft; nX <= nXRight; nX++)
                {

(Conceptually) you open many threads; in each of them you again open many threads for the for loop. Then, for each thread of that for loop, you open many threads once more, and for each of those you open many more for the other for loop.

This is (thread (thread)*)+ in pattern-matching terms; it should be just thread+.
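
As a standalone illustration (not from the question; nested parallelism has to be enabled explicitly, e.g. via omp_set_nested, for the effect to show), the multiplication looks like this:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        omp_set_nested(1);                       // allow nested parallel regions
        #pragma omp parallel num_threads(4)      // outer region: 4 threads ...
        {
            #pragma omp parallel num_threads(4)  // ... each of which spawns 4 more: 16 in total
            {
                #pragma omp critical
                printf("outer thread %d / inner thread %d\n",
                       omp_get_ancestor_thread_num(1), omp_get_thread_num());
            }
        }
        return 0;
    }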

Just do it in one parallel region. Do not be so fine-grained; parallelize along the outer loop, so that each thread works for as long as possible:

    #pragma omp parallel for
    for (int nY = nYTop; nY <= nYBottom; nY++)
    {
        for (int nX = nXLeft; nX <= nXRight; nX++)
        {
        }
    }

Avoid sharing data and cache lines between threads (another reason why the threads should not be too fine-grained over your data).
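
A small sketch of that cache point (hypothetical names, not from the original code): threads hammering neighbouring elements of a shared array keep invalidating each other's cache line, while accumulating into a thread-local variable avoids it.

    #include <omp.h>

    // Illustrative only: names and loop counts are made up for the sketch.
    #define NTHREADS 8
    double counters[NTHREADS];      // adjacent doubles: several of them share one cache line

    void shared_writes(void)        // false sharing: the cache line ping-pongs between cores
    {
        #pragma omp parallel num_threads(NTHREADS)
        {
            int t = omp_get_thread_num();
            for (long i = 0; i < 100000000L; i++)
                counters[t] += 1.0; // every write invalidates the neighbours' cached copies
        }
    }

    void local_accumulate(void)     // same result, no sharing in the hot loop
    {
        #pragma omp parallel num_threads(NTHREADS)
        {
            int t = omp_get_thread_num();
            double local = 0.0;     // private to the thread
            for (long i = 0; i < 100000000L; i++)
                local += 1.0;
            counters[t] = local;    // a single shared write at the end
        }
    }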

Once it runs stably and shows good speed, you can tune it with different scheduling policies, as listed on your OpenMP reference sheet.
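
As a sketch of that tuning step (reusing the outer loop from the question, body elided), the schedule clause is the usual knob to turn:

    // Try different scheduling policies once the single parallel-for version is correct.
    #pragma omp parallel for schedule(static)       // rows split into equal contiguous blocks
    for (int nY = nYTop; nY <= nYBottom; nY++) { /* ... row body ... */ }

    #pragma omp parallel for schedule(dynamic, 4)   // threads grab 4 rows at a time; helps if rows differ in cost
    for (int nY = nYTop; nY <= nYBottom; nY++) { /* ... row body ... */ }

    #pragma omp parallel for schedule(guided)       // big chunks first, shrinking towards the end
    for (int nY = nYTop; nY <= nYBottom; nY++) { /* ... row body ... */ }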

And place variable declarations where you really need them, so you do not overwrite what sibling threads are reading.

+4

You can also collapse several loops effectively. There are restrictions on the loop conditions: they must be independent. Moreover, not every compiler supports the collapse keyword. (For gcc with OpenMP it works.)

    int i, j, k;
    #pragma omp parallel for collapse(3)
    for (i = 0; i <= N-1; i++)
        for (j = 0; j <= N-1; j++)
            for (k = 0; k <= N-1; k++)
            {
                // something useful...
            }
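
If compiler support is a concern, one way around it (a sketch, assuming collapse arrived with OpenMP 3.0, i.e. _OPENMP >= 200805, and that N is defined elsewhere) is to guard the clause behind the version macro:

    // N is assumed to be defined elsewhere.
    #if defined(_OPENMP) && _OPENMP >= 200805
      #pragma omp parallel for collapse(3)  // OpenMP 3.0 or newer: collapse all three loops
    #else
      #pragma omp parallel for              // older compilers: parallelize the outermost loop only
    #endif
    for (int i = 0; i <= N-1; i++)
        for (int j = 0; j <= N-1; j++)      // declared in the loop headers, so private either way
            for (int k = 0; k <= N-1; k++)
            {
                // something useful...
            }
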
+3

In practice, it is usually most advantageous to parallelize only the outermost loop. Parallelizing all the inner loops can give you too many threads (though OpenMP sticks to the number of hardware execution units unless told otherwise). More importantly, parallelizing an inner loop most likely creates and destroys threads too often, and that is an expensive operation. Your CPU would end up executing threading API calls instead of your workload.
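
A sketch of the difference (Process, nYRows and nXCols are hypothetical placeholders):

    // Process(), nYRows and nXCols are placeholders, not from the original code.

    // Inner loop parallelized: the parallel region is entered once per outer
    // iteration, paying the thread-management overhead nYRows times.
    for (int nY = 0; nY < nYRows; nY++)
    {
        #pragma omp parallel for
        for (int nX = 0; nX < nXCols; nX++)
            Process(nY, nX);
    }

    // Outer loop parallelized: a single parallel region; each thread keeps a
    // long-running share of the rows.
    #pragma omp parallel for
    for (int nY = 0; nY < nYRows; nY++)
        for (int nX = 0; nX < nXCols; nX++)
            Process(nY, nX);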

Not quite an answer, but I decided to share the experience.

+2

There are write-safety issues for all the variables assigned inside the inner loop. Every thread will try to assign values to the same variables, and you will most likely get garbage. For example, two threads may update dLon at the same time, so that thread 1 passes thread 2's value to Navigation.GeoToXY(dX, dY, dLon, dLat, 0). Since you invoke other methods inside the loop, those methods, called with garbage arguments, may never terminate.

To solve this, either declare the variables locally right inside the loop to which omp parallel for is applied, or use clauses such as private or firstprivate to make OpenMP create a local copy of each variable for every thread. With firstprivate it will also copy in the initialized global value. For instance,

    int dLon = 0;
    #pragma omp parallel for firstprivate(dLon) // dLon = 0 for each thread
    for (...)
    {
        // Each thread has its own dLon variable, so there is no clash when writing to it
        dLon = ...;
    }
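
The other option mentioned above, declaring the variables inside the parallel loop body, makes them per-thread by construction. A sketch based on the code from the question (the double types are an assumption; adjust to the real declarations):

    #pragma omp parallel for
    for (int nY = nYTop; nY <= nYBottom; nY++)
    {
        for (int nX = nXLeft; nX <= nXRight; nX++)
        {
            // Local to the iteration, hence local to the executing thread:
            // no other thread can overwrite it between the read and the use.
            double dLon = theApp.m_LonLatLUT.LonGrid()[nY][nX]   // types assumed
                          + m_FavoriteSVISSRParams.m_dNadirLon;
            double dLat = theApp.m_LonLatLUT.LatGrid()[nY][nX];
            // ... dX, dY, nX1, nY1, dCx, dCy, dP1..dP4 declared the same way ...
        }
    }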

Read more about these clauses here: https://computing.llnl.gov/tutorials/openMP/

+2

Source: https://habr.com/ru/post/1382920/

