Why is OpenMP slow in this case?

I am trying to understand why OpenMP works the way it does in the following example.

    #include <omp.h>
    #include <iostream>
    #include <vector>
    #include <stdlib.h>

    void AddVectors(std::vector<double>& v1, std::vector<double>& v2)
    {
        size_t i;
        #pragma omp parallel for private(i)
        for (i = 0; i < v1.size(); i++)
            v1[i] += v2[i];
    }

    int main(int argc, char** argv)
    {
        size_t N1 = atoi(argv[1]);
        std::vector<double> v1(N1, 1);
        std::vector<double> v2(N1, 2);

        for (size_t i = 0; i < N1; i++)
            AddVectors(v1, v2);

        return 0;
    }

First, I compiled the code above without enabling OpenMP (omitting -fopenmp from the compilation flags). The run time for N1 = 10000 was 0.1 s. With OpenMP enabled, the run time exceeded 1 minute; I killed the program before it finished (I got tired of waiting...).

I am compiling the code as shown below:

    g++ -std=c++0x -O3 -funroll-loops -march=core2 -fomit-frame-pointer -Wall -fno-strict-aliasing -o main.o -c main.cpp

    g++ main.o -o main

Not all of these flags are needed here, but I use them in the project that I am trying to parallelize, which is why I kept them. To enable OpenMP, I additionally pass -fopenmp.
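For completeness, here is how I would expect the build commands to look with OpenMP enabled (this is an assumption on my part; -fopenmp is needed at the link step too, so that the OpenMP runtime library is pulled in):

    g++ -std=c++0x -O3 -funroll-loops -march=core2 -fomit-frame-pointer -Wall -fno-strict-aliasing -fopenmp -o main.o -c main.cpp
    g++ -fopenmp main.o -o main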

Does anyone know what is going on? Thanks!

+4
3 answers

I tried the same example in Visual Studio 2008. I made two modifications to your sample code, and it works about 3 times faster with OpenMP than without OpenMP.

I cannot confirm this with GCC, but the problem may be the outer loop in main where AddVectors is called: every call has to perform a fork (and join) of the thread team, and that takes a measurable amount of time. With N1 = 10000, that means 10000 fork operations.

I added my own version of the code below, modified only to make it build in Visual Studio, and I added a print statement at the end to keep the compiler from optimizing all of the work away.

    #include <omp.h>
    #include <iostream>
    #include <vector>
    #include <stdlib.h>
    #include <stdio.h>

    void AddVectors(std::vector<double>& v1, std::vector<double>& v2)
    {
        #pragma omp parallel for
        for (int i = 0; i < static_cast<int>(v1.size()); i++)
            v1[i] += v2[i];
    }

    int main(int argc, char** argv)
    {
        size_t N1 = atoi(argv[1]);
        std::vector<double> v1(N1, 1);
        std::vector<double> v2(N1, 2);

        for (size_t i = 0; i < N1; i++)
            AddVectors(v1, v2);

        printf("%g\n", v1[0]);
        return 0;
    }
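If the repeated fork/join really is the bottleneck, one way to test that would be to open the parallel region once and reuse the thread team across the outer repetitions. The following is only a sketch of that idea (not the original code), keeping the same arithmetic but forking the threads a single time:

    #include <omp.h>
    #include <vector>
    #include <stdlib.h>
    #include <stdio.h>

    int main(int argc, char** argv)
    {
        size_t N1 = atoi(argv[1]);
        int n = static_cast<int>(N1);
        std::vector<double> v1(N1, 1);
        std::vector<double> v2(N1, 2);

        // One parallel region for the whole run: the thread team is forked
        // once here instead of once per AddVectors call.
        #pragma omp parallel
        {
            for (size_t rep = 0; rep < N1; rep++) {
                // Work-sharing construct inside the existing region; no new fork.
                #pragma omp for
                for (int i = 0; i < n; i++)
                    v1[i] += v2[i];
                // The implicit barrier at the end of "omp for" keeps the
                // repetitions in order.
            }
        }

        printf("%g\n", v1[0]);
        return 0;
    }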
+2

Could g++ be optimizing away the AddVectors calls entirely? Try returning the last element of v1 and storing it in a volatile variable.
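A minimal sketch of that suggestion (the variable names and the print at the end are my own additions, not part of the original question):

    #include <omp.h>
    #include <vector>
    #include <stdlib.h>
    #include <stdio.h>

    double AddVectors(std::vector<double>& v1, std::vector<double>& v2)
    {
        #pragma omp parallel for
        for (int i = 0; i < static_cast<int>(v1.size()); i++)
            v1[i] += v2[i];
        return v1.back();             // a value that depends on all the work
    }

    int main(int argc, char** argv)
    {
        size_t N1 = atoi(argv[1]);
        std::vector<double> v1(N1, 1);
        std::vector<double> v2(N1, 2);

        volatile double sink = 0;     // a volatile store the compiler must keep
        for (size_t i = 0; i < N1; i++)
            sink = AddVectors(v1, v2);

        printf("%g\n", static_cast<double>(sink));
        return 0;
    }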

+1

The problem is the array type being used.

std::vector is a container: a structure that stores not only the data but also bookkeeping information such as its size, begin and end, and so on, and it exposes several member functions, operator[] being the one used here to access the data. As a result, the cache lines loaded when you access index i of a vector V tend to bring in the element V[i] together with information that this code never uses.

Conversely, if you use classic (static or dynamic) arrays, operator[] loads only the data items, so a cache line (usually 64 bytes) holds 8 elements of such a double array (sizeof(double) = 8 bytes).

See the difference between _mm_malloc and malloc for better data alignment.
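As an illustration of that point, here is a sketch using plain arrays allocated with _mm_malloc for 64-byte (cache-line) alignment; the array size and initial values are just placeholders for the example:

    #include <omp.h>
    #include <xmmintrin.h>   // _mm_malloc / _mm_free
    #include <stdio.h>

    int main()
    {
        const int N = 10000000;

        // Allocate the arrays aligned to a 64-byte cache line.
        double* A = static_cast<double*>(_mm_malloc(N * sizeof(double), 64));
        double* B = static_cast<double*>(_mm_malloc(N * sizeof(double), 64));

        for (int i = 0; i < N; i++) { A[i] = 1.0; B[i] = 2.0; }

        // Same vector addition, but operator[] is now plain pointer arithmetic.
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            A[i] += B[i];

        printf("%g\n", A[0]);

        _mm_free(A);
        _mm_free(B);
        return 0;
    }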

@Mr Fooz I'm not sure about that. Let's compare the performance results for both cases:

4 threads on an i7 processor

Array time: 0.122007 | Repeat: 4 | MFlops: 327.85

Vector time: 0.101006 | Repeat: 2 | MFlops: 188.669

I force the run time to be greater than 0.1 s, so the code repeats the measurement, doubling the repeat count until that threshold is reached. The main loop:

    const int N = 10000000;

    timing(&wcs);
    for (; runtime < 0.1; repeat *= 2) {
        for (int r = 0; r < repeat; r++) {
            #pragma omp parallel for
            for (int i = 0; i < N; i++) {
                A[i] += B[i];
            }
            if (A[0] == 0) dummy(A[0]);
        }
        timing(&wce);
        runtime = wce - wcs;
    }

MFlops = ((N * repeat) / runtime) / 1,000,000
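As a sanity check, the array figure follows from that formula: (10,000,000 * 4) / 0.122007 / 1,000,000 ≈ 327.85 MFlops.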

+1

Source: https://habr.com/ru/post/1344827/

