Filling Vectors Across OpenMP Threads

I have an algorithm whose purpose, in part, is to populate vectors. For the sake of performance, the iterations of the algorithm are distributed across OpenMP threads. I was wondering which approach would be the best and safest way to populate the vectors.

Note that the ordering of the vectors must be consistent (i.e., the value at index n of vec1 must come from the same iteration as the value at index n of vec2).

Hypothesis 1:

    std::vector<BasicType> vec1;
    std::vector<BasicType> vec2;
    #pragma omp parallel for
    for (...) {
        // Do some intensive stuff to compute val1 and val2
        // ...
        #pragma omp critical
        {
            vec1.push_back(val1);
            vec2.push_back(val2);
        }
    }
    // Then go on to work with vec1 and vec2...

Hypothesis 2:

    std::vector<BasicType> vec1;
    std::vector<BasicType> vec2;
    #pragma omp parallel
    {
        std::vector<BasicType> vec1local;
        std::vector<BasicType> vec2local;
        #pragma omp for
        for (...) {
            // Do some intensive stuff to compute val1 and val2
            // ...
            vec1local.push_back(val1);
            vec2local.push_back(val2);
        }
        #pragma omp critical
        {
            // See note 1 below
            vec1.insert(vec1.end(), vec1local.begin(), vec1local.end());
            vec2.insert(vec2.end(), vec2local.begin(), vec2local.end());
        }
    }
    // Then go on to work with vec1 and vec2...

Note 1: This was taken from "Best way to append a vector to a vector".

Note 2: val1 and val2 could be combined into a single object to maintain consistency and work with only one vector, but keeping them separate has so far proven practical for the rest of the algorithm.

Note 3: To give an order of magnitude, the for loop runs about 100 iterations, divided among 4 threads. Apart from a few exceptions, each iteration is expected to have the same workload (so the critical sections are reached at approximately the same time).

Note 4: For the record, this is all about image stabilization and runs on the Tegra K1 architecture, compiled with gcc 4.8.4.

2 answers

I'm going to assume that you need to use a vector and cannot use an array (otherwise your question is not very interesting). With t = omp_get_num_threads(), you can fill the vectors in parallel and then merge them in log2(t) operations instead of t operations (as you are doing now), like this:

    void reduce(std::vector<BasicType> *v1, std::vector<BasicType> *v2,
                int begin, int end) {
        if (end - begin == 1) return;
        int pivot = (begin + end) / 2;
        #pragma omp task
        reduce(v1, v2, begin, pivot);
        #pragma omp task
        reduce(v1, v2, pivot, end);
        #pragma omp taskwait
        v1[begin].insert(v1[begin].end(), v1[pivot].begin(), v1[pivot].end());
        v2[begin].insert(v2[begin].end(), v2[pivot].begin(), v2[pivot].end());
    }

and

    std::vector<BasicType> v1, v2;
    std::vector<BasicType> *v1p, *v2p;
    #pragma omp parallel
    {
        #pragma omp single
        {
            v1p = new std::vector<BasicType>[omp_get_num_threads()];
            v2p = new std::vector<BasicType>[omp_get_num_threads()];
        }
        #pragma omp for
        for (...) {
            // Do some intensive stuff to compute val1 and val2
            // ...
            v1p[omp_get_thread_num()].push_back(val1);
            v2p[omp_get_thread_num()].push_back(val2);
        }
        #pragma omp single
        reduce(v1p, v2p, 0, omp_get_num_threads());
    }
    v1 = v1p[0], v2 = v2p[0];
    delete[] v1p;
    delete[] v2p;

For example, with eight threads, this merges the threads' vectors in the following pairs:

    (0,1) (2,3) (4,5) (6,7)
    (0,2) (4,6)
    (0,4)

For more information on filling vectors in parallel, see this. For more information on merging threads in log2(t) operations, see the answer to this question.


Of your two proposals, I would prefer the first. It is simpler and does not require additional memory. However, I would suggest an alternative without a critical section:

    std::vector<BasicType> vec1(size);
    std::vector<BasicType> vec2(size);
    #pragma omp parallel for
    for (...) {
        // Do some intensive stuff to compute val1 and val2
        // ...
        vec1[i] = val1;
        vec2[i] = val2;
    }

Note that this may have some performance issues due to false sharing of cache lines. However, I would not worry about that unless it turns out to be a real problem, confirmed by actual performance analysis. Possible solutions would then be:

  • Go with your second proposal (which costs memory and additional post-processing).
  • Align your vectors and use appropriate chunk sizes for the loop (so that each thread works on its own cache lines).
  • Use a data structure that internally holds thread-local vectors but externally provides the necessary global vector operations. (This is probably the best solution.)

Source: https://habr.com/ru/post/1246162/

