I'm going to assume that you need to use a vector and cannot use an array (otherwise your question is not very interesting). With t = omp_get_num_threads(), you populate one vector per thread in parallel and then combine them in log2(t) rounds of pairwise merges instead of t sequential appends (as you are doing now), like this
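    // Merges the per-thread vectors in [begin, end) into index `begin`
    // by recursive halving; the two halves are merged as separate tasks.
    void reduce(std::vector<BasicType> *v1, std::vector<BasicType> *v2, int begin, int end) {
        if(end - begin == 1) return;
        int pivot = (begin+end)/2;
        #pragma omp task
        reduce(v1, v2, begin, pivot);
        #pragma omp task
        reduce(v1, v2, pivot, end);
        #pragma omp taskwait
        // Both halves are complete: append the right half to the left half.
        v1[begin].insert(v1[begin].end(), v1[pivot].begin(), v1[pivot].end());
        v2[begin].insert(v2[begin].end(), v2[pivot].begin(), v2[pivot].end());
    }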
and
    std::vector<BasicType> v1, v2;
    std::vector<BasicType> *v1p, *v2p;
    #pragma omp parallel
    {
        // One thread allocates a vector per thread; the implicit barrier
        // at the end of single makes the arrays visible to the whole team.
        #pragma omp single
        {
            v1p = new std::vector<BasicType>[omp_get_num_threads()];
            v2p = new std::vector<BasicType>[omp_get_num_threads()];
        }
        #pragma omp for
        for(...) {
            // Do some intensive stuff to compute val1 and val2
            // ...
            v1p[omp_get_thread_num()].push_back(val1);
            v2p[omp_get_thread_num()].push_back(val2);
        }
        // One thread starts the merge; the tasks it creates are shared by the team.
        #pragma omp single
        reduce(v1p, v2p, 0, omp_get_num_threads());
    }
    v1 = v1p[0];
    v2 = v2p[0];
    delete[] v1p;
    delete[] v2p;
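To make the pattern concrete, here is a minimal self-contained sketch of the same idea. It is not the code from the question: `collect_squares` and `merge_buckets` are hypothetical names, the workload (squares of 0..n-1) is made up for illustration, and a `std::vector` of vectors stands in for the `new[]`/`delete[]` arrays. The `_OPENMP` fallbacks are an assumption that lets the sketch build single-threaded when OpenMP is not enabled.

```cpp
#include <cassert>
#include <utility>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Fallbacks so the sketch also builds without OpenMP (one thread).
static int omp_get_num_threads() { return 1; }
static int omp_get_thread_num() { return 0; }
#endif

// Hypothetical helper: merge buckets [begin, end) into parts[begin]
// by recursive halving, one task per half.
static void merge_buckets(std::vector<std::vector<long>>& parts, int begin, int end) {
    if (end - begin <= 1) return;
    int pivot = begin + (end - begin) / 2;
    // parts is a reference, so it is marked shared explicitly.
    #pragma omp task shared(parts)
    merge_buckets(parts, begin, pivot);
    #pragma omp task shared(parts)
    merge_buckets(parts, pivot, end);
    #pragma omp taskwait
    parts[begin].insert(parts[begin].end(), parts[pivot].begin(), parts[pivot].end());
}

// Hypothetical workload: collect i*i for i in [0, n).
std::vector<long> collect_squares(int n) {
    std::vector<std::vector<long>> parts;  // one bucket per thread
    #pragma omp parallel
    {
        #pragma omp single
        parts.resize(omp_get_num_threads());
        // Implicit barrier after single: parts is sized before anyone indexes it.

        std::vector<long>& mine = parts[omp_get_thread_num()];
        #pragma omp for schedule(static)
        for (int i = 0; i < n; ++i)
            mine.push_back(static_cast<long>(i) * i);

        // Implicit barrier after the loop: all buckets are filled before merging.
        #pragma omp single
        merge_buckets(parts, 0, static_cast<int>(parts.size()));
    }
    assert(!parts.empty());
    return std::move(parts[0]);
}
```

Calling `collect_squares(10)` returns the ten squares 0, 1, 4, ..., 81. The order of the elements depends on how the loop iterations are distributed across threads, so sort the result afterwards if your code relies on a particular order.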
For example, using eight threads, the per-thread vectors are merged in the following pairs (the second vector of each pair is appended to the first)
    (0,1) (2,3) (4,5) (6,7)
    (0,2) (4,6)
    (0,4)
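If you want to check the merge pattern for another thread count, the recursion can be traced serially. The sketch below is only an illustration: `record_merges`, `merge_pairs` and `MergeLog` are hypothetical names, and no OpenMP is involved, so the pairs come out in depth-first order rather than grouped by round.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Each entry is (destination index, source index) of one append.
using MergeLog = std::vector<std::pair<int, int>>;

// Hypothetical helper: same halving recursion as the merge routine,
// but it only records which pair would be merged.
void record_merges(int begin, int end, MergeLog& log) {
    if (end - begin <= 1) return;
    int pivot = begin + (end - begin) / 2;
    record_merges(begin, pivot, log);
    record_merges(pivot, end, log);
    log.emplace_back(begin, pivot);  // vector `pivot` is appended to vector `begin`
}

// Returns every merge performed for the given number of threads.
MergeLog merge_pairs(int threads) {
    assert(threads >= 1);
    MergeLog log;
    record_merges(0, threads, log);
    return log;
}
```

For eight threads `merge_pairs(8)` yields seven pairs, ending with (0,4). The total number of appends is still t - 1; what shrinks to log2(t) is the number of rounds, because the merges within one round are independent and can run at the same time.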
For more information on parallel vector filling, see this. For more information on thread merging in log2(t) operations, see the answer to this question.