I have what seems to be a very simple parallel for loop that just writes zeros to an integer array. But it turns out that the more threads I use, the slower the loop gets. I assumed this was due to some cache effect, so I played around with schedules, chunk sizes, __restrict__, nesting the parallel for inside a parallel block, and resetting the array in between. Then I noticed that a reduction, which only reads the array, is slower as well.
This should obviously be very simple and should scale almost linearly. What am I missing here?
Full code:
#include <omp.h>
#include <vector>
#include <iostream>
#include <ctime>

void tic(), toc();

int main(int argc, const char *argv[])
{
    const int COUNT = 100;
    const size_t sz = 250000 * 200;
    std::vector<int> vec(sz, 1);

    std::cout << "max threads: " << omp_get_max_threads() << std::endl;

    std::cout << "serial reduction" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        double sum = 0;
        for(size_t i = 0; i < sz; ++ i)
            sum += vec[i];
    }
    toc();

    int *const ptr = vec.data();
    const int sz_i = int(sz); // some OpenMP implementations only allow parallel for with int

    std::cout << "parallel reduction" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        double sum = 0;
        #pragma omp parallel for default(none) reduction(+:sum)
        for(int i = 0; i < sz_i; ++ i)
            sum += ptr[i];
    }
    toc();

    std::cout << "serial memset" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        for(size_t i = 0; i < sz; ++ i)
            vec[i] = 0;
    }
    toc();

    std::cout << "parallel memset" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        #pragma omp parallel for default(none)
        for(int i = 0; i < sz_i; ++ i)
            ptr[i] = 0;
    }
    toc();

    return 0;
}

static clock_t ClockCounter;

void tic()
{
    ClockCounter = std::clock();
}

void toc()
{
    ClockCounter = std::clock() - ClockCounter;
    std::cout << "\telapsed clock ticks: " << ClockCounter << std::endl;
}
Running this gives:
g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
./omp_test
max threads: 12
serial reduction
    elapsed clock ticks: 1790000
parallel reduction
    elapsed clock ticks: 19690000
serial memset
    elapsed clock ticks: 3860000
parallel memset
    elapsed clock ticks: 20800000
If I compile with -O2, g++ is able to optimize the serial reduction away and I get zero time, hence -O1. Also, adding omp_set_num_threads(1); makes the times more similar, although there are still some differences:
g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
./omp_test
max threads: 1
serial reduction
    elapsed clock ticks: 1770000
parallel reduction
    elapsed clock ticks: 7370000
serial memset
    elapsed clock ticks: 2290000
parallel memset
    elapsed clock ticks: 3550000
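(The call simply goes before the first parallel region; the exact placement below is my reconstruction, but anywhere before the first timed loop has the same effect.)

#include <omp.h>
// ...
int main(int argc, const char *argv[])
{
    omp_set_num_threads(1); // force a single OpenMP thread
    std::cout << "max threads: " << omp_get_max_threads() << std::endl; // now reports 1
    // ... rest of the benchmark unchanged
    return 0;
}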
This should be a pretty obvious case; I feel like I am missing something elementary. My CPU is an Intel(R) Xeon(R) E5-2640 0 @ 2.50GHz with hyper-threading, but the same behaviour is observed on a colleague's i5 with 4 cores and no hyper-threading. We are both running Linux.
EDIT
It seems that part of the problem was the timing itself: std::clock() reports CPU time summed over all threads on Linux, so the parallel timings were inflated. Timing with:
static double ClockCounter;

void tic()
{
    ClockCounter = omp_get_wtime();
}
// toc() is changed the same way
gives more "reasonable" times:
g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
./omp_test
max threads: 12
serial reduction
    elapsed clock ticks: 1.80974
parallel reduction
    elapsed clock ticks: 2.07367
serial memset
    elapsed clock ticks: 2.37713
parallel memset
    elapsed clock ticks: 2.23609
But there is still no speedup; it is just not slower anymore.
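As an aside, a minimal standalone sketch (not part of the benchmark; the loop size is arbitrary) illustrates why std::clock() was so misleading: it adds up the CPU time of every thread, so a region that parallelizes perfectly still reports roughly N times the wall time:

#include <omp.h>
#include <ctime>
#include <iostream>

int main()
{
    const double t0 = omp_get_wtime();
    const std::clock_t c0 = std::clock();

    // every thread burns roughly the same amount of CPU time
    #pragma omp parallel
    {
        volatile double x = 0;
        for(int i = 0; i < 200000000; ++ i)
            x += i;
    }

    std::cout << "wall time:    " << omp_get_wtime() - t0 << " s" << std::endl;
    std::cout << "std::clock(): " << double(std::clock() - c0) / CLOCKS_PER_SEC
              << " s (CPU time summed over all threads)" << std::endl;
    return 0;
}

With 12 threads the second number comes out roughly 12 times larger than the first, which is the kind of inflation visible in the very first set of measurements.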
EDIT2
As suggested by user 8046, the code is heavily memory bound, and, as suggested by Z boson, the serial code is easily optimized away, so it is not clear what is being measured here. I therefore made a small change: the sum is now accumulated outside the loop over c, so it is not reset to zero at every iteration. I also replaced the reduction body with sum += F(vec[i]) and the memset body with vec[i] = F(i). This runs as:
g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1 -D"F(x)=sqrt(double(x))"
./omp_test
max threads: 12
serial reduction
    elapsed clock ticks: 23.9106
parallel reduction
    elapsed clock ticks: 3.35519
serial memset
    elapsed clock ticks: 43.7344
parallel memset
    elapsed clock ticks: 6.50351
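For completeness, the modified loops look roughly like this. This is my reconstruction of the edit described above (timing calls omitted), with F supplied by the -D"F(x)=sqrt(double(x))" flag and the pragmas unchanged from the original code:

#include <omp.h>
#include <vector>
#include <cmath>    // needed because F(x) expands to sqrt(double(x))
#include <iostream>

int main()
{
    const int COUNT = 100;
    const size_t sz = 250000 * 200;
    std::vector<int> vec(sz, 1);
    int *const ptr = vec.data();
    const int sz_i = int(sz);

    double sum = 0; // accumulated across all iterations over c, not reset each time
    for(int c = 0; c < COUNT; ++ c) {
        #pragma omp parallel for default(none) reduction(+:sum)
        for(int i = 0; i < sz_i; ++ i)
            sum += F(ptr[i]);   // per-element work instead of a plain load
    }

    for(int c = 0; c < COUNT; ++ c) {
        #pragma omp parallel for default(none)
        for(int i = 0; i < sz_i; ++ i)
            ptr[i] = F(i);      // computed store instead of writing a constant
    }

    std::cout << sum << std::endl; // keep the result observable
    return 0;
}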
Computing the square root adds enough per-element work for the threads, and finally there is a reasonable speedup: about 7x (23.9/3.36 ≈ 7.1 for the reduction, 43.7/6.5 ≈ 6.7 for the memset), which makes sense given that the hyper-threaded cores share memory bandwidth.