Speedup despite false sharing

Question

I ran some tests with OpenMP and wrote this program, which should not scale because of false sharing on the "sum" array. The problem is that it does scale. Even "worse":

with 1 thread: 4 seconds (icpc), 4 seconds (g++)
with 2 threads: 2 seconds (icpc), 2 seconds (g++)
with 4 threads: 0.5 seconds (icpc), 1 second (g++)

I really don't understand the speedup I get going from 2 threads to 4 threads with the Intel compiler. But most importantly: why is the scaling so good, even though the code should exhibit false sharing?

    #include <iostream>
    #include <chrono>
    #include <array>
    #include <omp.h>

    int main(int argc, const char *argv[])
    {
        const auto nb_threads = std::size_t{4};
        omp_set_num_threads(nb_threads);

        const auto num_steps = std::size_t{1000000000};
        const auto step = double{1.0 / num_steps};
        auto sum = std::array<double, nb_threads>{0.0};
        std::size_t actual_nb_threads;

        auto start_time = std::chrono::high_resolution_clock::now();
        #pragma omp parallel
        {
            const auto id = std::size_t{omp_get_thread_num()};
            if (id == 0) {
                // This is needed because OMP might give us less threads
                // than the numbers of threads requested
                actual_nb_threads = omp_get_num_threads();
            }
            for (auto i = std::size_t{0}; i < num_steps; i += nb_threads) {
                auto x = double{(i + 0.5) * step};
                sum[id] += 4.0 / (1.0 + x * x);
            }
        }
        auto pi = double{0.0};
        for (auto id = std::size_t{0}; id < actual_nb_threads; id++) {
            pi += step * sum[id];
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        auto time = std::chrono::duration_cast<std::chrono::nanoseconds>(end_time - start_time).count();

        std::cout << "Pi: " << pi << std::endl;
        std::cout << "Time: " << time / 1.0e9 << " seconds" << std::endl;
        std::cout << "Total nb of threads actually used: " << actual_nb_threads << std::endl;

        return 0;
    }

c++ multithreading openmp false-sharing

Insideloop Jun 08 '15 at 9:07

1 answer

Sneftel · Answer 1 · 2015-06-08T09:18:06+0000

This code could definitely exhibit false sharing if the compiler chose to implement it that way. But that would be a silly thing for the compiler to do.

In the first loop, each thread accesses only one element of sum. There is no reason to perform num_steps writes to the actual stack memory holding that element; it is much faster to keep the value in a register and write it back once the for loop is over. Since the array is neither volatile nor atomic, nothing stops the compiler from doing exactly that.

And, of course, the second loop never writes to the array, so there is no false sharing there either.
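To make the register argument concrete, here is a minimal sketch of the per-thread loop in three forms: written as in the question, with the accumulator hoisted by hand (roughly what an optimizer is allowed to produce), and with the slot accessed through a volatile pointer, which in practice makes mainstream compilers emit a load and a store on every iteration. The names kThreads, partial_direct, partial_hoisted, partial_volatile and estimate_pi are hypothetical and not part of the posted code. As an assumption, each loop starts at i = id so that the partial sums add up to pi, whereas the posted loop starts every thread at 0. The driver runs the four ids one after another, so it only checks that the variants agree; it does not measure anything.

```cpp
#include <array>
#include <cstddef>

// Hypothetical constant mirroring nb_threads from the question.
constexpr std::size_t kThreads = 4;

// Loop as written in the question: every iteration names sum[id].
// Nothing forces a store per iteration, so an optimizer may keep the
// running value in a register.
double partial_direct(std::array<double, kThreads>& sum, std::size_t id,
                      std::size_t num_steps) {
    const double step = 1.0 / static_cast<double>(num_steps);
    for (std::size_t i = id; i < num_steps; i += kThreads) {
        const double x = (static_cast<double>(i) + 0.5) * step;
        sum[id] += 4.0 / (1.0 + x * x);
    }
    return sum[id];
}

// The same loop with the hoisting done by hand: one load before the
// loop, one store after it, no shared cache line touched in between.
double partial_hoisted(std::array<double, kThreads>& sum, std::size_t id,
                       std::size_t num_steps) {
    const double step = 1.0 / static_cast<double>(num_steps);
    double local = sum[id];
    for (std::size_t i = id; i < num_steps; i += kThreads) {
        const double x = (static_cast<double>(i) + 0.5) * step;
        local += 4.0 / (1.0 + x * x);
    }
    sum[id] = local;
    return local;
}

// Access through a volatile pointer: compilers generally treat each
// read and write as observable, so the slot is reloaded and stored on
// every iteration (assumption: typical GCC/Clang/ICC behaviour).
double partial_volatile(std::array<double, kThreads>& sum, std::size_t id,
                        std::size_t num_steps) {
    const double step = 1.0 / static_cast<double>(num_steps);
    volatile double* slot = &sum[id];
    for (std::size_t i = id; i < num_steps; i += kThreads) {
        const double x = (static_cast<double>(i) + 0.5) * step;
        *slot = *slot + 4.0 / (1.0 + x * x);
    }
    return *slot;
}

// Serial driver for illustration only: calls one variant for each id
// in turn and combines the partial sums into an estimate of pi.
template <typename Partial>
double estimate_pi(Partial partial, std::size_t num_steps) {
    std::array<double, kThreads> sum{};
    double total = 0.0;
    for (std::size_t id = 0; id < kThreads; ++id) {
        total += partial(sum, id, num_steps);
    }
    return total / static_cast<double>(num_steps);
}
```

To look for the effect the question expected, one would presumably call partial_volatile from inside the OpenMP parallel region, with id taken from omp_get_thread_num(), and compare its timings against the other two variants. The outcome depends on the compiler, the optimization flags and the hardware.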
