OpenMP embarrassingly parallel for loop, no speedup

I have what seems to be a very simple parallel for loop, which just writes zeros into an integer array. But it turns out the more threads, the slower the loop runs. I thought this was due to some cache thrashing, so I played around with schedules, chunk size, __restrict__, nesting the parallel for inside a parallel block, and flushing. Then I noticed that reading the array to do a reduction is also slower.

This should obviously be very simple and should scale nearly linearly. What am I missing here?

Full code:

    #include <omp.h>
    #include <vector>
    #include <iostream>
    #include <ctime>

    void tic(), toc();

    int main(int argc, const char *argv[])
    {
        const int COUNT = 100;
        const size_t sz = 250000 * 200;
        std::vector<int> vec(sz, 1);

        std::cout << "max threads: " << omp_get_max_threads() << std::endl;

        std::cout << "serial reduction" << std::endl;
        tic();
        for(int c = 0; c < COUNT; ++ c) {
            double sum = 0;
            for(size_t i = 0; i < sz; ++ i)
                sum += vec[i];
        }
        toc();

        int *const ptr = vec.data();
        const int sz_i = int(sz); // some OpenMP implementations only allow parallel for with int

        std::cout << "parallel reduction" << std::endl;
        tic();
        for(int c = 0; c < COUNT; ++ c) {
            double sum = 0;
            #pragma omp parallel for default(none) reduction(+:sum)
            for(int i = 0; i < sz_i; ++ i)
                sum += ptr[i];
        }
        toc();

        std::cout << "serial memset" << std::endl;
        tic();
        for(int c = 0; c < COUNT; ++ c) {
            for(size_t i = 0; i < sz; ++ i)
                vec[i] = 0;
        }
        toc();

        std::cout << "parallel memset" << std::endl;
        tic();
        for(int c = 0; c < COUNT; ++ c) {
            #pragma omp parallel for default(none)
            for(int i = 0; i < sz_i; ++ i)
                ptr[i] = 0;
        }
        toc();

        return 0;
    }

    static clock_t ClockCounter;

    void tic()
    {
        ClockCounter = std::clock();
    }

    void toc()
    {
        ClockCounter = std::clock() - ClockCounter;
        std::cout << "\telapsed clock ticks: " << ClockCounter << std::endl;
    }

Running this gives:

    g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
    ./omp_test
    max threads: 12
    serial reduction
            elapsed clock ticks: 1790000
    parallel reduction
            elapsed clock ticks: 19690000
    serial memset
            elapsed clock ticks: 3860000
    parallel memset
            elapsed clock ticks: 20800000

If I run with -O2, g++ is able to optimize the serial reduction away and I get zero time, hence -O1. Also, adding omp_set_num_threads(1); makes the times more similar, although there are still some differences:

    g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
    ./omp_test
    max threads: 1
    serial reduction
            elapsed clock ticks: 1770000
    parallel reduction
            elapsed clock ticks: 7370000
    serial memset
            elapsed clock ticks: 2290000
    parallel memset
            elapsed clock ticks: 3550000

This should be pretty obvious; I feel like I am missing something elementary. My CPU is an Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz with hyperthreading, but the same behavior shows up on a colleague's i5 with 4 cores and no hyperthreading. We are both running Linux.

EDIT

It seems that part of the problem was the timing itself. Measuring wall-clock time instead:

    static double ClockCounter;

    void tic()
    {
        ClockCounter = omp_get_wtime(); //std::clock();
    }

    void toc()
    {
        ClockCounter = omp_get_wtime() /*std::clock()*/ - ClockCounter;
        std::cout << "\telapsed clock ticks: " << ClockCounter << std::endl;
    }

gives more "reasonable" times:

    g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
    ./omp_test
    max threads: 12
    serial reduction
            elapsed clock ticks: 1.80974
    parallel reduction
            elapsed clock ticks: 2.07367
    serial memset
            elapsed clock ticks: 2.37713
    parallel memset
            elapsed clock ticks: 2.23609

But there is still no speedup; it is simply no longer slower.

EDIT2

As suggested by user8046, the code is heavily memory bound. And as suggested by Z boson, the serial code is easily optimized away, so it is not certain what is being measured here. Therefore I made a small change: I moved the sum outside the loop, so that it is not zeroed at every iteration over c. I also replaced the reduction operation with sum += F(vec[i]) and the memset operation with vec[i] = F(i) (a sketch of the modified loops is at the end of this edit). Running it as:

    g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1 -D"F(x)=sqrt(double(x))"
    ./omp_test
    max threads: 12
    serial reduction
            elapsed clock ticks: 23.9106
    parallel reduction
            elapsed clock ticks: 3.35519
    serial memset
            elapsed clock ticks: 43.7344
    parallel memset
            elapsed clock ticks: 6.50351

Computing the square root adds more work per element, and finally there is some reasonable speedup (about 7x, which makes sense, since the hyperthreaded cores share memory bandwidth).
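For reference, this is roughly what the modified loops from this edit look like. It is a sketch reconstructed from the description above, assuming F is supplied on the command line (e.g. -D"F(x)=sqrt(double(x))") and that <cmath> is included so sqrt is available:

    double sum = 0;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        #pragma omp parallel for default(none) reduction(+:sum)
        for(int i = 0; i < sz_i; ++ i)
            sum += F(ptr[i]);   // reduction with extra work per element
    }
    toc();
    std::cout << "sum: " << sum << std::endl; // keep the result live so the compiler cannot drop the loop

    tic();
    for(int c = 0; c < COUNT; ++ c) {
        #pragma omp parallel for default(none)
        for(int i = 0; i < sz_i; ++ i)
            ptr[i] = F(i);      // "memset" with extra work per element
    }
    toc();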

+6
3 answers

You spotted the timing error. But there is still no speedup, because both of your test cases are heavily memory bound. On typical consumer hardware all of the cores share one memory bus, so using more threads does not give you more bandwidth, and since bandwidth is the bottleneck, there is no speedup. This will probably change if you shrink the problem size so it fits into cache, or certainly if you increase the amount of computation per data item, for example by computing the reduction of exp(vec[i]) or 1/vec[i]. For the memset: you can saturate the memory with a single thread, so you will never see a speedup there. (Only if you get access to a second memory bus with more threads, as on some multi-socket architectures.)

One remark regarding the reduction: it is most likely not implemented with a lock, which would be horribly inefficient, but with an addition tree, whose logarithmic speedup is not so bad.
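As an illustration of the "more computation per data item" point, here is a minimal sketch (mine, not from the answer) of a reduction over exp(vec[i]); with this much arithmetic per element the loop is no longer bandwidth bound and extra threads can actually help:

    #include <omp.h>
    #include <cmath>
    #include <vector>

    // Hypothetical compute-heavy reduction: exp() costs far more cycles per
    // element than the memory load, so the cores, not the bus, become the bottleneck.
    double expensive_reduction(const std::vector<int> &vec)
    {
        double sum = 0;
        const int n = int(vec.size());
        const int *p = vec.data();
        #pragma omp parallel for reduction(+:sum)
        for(int i = 0; i < n; ++i)
            sum += std::exp(double(p[i]));
        return sum;
    }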

+6

In addition to your mistake of using the clock function to measure time on Linux, the rest of your questions can be answered by reading these questions/answers:

in-an-openmp-parallel-code-would-there-be-any-benefit-for-memset-to-be-run-in-p / 11579987

measuring-memory-bandwidth-from-the-dot-product-of-two-arrays

memset-in-parallel-with-threads-bound-to-each-physical-core

So you should see a significant benefit from multiple threads with memset and with the reduction, even on a single-socket system. I wrote my own bandwidth-measurement tool for this. Below are some results from my i5-4250U (Haswell) with two cores (GCC 4.8, Linux 3.13, EGLIBC 2.19), running over 1 GB. vsum is your reduction. Notice that there is a significant improvement even on this two-core system.

single thread

    C standard library
                       GB    time(s)   GB/s    GFLOPS   efficiency
    memset:           0.50     0.80     6.68    0.00      inf %
    memcpy:           1.00     1.35     7.93    0.00      inf %

    Agner Fog asmlib
                       GB    time(s)   GB/s    GFLOPS   efficiency
    memset:           0.50     0.71     7.53    0.00      inf %
    memcpy:           1.00     0.93    11.51    0.00      inf %
    my_memset         0.50     0.71     7.53    0.00      inf %

    FMA3 reduction tests
                       GB    time(s)   GB/s    GFLOPS   efficiency
    vsum:             0.50     0.53    10.08    2.52      inf %
    vmul:             0.50     0.68     7.93    1.98      inf %
    vtriad:           0.50     0.70     7.71    3.85      inf %
    dot               1.00     1.08     9.93    2.48      inf %

two threads

    C standard library
                       GB    time(s)   GB/s    GFLOPS   efficiency
    memset:           0.50     0.64     8.33    0.00      inf %
    memcpy:           1.00     1.10     9.76    0.00      inf %

    Agner Fog asmlib
                       GB    time(s)   GB/s    GFLOPS   efficiency
    memset:           0.50     0.36    14.98    0.00      inf %
    memcpy:           1.00     0.66    16.30    0.00      inf %
    my_memset         0.50     0.36    15.03    0.00      inf %

    FMA3 tests
    standard sum tests with OpenMP: 2 threads
                       GB    time(s)   GB/s    GFLOPS   efficiency
    vsum:             0.50     0.41    13.03    3.26      inf %
    vmul:             0.50     0.39    13.67    3.42      inf %
    vtriad:           0.50     0.44    12.20    6.10      inf %
    dot               1.00     0.97    11.11    2.78      inf %

Here is my custom memset function (I have several other tests like this).

    void my_memset(int *s, int c, size_t n) {
        int i;
        __m128i v = _mm_set1_epi32(c);
        #pragma omp parallel for
        for(i = 0; i < n/4; i++) {
            _mm_stream_si128((__m128i*)&s[4*i], v);
        }
    }

Edit:

You should compile with -O3 and -ffast-math. Define the sum outside the outer loop and then print it, so that GCC does not optimize it away. GCC will not auto-vectorize a reduction because floating-point arithmetic is not associative and vectorizing the loop could break IEEE floating-point rules. Using -ffast-math allows floating-point arithmetic to be treated as associative, which lets GCC vectorize the reduction. It should be pointed out that an OpenMP reduction already assumes the floating-point arithmetic is associative, so it already breaks the IEEE floating-point rules.

    double sum = 0;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        #pragma omp parallel for reduction(+:sum)
        for(int i = 0; i < sz_i; ++ i)
            sum += ptr[i];
    }
    toc();
    printf("sum %f\n", sum);

Edit:

I tested your code and made some modifications. I get faster times with the reduction and memset using multiple threads:

    max threads: 4
    serial reduction
    dtime 1.86, sum 705032704
    parallel reduction
    dtime 1.39 s, sum 705032704
    serial memset
    dtime 2.95 s
    parallel memset
    dtime 2.44 s
    serial my_memset
    dtime 2.66 s
    parallel my_memset
    dtime 1.35 s

Here is the code I used (g++ foo.cpp -fopenmp -O3 -ffast-math):

    #include <omp.h>
    #include <vector>
    #include <iostream>
    #include <ctime>
    #include <stdio.h>
    #include <xmmintrin.h>

    void my_memset(int *s, int c, size_t n) {
        int i;
        __m128i v = _mm_set1_epi32(c);
        for(i = 0; i < n/4; i++) {
            _mm_stream_si128((__m128i*)&s[4*i], v);
        }
    }

    void my_memset_omp(int *s, int c, size_t n) {
        int i;
        __m128i v = _mm_set1_epi32(c);
        #pragma omp parallel for
        for(i = 0; i < n/4; i++) {
            _mm_stream_si128((__m128i*)&s[4*i], v);
        }
    }

    int main(int argc, const char *argv[])
    {
        const int COUNT = 100;
        const size_t sz = 250000 * 200;
        std::vector<int> vec(sz, 1);

        std::cout << "max threads: " << omp_get_max_threads() << std::endl;

        std::cout << "serial reduction" << std::endl;
        double dtime;
        int sum;

        dtime = -omp_get_wtime();
        sum = 0;
        for(int c = 0; c < COUNT; ++ c) {
            for(size_t i = 0; i < sz; ++ i)
                sum += vec[i];
        }
        dtime += omp_get_wtime();
        printf("dtime %.2f, sum %d\n", dtime, sum);

        int *const ptr = vec.data();
        const int sz_i = int(sz); // some OpenMP implementations only allow parallel for with int

        std::cout << "parallel reduction" << std::endl;
        dtime = -omp_get_wtime();
        sum = 0;
        for(int c = 0; c < COUNT; ++ c) {
            #pragma omp parallel for default(none) reduction(+:sum)
            for(int i = 0; i < sz_i; ++ i)
                sum += ptr[i];
        }
        dtime += omp_get_wtime();
        printf("dtime %.2f s, sum %d\n", dtime, sum);

        std::cout << "serial memset" << std::endl;
        dtime = -omp_get_wtime();
        for(int c = 0; c < COUNT; ++ c) {
            for(size_t i = 0; i < sz; ++ i)
                vec[i] = 0;
        }
        dtime += omp_get_wtime();
        printf("dtime %.2f s\n", dtime);

        std::cout << "parallel memset" << std::endl;
        dtime = -omp_get_wtime();
        for(int c = 0; c < COUNT; ++ c) {
            #pragma omp parallel for default(none)
            for(int i = 0; i < sz_i; ++ i)
                ptr[i] = 0;
        }
        dtime += omp_get_wtime();
        printf("dtime %.2f s\n", dtime);

        std::cout << "serial my_memset" << std::endl;
        dtime = -omp_get_wtime();
        for(int c = 0; c < COUNT; ++ c)
            my_memset(ptr, 0, sz_i);
        dtime += omp_get_wtime();
        printf("dtime %.2f s\n", dtime);

        std::cout << "parallel my_memset" << std::endl;
        dtime = -omp_get_wtime();
        for(int c = 0; c < COUNT; ++ c)
            my_memset_omp(ptr, 0, sz_i);
        dtime += omp_get_wtime();
        printf("dtime %.2f s\n", dtime);

        return 0;
    }
+3

You are using std::clock, which reports CPU time used, not real time. The time spent on each processor is added up, so with multiple threads it will always be higher than the single-threaded time (because of overhead).

http://en.cppreference.com/w/cpp/chrono/c/clock
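For what it's worth, here is a minimal sketch (mine, not from the answer) that makes the difference visible: inside a parallel region std::clock() adds up the CPU time of every thread, while omp_get_wtime() measures wall-clock time, so with N busy threads the first is roughly N times the second (compile with -fopenmp):

    #include <omp.h>
    #include <ctime>
    #include <cstdio>

    int main()
    {
        std::clock_t c0 = std::clock();
        double w0 = omp_get_wtime();

        // Busy-spin in parallel: every thread burns CPU time simultaneously.
        #pragma omp parallel
        {
            volatile double x = 0;
            for(long i = 0; i < 200000000L; ++i) x += 1.0;
        }

        double cpu_s  = double(std::clock() - c0) / CLOCKS_PER_SEC; // summed over threads
        double wall_s = omp_get_wtime() - w0;                       // elapsed real time
        std::printf("cpu %.2f s, wall %.2f s\n", cpu_s, wall_s);
        return 0;
    }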

+1

Source: https://habr.com/ru/post/974688/

