2 threads slower than 1?

I played with std::thread and something strange came up:

    #include <thread>

    int k = 0;

    int main() {
        std::thread t1([]() { while (k < 1000000000) { k = k + 1; } });
        std::thread t2([]() { while (k < 1000000000) { k = k + 1; } });
        t1.join();
        t2.join();
        return 0;
    }

When compiling the above code without optimization using clang++, I got the following timings:

    real    0m2.377s
    user    0m4.688s
    sys     0m0.005s

Then I changed the code to the following (now using only one thread):

    #include <thread>

    int k = 0;

    int main() {
        std::thread t1([]() { while (k < 1000000000) { k = k + 1; } });
        t1.join();
        return 0;
    }

And these were the new timings:

    real    0m2.304s
    user    0m2.298s
    sys     0m0.003s

Why is code using 2 threads slower than code using 1?

+6
4 answers

You have two threads fighting over the same variable, k. So you spend time with the processors saying "Processor 1: Hey, do you know what value k has? Processor 2: Sure, here you go!", ping-ponging the cache line back and forth every few updates. Since k is not atomic, there is also no guarantee that thread 2 doesn't write back an "old" value of k, so the next time thread 1 reads the value it may have jumped back 1, 2, 10 or 100 steps and has to redo them. In theory that could even mean that neither loop ever finishes, but that would take a lot of bad luck.
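For illustration, here is a minimal sketch of the atomicity fix alone, keeping the structure of the question's code (my sketch, assuming the same unoptimized build). It guarantees that no increment is lost and that both loops terminate, but the cache line still bounces between the cores, so it will not make two threads faster than one:

    #include <atomic>
    #include <thread>

    std::atomic<int> k(0);

    int main() {
        auto work = []() {
            while (k < 1000000000) {
                ++k; // atomic read-modify-write: no thread can write back a stale value
            }
        };
        std::thread t1(work);
        std::thread t2(work);
        t1.join();
        t2.join();
        return 0;
    }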

+15

This should really be a comment in response to Mats Petersson's answer, but I wanted to give code examples.

The problem is contention for a shared resource, and specifically for a cache line.

Alternative 1:

    #include <cstdint>
    #include <thread>
    #include <vector>
    #include <stdlib.h>

    static const uint64_t ITERATIONS = 10000000000ULL;

    int main(int argc, const char** argv) {
        size_t numThreads = 1;
        if (argc > 1) {
            numThreads = strtoul(argv[1], NULL, 10);
            if (numThreads == 0)
                return -1;
        }

        std::vector<std::thread> threads;

        uint64_t k = 0;
        for (size_t t = 0; t < numThreads; ++t) {
            threads.emplace_back([&k]() { // capture k by reference so we all use the same k.
                while (k < ITERATIONS) {
                    k++;
                }
            });
        }

        for (size_t t = 0; t < numThreads; ++t) {
            threads[t].join();
        }

        return 0;
    }

Here the threads contend for a single variable, performing both reads and writes. That forces the cache line holding k to ping-pong between the cores, causing contention and making the single-threaded case the most efficient.

Alternative 2:

    #include <atomic>
    #include <cstdint>
    #include <thread>
    #include <vector>
    #include <stdlib.h>

    static const uint64_t ITERATIONS = 10000000000ULL;

    int main(int argc, const char** argv) {
        size_t numThreads = 1;
        if (argc > 1) {
            numThreads = strtoul(argv[1], NULL, 10);
            if (numThreads == 0)
                return -1;
        }

        std::vector<std::thread> threads;

        std::atomic<uint64_t> k(0); // direct-init: copy-init of an atomic doesn't compile before C++17
        for (size_t t = 0; t < numThreads; ++t) {
            threads.emplace_back([&]() {
                // Imperfect division of labor, we'll fall short in some cases.
                for (size_t i = 0; i < ITERATIONS / numThreads; ++i) {
                    k++;
                }
            });
        }

        for (size_t t = 0; t < numThreads; ++t) {
            threads[t].join();
        }

        return 0;
    }

Here we divide the work deterministically (we fall short of the full count when numThreads does not divide ITERATIONS evenly, but it is close enough for this demonstration). Unfortunately, we still face contention for access to a single shared location in memory.
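As an aside, if the cost of the atomic increments themselves is a concern, a relaxed fetch_add is one possible sketch. This only weakens the memory ordering (a win on weakly ordered architectures, typically a wash on x86, where both forms compile to the same lock-prefixed instruction) and does nothing about the cache-line contention:

    k.fetch_add(1, std::memory_order_relaxed); // instead of k++, which is sequentially consistent

Alternative 3: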

    #include <cstdint>
    #include <thread>
    #include <vector>
    #include <stdlib.h>

    static const uint64_t ITERATIONS = 10000000000ULL;

    int main(int argc, const char** argv) {
        size_t numThreads = 1;
        if (argc > 1) {
            numThreads = strtoul(argv[1], NULL, 10);
            if (numThreads == 0)
                return -1;
        }

        std::vector<std::thread> threads;
        std::vector<uint64_t> ks(numThreads, 0); // one counter per thread
        for (size_t t = 0; t < numThreads; ++t) {
            threads.emplace_back([=, &ks]() {
                auto& k = ks[t];
                // Imperfect division of labor, we'll fall short in some cases.
                for (size_t i = 0; i < ITERATIONS / numThreads; ++i) {
                    k++;
                }
            });
        }

        uint64_t k = 0;
        for (size_t t = 0; t < numThreads; ++t) {
            threads[t].join();
            k += ks[t];
        }

        return 0;
    }

Again, the division of the workload is deterministic, and we do a small amount of work at the end to combine the results. However, we have done nothing to ensure that the counters are laid out favorably with respect to the CPU's cache lines: adjacent elements of ks can sit on the same cache line, so the threads still interfere with each other (false sharing). To address that:

Alternative 4:

    #include <cstdint>
    #include <mutex>
    #include <thread>
    #include <vector>
    #include <stdlib.h>

    static const uint64_t ITERATIONS = 10000000000ULL;
    #define CACHE_LINE_SIZE 128

    int main(int argc, const char** argv) {
        size_t numThreads = 1;
        if (argc > 1) {
            numThreads = strtoul(argv[1], NULL, 10);
            if (numThreads == 0)
                return -1;
        }

        std::vector<std::thread> threads;
        std::mutex kMutex;
        uint64_t k = 0;
        for (size_t t = 0; t < numThreads; ++t) {
            threads.emplace_back([&]() { // capture by reference: a std::mutex cannot be copied
                alignas(CACHE_LINE_SIZE) uint64_t myK = 0;
                // Imperfect division of labor, we'll fall short in some cases.
                for (uint64_t i = 0; i < ITERATIONS / numThreads; ++i) {
                    myK++;
                }
                kMutex.lock();
                k += myK;
                kMutex.unlock();
            });
        }

        for (size_t t = 0; t < numThreads; ++t) {
            threads[t].join();
        }

        return 0;
    }

Here we avoid contention between the threads down to the cache-line level, except for a single moment at the end, where each thread takes the mutex to merge its result. For this trivial workload the mutex is enormously expensive relative to the single addition it protects, but it is paid only once per thread. Alternatively, you could use alignas to give each thread its own storage at the outer scope and sum the results after the joins, eliminating the need for the mutex entirely. I leave this as an exercise for the reader.
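Here is one possible version of that exercise, a sketch of my own rather than a definitive implementation: each counter is padded to its own cache line, so the hot loops never share a line, and main() sums the counters after the joins, so no mutex is needed at all. Note that a std::vector of an over-aligned type needs C++17 for the allocation to honor the alignment:

    #include <cstdint>
    #include <thread>
    #include <vector>
    #include <stdlib.h>

    static const uint64_t ITERATIONS = 10000000000ULL;
    #define CACHE_LINE_SIZE 128

    struct alignas(CACHE_LINE_SIZE) PaddedCounter {
        uint64_t value = 0;
    };

    int main(int argc, const char** argv) {
        size_t numThreads = 1;
        if (argc > 1) {
            numThreads = strtoul(argv[1], NULL, 10);
            if (numThreads == 0)
                return -1;
        }

        std::vector<PaddedCounter> counters(numThreads);
        std::vector<std::thread> threads;
        for (size_t t = 0; t < numThreads; ++t) {
            threads.emplace_back([=, &counters]() {
                // Each thread touches only its own cache line inside the loop.
                for (uint64_t i = 0; i < ITERATIONS / numThreads; ++i) {
                    counters[t].value++;
                }
            });
        }

        uint64_t k = 0;
        for (size_t t = 0; t < numThreads; ++t) {
            threads[t].join();
            k += counters[t].value; // safe: join() happens-before this read
        }

        return 0;
    }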

+4

A more important question, it seems to me, than "why didn't this work?" is "how do I get this to work?". For this task, I think std::async (despite its significant flaws) is a better tool than using std::thread directly.

    #include <future>
    #include <iostream>
    #include <thread>
    #include <vector>

    int k = 0;
    unsigned tasks = std::thread::hardware_concurrency(); // note: may return 0 on some platforms
    unsigned reps = 1000000000 / tasks;

    int main() {
        std::vector<std::future<unsigned>> f;

        for (unsigned i = 0; i < tasks; i++)
            f.emplace_back(std::async(std::launch::async, []() {
                unsigned j;
                for (j = 0; j < reps; j++)
                    ; // each task spins through its own share of the count
                return j;
            }));

        for (unsigned i = 0; i < tasks; i++)
            k += f[i].get(); // get() waits for the task to finish

        std::cout << k << "\n";
        return 0;
    }
+2

I ran into this problem myself. My view is that for certain kinds of work, the cost of managing threads can exceed the benefit of running in threads. Here is my code sample, which does some real work in a loop with a large number of iterations, so I got very consistent numbers from the time command.

    pair<int, int> result{0, 0};
    #ifdef USETHREAD
        thread thread_l(&Myclass::trimLeft, this,
                        std::ref(fsq), std::ref(oriencnt), std::ref(result.first));
        thread thread_r(&Myclass::trimRight, this,
                        std::ref(fsq), std::ref(oriencnt), std::ref(result.second));
        thread_l.join();
        thread_r.join();
    #else
        // non-threaded version is faster
        trimLeft(fsq, oriencnt, result.first);
        trimRight(fsq, oriencnt, result.second);
    #endif
    return result;

Time results

             Thread      No_thread
    ===============================
    real     4m28s       2m49s
    usr      0m55s       2m49s
    sys      0m6.2s      0m0.012s

I am ignoring the fractions of a second for the larger values. My code updates only one shared variable, oriencnt; I have not let it update fsq yet. It seems that in the threaded version the system does more work, which increases the wall-clock (real) time. My default compiler flags are -g -O2, and I am not sure whether that is a key issue; when compiling with -O3 the difference is minimal. There is also some mutex-controlled I/O, but my experiments show it does not contribute to the difference. I am using gcc 5.4 with C++11. One possibility is that the library is not optimized.

Here are the results compiled with -O3:

             Thread      No_thread
    ===============================
    real     4m24s       2m44s
    usr      0m54s       2m44s
    sys      0m6.2s      0m0.016s
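If you want to put a number on the raw thread-management overhead itself, here is a minimal, self-contained sketch (my own; the iteration count is an arbitrary assumption) that measures the cost of creating and joining a thread with an empty body:

    #include <chrono>
    #include <iostream>
    #include <thread>

    int main() {
        const int N = 10000;
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i) {
            std::thread t([] {}); // empty body: measures pure create + join cost
            t.join();
        }
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::cout << us / double(N) << " us per spawn+join\n";
        return 0;
    }

Whatever that number is on your machine, it is a fixed cost that the work done per thread has to amortize.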
0

Source: https://habr.com/ru/post/947417/

