This should be a comment in response to Matz Petersson's answer, but I wanted to give code examples.
The problem lies in contention over a particular resource, and also over cache lines.
Alternative 1:
#include <cstdint>
#include <thread>
#include <vector>
#include <stdlib.h>

static const uint64_t ITERATIONS = 10000000000ULL;

int main(int argc, const char** argv) {
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = strtoul(argv[1], NULL, 10);
        if (numThreads == 0)
            return -1;
    }

    std::vector<std::thread> threads;

    uint64_t k = 0;
    for (size_t t = 0; t < numThreads; ++t) {
        threads.emplace_back([&k]() {
            // capture k by reference so we all use the same k
            while (k < ITERATIONS) {
                k++;
            }
        });
    }

    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
    }

    return 0;
}
Here the threads contend for a single variable, each both reading and writing it, which forces its cache line to ping-pong between the cores; that contention is what makes the single-threaded case the most efficient.
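If you want to see this effect for yourself, one quick way (my own addition, not part of the original alternatives) is to wrap the thread launch/join section of each variant in a std::chrono timer and run the binary with different thread counts. A minimal sketch of such a helper, where reportElapsed is a hypothetical name:

#include <chrono>
#include <cstdio>

// Hypothetical timing helper: pass the launch-and-join block of any of the
// variants in this answer as the callable and it prints the wall-clock time.
template <typename Fn>
void reportElapsed(const char* label, Fn&& body) {
    auto start = std::chrono::steady_clock::now();
    body();  // launches the threads and joins them
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    std::printf("%s: %.3f s\n", label, elapsed.count());
}

The contended variants tend to get slower as the thread count grows, while the final variant is the one with a realistic chance of scaling; exact numbers depend entirely on the machine. The next alternative divides the work up front and uses an atomic counter: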
#include <cstdint>
#include <thread>
#include <vector>
#include <stdlib.h>
#include <atomic>

static const uint64_t ITERATIONS = 10000000000ULL;

int main(int argc, const char** argv) {
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = strtoul(argv[1], NULL, 10);
        if (numThreads == 0)
            return -1;
    }

    std::vector<std::thread> threads;

    std::atomic<uint64_t> k(0);
    for (size_t t = 0; t < numThreads; ++t) {
        threads.emplace_back([&]() {
            // Imperfect division of labor, we'll fall short in some cases.
            for (uint64_t i = 0; i < ITERATIONS / numThreads; ++i) {
                k++;
            }
        });
    }

    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
    }

    return 0;
}
Here we divide the work deterministically (we fall short in cases where numThreads does not evenly divide ITERATIONS, but it is close enough for this demonstration). Unfortunately, we still face contention for access to a single shared element in memory.
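As an aside, if you wanted the division to come out exact, a small sketch of one way to do it (splitWork is my own hypothetical helper, not part of the original code) is to hand the remainder out one iteration at a time:

#include <cstdint>
#include <vector>

// Hypothetical helper: split `total` iterations across `numThreads` workers so
// that every iteration is accounted for, even when the division is not even.
std::vector<uint64_t> splitWork(uint64_t total, size_t numThreads) {
    std::vector<uint64_t> shares(numThreads, total / numThreads);
    uint64_t remainder = total % numThreads;
    for (size_t t = 0; t < remainder; ++t) {
        shares[t] += 1;  // the first `remainder` threads do one extra iteration
    }
    return shares;
}

Each thread t would then loop shares[t] times instead of ITERATIONS / numThreads. The next alternative keeps the simple division and instead gives each thread its own counter: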
#include <cstdint>
#include <thread>
#include <vector>
#include <stdlib.h>
#include <atomic>

static const uint64_t ITERATIONS = 10000000000ULL;

int main(int argc, const char** argv) {
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = strtoul(argv[1], NULL, 10);
        if (numThreads == 0)
            return -1;
    }

    std::vector<std::thread> threads;
    std::vector<uint64_t> ks(numThreads, 0);  // one counter per thread

    for (size_t t = 0; t < numThreads; ++t) {
        threads.emplace_back([=, &ks]() {
            auto& k = ks[t];

            // Imperfect division of labor, we'll fall short in some cases.
            for (uint64_t i = 0; i < ITERATIONS / numThreads; ++i) {
                k++;
            }
        });
    }

    uint64_t k = 0;
    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
        k += ks[t];
    }

    return 0;
}
Again, the work is divided deterministically, and we spend a little effort at the end to combine the results. However, we did nothing to ensure that the counters are laid out favorably with respect to the CPU's cache lines, so adjacent elements of ks can still share a line (false sharing). For that:
#include <cstdint>
#include <thread>
#include <vector>
#include <mutex>
#include <stdlib.h>

static const uint64_t ITERATIONS = 10000000000ULL;

#define CACHE_LINE_SIZE 128

int main(int argc, const char** argv) {
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = strtoul(argv[1], NULL, 10);
        if (numThreads == 0)
            return -1;
    }

    std::vector<std::thread> threads;
    std::mutex kMutex;

    uint64_t k = 0;
    for (size_t t = 0; t < numThreads; ++t) {
        threads.emplace_back([=, &k, &kMutex]() {
            // Each thread accumulates into its own cache-line-aligned local counter.
            alignas(CACHE_LINE_SIZE) uint64_t myK = 0;

            // Imperfect division of labor, we'll fall short in some cases.
            for (uint64_t i = 0; i < ITERATIONS / numThreads; ++i) {
                myK++;
            }

            // Publish the local result under the mutex.
            kMutex.lock();
            k += myK;
            kMutex.unlock();
        });
    }

    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
    }

    return 0;
}
Here we avoid contention between threads down to the cache-line level, except for the single case at the end, where we use a mutex to synchronize. For this trivial workload, the mutex has one hell of a relative cost. Alternatively, you could use alignas to give each thread its own storage at the outer scope and sum the results after the threads have joined, eliminating the need for the mutex entirely. I leave that as an exercise for the reader.
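For what it's worth, one possible shape of that exercise, as a sketch only (the PaddedCounter struct and its name are my own, not from the original answer, and it assumes C++17 or later so that std::vector honors the over-aligned type):

#include <cstdint>
#include <thread>
#include <vector>
#include <stdlib.h>

static const uint64_t ITERATIONS = 10000000000ULL;

#define CACHE_LINE_SIZE 128

// Pad each per-thread counter out to its own cache line so that neighboring
// counters cannot share a line (hypothetical struct, not from the original).
struct alignas(CACHE_LINE_SIZE) PaddedCounter {
    uint64_t value = 0;
};

int main(int argc, const char** argv) {
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = strtoul(argv[1], NULL, 10);
        if (numThreads == 0)
            return -1;
    }

    std::vector<std::thread> threads;
    std::vector<PaddedCounter> counters(numThreads);

    for (size_t t = 0; t < numThreads; ++t) {
        threads.emplace_back([=, &counters]() {
            // Same imperfect division of labor as above.
            for (uint64_t i = 0; i < ITERATIONS / numThreads; ++i) {
                counters[t].value++;
            }
        });
    }

    uint64_t k = 0;
    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
        k += counters[t].value;  // read each counter only after its writer has finished
    }

    return 0;
}

Since each thread only ever touches its own counter, and the main thread reads the counters only after join(), no mutex or atomic is needed at all.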