The program runs slower when using multiple threads

I have a simple program that runs a Monte Carlo algorithm. A single iteration of the algorithm has no side effects, so I should be able to run it with multiple threads. Here is the relevant part of my program, which is written in C++11:

    void task(unsigned int max_iter, std::vector<unsigned int> *results, std::vector<unsigned int>::iterator iterator) {
        for (unsigned int n = 0; n < max_iter; ++n) {
            nume::Album album(535);
            unsigned int steps = album.fill_up();
            *iterator = steps;
            ++iterator;
        }
    }

    void aufgabe2() {
        std::cout << "\nAufgabe 2\n";

        unsigned int max_iter = 10000;
        unsigned int thread_count = 4;

        std::vector<std::thread> threads(thread_count);
        std::vector<unsigned int> results(max_iter);

        std::cout << "Computing with " << thread_count << " threads" << std::endl;

        int i = 0;
        for (std::thread &thread : threads) {
            std::vector<unsigned int>::iterator start = results.begin() + max_iter / thread_count * i;
            thread = std::thread(task, max_iter / thread_count, &results, start);
            i++;
        }

        for (std::thread &thread : threads) {
            thread.join();
        }

        std::ofstream out;
        out.open("out-2a.csv");
        for (unsigned int count : results) {
            out << count << std::endl;
        }
        out.close();

        std::cout << "Siehe Plot" << std::endl;
    }

The mysterious thing is that it gets slower the more threads I add. With 4 threads, I get the following:

    real    0m5.691s
    user    0m3.784s
    sys     0m10.844s

With one thread:

    real    0m1.145s
    user    0m0.816s
    sys     0m0.320s

I understand that moving data between processor cores can cause overhead, but the vector is allocated at the start and never resized in the middle. Is there a particular reason why this is slower on multiple cores?

My system is an i5-2550M, which has 4 logical cores (2 physical plus Hyper-Threading), and I use g++ (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3.

Update

I noticed that without threads (1) the load is mostly user time, while with threads (2) there is more kernel (sys) time than user time:

10K Runs:

http://wstaw.org/m/2013/05/08/stats3.png

100K Runs:

http://wstaw.org/m/2013/05/08/Auswahl_001.png

Current main.cpp

With 100K runs, I get the following:

No threads at all:

    real    0m28.705s
    user    0m28.468s
    sys     0m0.112s

One thread for each main part of the program. These parts do not even touch the same memory, so contention for the same container should be ruled out as well. Yet it takes longer:

    real    2m50.609s
    user    2m45.664s
    sys     4m35.772s

So, even though the three main parts keep my CPU at 300%, the run takes 6 times longer.

A 1M run now completes in real 4m45s. Previously, a 1M run took at least real 20m, if not real 30m.

2 Answers

I looked at your current main.cpp on GitHub. In addition to the comments above:

  • Yes, rand() is not thread-safe, so before starting the multi-threaded part of the logic it may be advisable to pre-populate an array with random values (that way you reduce the number of potential locks). The same goes for memory allocation if you plan to do heap work: pre-allocate before going multi-threaded, or use a per-thread allocator.
  • Do not forget about other processes. If you use 4 threads on 4 cores, you are competing with other software (at least the OS routines) for CPU resources.
  • File output is a major source of locking. You perform a "<<" write on every iteration of the loop, and that is very expensive. (I remember an amusing case from my past: adding log output once "fixed" a multi-threaded bug indirectly, because the shared logger was guarded by a lock and so acted as a synchronization primitive.)
  • Finally, there is no guarantee that a multi-threaded application will be faster than a single-threaded one. Many CPU-specific and environment-related aspects come into play.

The results vector object is shared by all the threads you create, so even though your problem is embarrassingly parallel, the shared object means there is a good chance you are hitting cache misses through false sharing (I am not qualified to explain caches on modern architectures in detail). You should probably have n result vectors for your n threads and merge the results at the end. I think that will speed it up.
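A minimal sketch of the per-thread-vector idea, under the assumption that each thread owns its own std::vector and the merge happens once after join(); work() here is a hypothetical stand-in for the album.fill_up() call from the question.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Placeholder for the per-iteration computation (album.fill_up()).
unsigned int work(unsigned int n) { return n % 7; }

// Each thread appends to its own vector, so no two threads
// write into neighbouring elements of a shared buffer.
std::vector<unsigned int> run(unsigned int total, unsigned int nthreads) {
    std::vector<std::vector<unsigned int>> parts(nthreads);
    std::vector<std::thread> threads;
    unsigned int chunk = total / nthreads;
    for (unsigned int t = 0; t < nthreads; ++t)
        threads.emplace_back([&parts, t, chunk] {
            parts[t].reserve(chunk);  // private storage per thread
            for (unsigned int i = 0; i < chunk; ++i)
                parts[t].push_back(work(t * chunk + i));
        });
    for (std::thread &th : threads)
        th.join();
    std::vector<unsigned int> merged;  // merge once, single-threaded
    for (const std::vector<unsigned int> &p : parts)
        merged.insert(merged.end(), p.begin(), p.end());
    return merged;
}
```

The merge is O(total) and runs once, so its cost is negligible next to the iterations themselves.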

Another point worth mentioning: use std::async whenever possible instead of raw threads. It handles thread distribution and other low-level details for you. I read this in Scott Meyers' material on effective C++11. With raw threads, however, you can set the affinity of a thread to a specific core; so if your processor supports 8 threads, you can create 8 threads and pin each one to a core, at least on Linux.
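The std::async suggestion can be sketched as follows. This is not the asker's program: count_steps() is a hypothetical stand-in for one thread's share of the Monte Carlo iterations, and each future carries its result back so no shared vector is needed at all.

```cpp
#include <cassert>
#include <future>
#include <vector>

// Stand-in for running `iterations` Monte Carlo trials
// and accumulating a per-thread result.
unsigned int count_steps(unsigned int iterations) {
    unsigned int total = 0;
    for (unsigned int i = 0; i < iterations; ++i)
        total += i % 5;
    return total;
}

// Launch one async task per worker; get() joins and
// returns each task's value, so results are merged for free.
unsigned int parallel_sum(unsigned int total, unsigned int nthreads) {
    std::vector<std::future<unsigned int>> futures;
    for (unsigned int t = 0; t < nthreads; ++t)
        futures.push_back(std::async(std::launch::async,
                                     count_steps, total / nthreads));
    unsigned int sum = 0;
    for (std::future<unsigned int> &f : futures)
        sum += f.get();
    return sum;
}
```

std::launch::async forces a new thread per task; without it the implementation may defer the call and run it lazily on get().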


Source: https://habr.com/ru/post/944506/
