I have a simple program that runs a Monte Carlo algorithm. One iteration of the algorithm has no side effects, so I should be able to run it with multiple threads. This is the relevant part of my program, which is written in C++11:
    void task(unsigned int max_iter, std::vector<unsigned int> *results,
              std::vector<unsigned int>::iterator iterator) {
        for (unsigned int n = 0; n < max_iter; ++n) {
            nume::Album album(535);
            unsigned int steps = album.fill_up();
            *iterator = steps;
            ++iterator;
        }
    }

    void aufgabe2() {
        std::cout << "\nAufgabe 2\n";

        unsigned int max_iter = 10000;
        unsigned int thread_count = 4;
        std::vector<std::thread> threads(thread_count);
        std::vector<unsigned int> results(max_iter);

        std::cout << "Computing with " << thread_count << " threads" << std::endl;

        int i = 0;
        for (std::thread &thread: threads) {
            std::vector<unsigned int>::iterator start =
                results.begin() + max_iter/thread_count * i;
            thread = std::thread(task, max_iter/thread_count, &results, start);
            i++;
        }

        for (std::thread &thread: threads) {
            thread.join();
        }

        std::ofstream out;
        out.open("out-2a.csv");
        for (unsigned int count: results) {
            out << count << std::endl;
        }
        out.close();

        std::cout << "Siehe Plot" << std::endl;
    }
The mysterious thing is that it gets slower the more threads I add. With 4 threads, I get the following:
    real    0m5.691s
    user    0m3.784s
    sys     0m10.844s
With one thread:
    real    0m1.145s
    user    0m0.816s
    sys     0m0.320s
I understand that moving data between CPU cores can add overhead, but the vector is allocated once at startup and never resized while the threads run. Is there any particular reason why this is slower on multiple cores?
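To make the claim about the vector concrete, here is a minimal sketch of the index arithmetic behind `results.begin() + max_iter/thread_count * i`. The helpers `slice_bounds`, `slices_are_disjoint`, and `unfilled_tail` are hypothetical names introduced only for illustration; they are not part of my program.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper (not in the original program): half-open index range
// [first, last) that worker `i` of `thread_count` writes to in a results
// vector of size `total`, mirroring `max_iter/thread_count * i`.
std::pair<std::size_t, std::size_t> slice_bounds(std::size_t total,
                                                 std::size_t thread_count,
                                                 std::size_t i) {
    assert(thread_count > 0 && i < thread_count);
    const std::size_t chunk = total / thread_count;  // integer division
    return {chunk * i, chunk * (i + 1)};
}

// True when the slices are contiguous, non-overlapping, and inside the
// vector, i.e. no two workers ever write to the same element.
bool slices_are_disjoint(std::size_t total, std::size_t thread_count) {
    std::size_t expected_first = 0;
    for (std::size_t i = 0; i < thread_count; ++i) {
        const auto b = slice_bounds(total, thread_count, i);
        if (b.first != expected_first || b.second > total) return false;
        expected_first = b.second;
    }
    return true;
}

// Number of trailing elements no worker fills when `total` is not
// divisible by `thread_count`.
std::size_t unfilled_tail(std::size_t total, std::size_t thread_count) {
    assert(thread_count > 0);
    return total - (total / thread_count) * thread_count;
}
```

For `max_iter = 10000` and `thread_count = 4`, worker 1 gets the indices 2500 up to (but not including) 5000, and no element is left unfilled. This only shows that the writes into `results` do not overlap; it says nothing about what happens inside `album.fill_up()`.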
My system is an i5-2550M with 4 logical cores (2 physical cores plus Hyper-Threading), and I use g++ (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3.
Update
I noticed that without threads (1) the load is mostly user time, while with threads (2) there is more kernel load than user load:
10K Runs:
http://wstaw.org/m/2013/05/08/stats3.png
100K Runs:
http://wstaw.org/m/2013/05/08/Auswahl_001.png
Current main.cpp
With 100K runs, I get the following:
No threads at all:
    real    0m28.705s
    user    0m28.468s
    sys     0m0.112s
One thread for each part of the program. These parts do not even use the same memory, so contention on a shared container should be ruled out as well. But it takes much longer:
    real    2m50.609s
    user    2m45.664s
    sys     4m35.772s
So although the three main parts use 300% of my CPU, they take about 6 times as long.
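The factor of 6 comes straight from the two `real` values for the 100K runs. As a rough sketch, the arithmetic can be written down as follows; `to_seconds`, `slowdown`, and `sys_share` are hypothetical helper names used only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: converts a duration as printed by `time`,
// e.g. "2m50.609s", into seconds.
double to_seconds(const std::string& t) {
    const std::size_t m = t.find('m');
    assert(m != std::string::npos);
    const double minutes = std::stod(t.substr(0, m));
    // std::stod parses the leading number and stops at the trailing 's'.
    const double seconds = std::stod(t.substr(m + 1));
    return minutes * 60.0 + seconds;
}

// Wall-clock slowdown of the threaded run relative to the serial run.
double slowdown(const std::string& real_threaded,
                const std::string& real_serial) {
    return to_seconds(real_threaded) / to_seconds(real_serial);
}

// Fraction of the total CPU time (user + sys) that was spent in the kernel.
double sys_share(const std::string& user, const std::string& sys) {
    const double u = to_seconds(user);
    const double s = to_seconds(sys);
    return s / (u + s);
}
```

With the numbers above, `slowdown("2m50.609s", "0m28.705s")` is about 5.9, and `sys_share("2m45.664s", "4m35.772s")` is roughly 0.62, compared with well under 0.01 for the run without threads.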
A 1M run took real 4m45. I ran 1M before, and it took at least real 20m, if not real 30m.