Determining the optimal number of threads

So, as part of a school assignment, we were asked to determine the optimal thread count for our personal computers by building a toy program.

To get started, we need to create a task that takes 20 to 30 seconds to run. I decided to make a coin-toss simulation where the total number of heads and tails is accumulated and then displayed. On my machine, 300 million tosses on one thread finished in 25 seconds. After that I tried 2 threads, then 4, 8, 16, 32 and, just for fun, 100. Here are the results:

Threads   Tosses per thread   Time (seconds)
--------------------------------------------
   1        300,000,000           25
   2        150,000,000           13
   4         75,000,000           13
   8         37,500,000           13
  16         18,750,000           14
  32          9,375,000           14
 100          3,000,000           14

And here is the code I'm using:

```cpp
#include <ctime>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

using namespace std;

void toss() {
    int heads = 0, tails = 0;
    default_random_engine gen;
    uniform_int_distribution<int> dist(0, 1);
    int max = 3000000;  // tosses per thread
    for (int x = 0; x < max; ++x) { (dist(gen)) ? ++heads : ++tails; }
    cout << heads << " " << tails << endl;
}

int main() {
    vector<thread> thr;
    time_t st, fin;
    st = time(0);
    for (int i = 0; i < 100; ++i) { thr.push_back(thread(toss)); }  // thread count
    for (auto& t : thr) { t.join(); }
    fin = time(0);
    cout << fin - st << " seconds\n";
    return 0;
}
```

Now for the main question:

Past a certain point, I would have expected a significant decrease in computational speed as more threads were added, but the results do not seem to show this.

Is there something fundamentally wrong with my code that would produce these results, or is this behavior considered normal? I am very new to multithreading, so I may well be missing something obvious.

Thanks!

EDIT: I am running this on a MacBook with a 2.16 GHz Core 2 Duo (T7400)

+5
2 answers

Your results seem very normal to me. Although thread creation has a cost, it is not much (especially compared to the one-second granularity of your timing). Creating and destroying an extra 100 threads, plus the resulting context switches, will not change your time by more than a few milliseconds.

Running on my Intel i7-4790 @ 3.60 GHz I get these numbers:

threads - seconds
-----------------
      1 - 6.021
      2 - 3.205
      4 - 1.825
      8 - 1.062
     16 - 1.128
     32 - 1.138
    100 - 1.213
   1000 - 2.312
  10000 - 23.319

It takes many, many more threads before the additional threads make a noticeable difference. Only at 1000 threads do I see thread management making a significant difference, and at 10000 it eclipses the loop itself (the loop only runs 30,000 tosses per thread at that point).

As for your goal, it should be fairly easy to convince yourself that the optimal number of threads for your system is the number of hardware threads that can execute simultaneously. With more threads than that, there is no processing power left to run another thread until one finishes or yields, so the extra threads will not help you finish faster. With fewer threads, you are not using all the available resources. My CPU has 8 hardware threads, and the chart reflects this.

+9

Edit 2 - Elaborating on the "lack of a performance penalty" part, by popular demand:

... I would have expected a significant decrease in computational speed as more threads are added, but the results do not seem to show this.

I made this chart to better illustrate the scaling.

[chart: total time (blue), time per toss (red), overhead (green), and difference from the best time (black outline) vs. thread count]

To explain the results:

The blue bar shows the total execution time of all tosses. Although this time keeps decreasing up to 256 threads, the gains from doubling the thread count get smaller and smaller. The CPU I ran this test on has 4 physical and 8 logical cores. Scaling is pretty good up to 4 cores and decent up to 8, and then it drops off sharply. Pipeline saturation allows for marginal gains all the way up to 256, but they are simply not worth it.

The red bar shows the time per toss. It is nearly identical for 1 and 2 threads, since the CPU pipeline has not yet reached full saturation. It takes a minor hit at 4 threads; things still work fine, but now the pipeline is saturated. At 8 threads it really shows that logical threads are not the same as physical ones, and it gets progressively worse beyond 8 threads.

The green bar shows the overhead, i.e. how much actual performance falls short of the expected doubling from doubling the threads. Pushing past the available logical cores makes overhead climb rapidly. Note that this is mostly thread synchronization; the actual cost of thread scheduling is probably constant after a given point. There is a minimum slice of active time a thread must receive, which explains why thread switching never reaches an overwhelming cost. In fact, there is no serious decline in performance up to 4k threads, which is expected since modern systems can, and often do, run more than a thousand threads in parallel. And again, most of that decline is due to thread synchronization, not thread switching.

The black outline shows the time difference relative to the lowest time. At 8 threads we lose only ~14% of absolute performance by not oversaturating the pipeline, which is good, because stressing the whole system further is rarely worth it. It also shows that 1 thread is only ~6 times slower than the maximum the CPU can deliver. That gives a measure of how good logical cores are relative to physical ones: 100% more logical cores yield a 50% increase in performance, so in this case a logical thread is worth ~50% of a physical thread, which also correlates with the ~47% increase we see going from 4 to 8 threads. Note that this is a very simplistic workload; in more demanding cases the figure is closer to 20-25% for this particular CPU, and under some extreme loads it is even a net loss.


Edit 1 - I foolishly forgot to isolate the computational load from the thread synchronization overhead.

Running the test with a tiny amount of work per thread shows that, for a large thread count, thread management takes up most of the time. Thus the penalty of thread switching itself is really very small and, possibly, beyond a certain point a constant.

And that makes a lot of sense if you put yourself in the shoes of a thread-scheduler implementer. The scheduler can easily be protected from an unreasonably high switch-to-work ratio: there is presumably a minimum amount of time the scheduler will give a thread before switching to another while the rest are put on hold. This ensures that the switch-to-work ratio never exceeds reasonable limits. It is much better to stall other threads than to go crazy with thread switching, since the CPU would then mostly be switching and doing very little actual work.


The optimal number of threads is the number of logical CPU cores available. This ensures optimal pipeline saturation.

If you use more, you will suffer a performance penalty due to the cost of thread context switching. The more threads, the bigger the penalty.

If you use fewer, you will not exploit the hardware's full potential.

There is also the issue of workload scaling, which matters a lot when you use synchronization such as a mutex. If your concurrency is too fine-grained, you can see a performance drop even going from 1 to 2 threads on an 8-thread machine. You want to synchronize as little as possible, doing as much work as possible between synchronization points; otherwise you may face big performance losses.
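A minimal sketch of that difference, applied to the coin-toss example (the function names are mine, and this assumes a single shared counter protected by one mutex): locking on every toss serializes the threads, while accumulating locally and locking once per thread keeps synchronization negligible.

```cpp
#include <mutex>
#include <random>
#include <thread>

std::mutex m;
long shared_heads = 0;

// Too fine-grained: takes the mutex on every single toss, so threads
// spend most of their time waiting on each other instead of tossing.
void toss_fine(long tosses) {
    std::default_random_engine gen(std::random_device{}());
    std::uniform_int_distribution<int> dist(0, 1);
    for (long i = 0; i < tosses; ++i) {
        std::lock_guard<std::mutex> lock(m);
        shared_heads += dist(gen);
    }
}

// Coarse-grained: accumulates into a local variable and touches the
// mutex exactly once per thread, so synchronization cost is negligible.
void toss_coarse(long tosses) {
    std::default_random_engine gen(std::random_device{}());
    std::uniform_int_distribution<int> dist(0, 1);
    long local = 0;
    for (long i = 0; i < tosses; ++i) local += dist(gen);
    std::lock_guard<std::mutex> lock(m);
    shared_heads += local;
}
```

Both versions compute the same total; only the lock frequency differs, and with many threads that difference dominates the runtime.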

Note the difference between physical and logical CPU cores. Processors with hyper-threading can have more than one logical core per physical core. The secondary logical cores do not have the same processing power as the primary ones, since they merely exploit idle slots in the CPU pipeline.

For example, if you have a hyper-threaded quad-core processor (8 logical cores), then with a perfectly scalable workload you will see a performance increase of about 4x going from 1 to 4 threads, but much less going from 4 to 8 threads, as can be seen in vu1p3n0x's answer.

You can use std::thread::hardware_concurrency() to determine the number of logical cores available.

+4

Source: https://habr.com/ru/post/1260109/
