C# ThreadPool Implementation / Performance Spikes

In an attempt to speed up the processing of physics objects in C#, I decided to change my sequential update algorithm to a parallel one. I figured the best approach was to use ThreadPool, since it is built to work through a queue of jobs.

When I first implemented the parallel algorithm, I queued one work item per physics object. Keep in mind that a single job completes quite quickly (it updates forces, velocity, and position, and checks for collisions against the old state of any surrounding objects to stay thread-safe, etc.). I waited for all jobs to complete using a single wait handle together with an interlocked counter, which I decremented each time a physics object finished (when it reached zero, I set the wait handle). Waiting was required because the next step needed all objects to be updated first.
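In outline, that first version looked something like the sketch below; PhysicsObject and its Update() method are placeholders for my actual simulation types:

    using System.Threading;

    // Per-object version: one queued work item per physics object.
    // PhysicsObject/Update() are placeholders for the real simulation types.
    void UpdateAllParallel(PhysicsObject[] objects)
    {
        using (var done = new ManualResetEvent(false))
        {
            int remaining = objects.Length;

            foreach (var obj in objects)
            {
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    obj.Update(); // forces, velocity, position, collisions vs. old state

                    // The last job to finish signals the wait handle.
                    if (Interlocked.Decrement(ref remaining) == 0)
                        done.Set();
                });
            }

            done.WaitOne(); // block until every object has been updated
        }
    }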

The first thing I noticed was erratic behavior. On average the thread pool version seemed a little faster, but it had massive performance spikes (about 10 ms per update, with random jumps up to 40-60 ms). I tried to profile this with ANTS, but I could not work out why the spikes occurred.

My next approach still used ThreadPool, but instead I split the objects into batches. I started with just 8 batches, one for each core of my machine. Performance was great: it far exceeded the single-threaded approach and had no spikes (about 6 ms per update).

The only thing that bothered me was that if one job finished before the others, a core would sit idle. So I increased the number of jobs to about 20, and even up to 500. As I expected, the update time dropped further, to 5 ms.
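The batched version, in the same sketch style (again with placeholder type names); batchCount is the knob I varied from 8 up to 500:

    using System;
    using System.Threading;

    // Batched version: split the objects into batchCount contiguous slices
    // and queue one work item per slice instead of one per object.
    void UpdateInBatches(PhysicsObject[] objects, int batchCount)
    {
        using (var done = new ManualResetEvent(false))
        {
            int remaining = batchCount;
            int batchSize = (objects.Length + batchCount - 1) / batchCount;

            for (int b = 0; b < batchCount; b++)
            {
                int start = b * batchSize;
                int end = Math.Min(start + batchSize, objects.Length);

                ThreadPool.QueueUserWorkItem(_ =>
                {
                    for (int i = start; i < end; i++)
                        objects[i].Update();

                    if (Interlocked.Decrement(ref remaining) == 0)
                        done.Set();
                });
            }

            done.WaitOne();
        }
    }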

So my questions are:

  • Why did the spikes occur when I queued many small jobs?
  • Any insight into how ThreadPool is implemented that would help me understand how best to use it?
+4
5 answers

Here are my answers to your two questions:

I would like to start with question 2 (how the thread pool works), since it actually holds the key to answering question 1. The thread pool is implemented (without going into detail) as a thread-safe work queue plus a group of worker threads (which can shrink or grow as needed). When the user calls QueueUserWorkItem , the task is placed in the work queue. Workers keep polling the queue and pick up work when they are idle. Once a worker takes a task, it executes it and then returns to the queue for more work (this is very important!). Thus the work is performed on demand by the workers: as workers become idle, they take on more work.
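A toy model of that design, using BlockingCollection as the thread-safe queue; the real ThreadPool is far more sophisticated (per-thread queues, work stealing, dynamic thread injection), so this only illustrates the workers-pull-from-a-queue idea:

    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    // Toy pool: a shared blocking queue plus a fixed set of worker threads.
    class ToyThreadPool
    {
        private readonly BlockingCollection<Action> queue = new BlockingCollection<Action>();

        public ToyThreadPool(int workerCount)
        {
            for (int i = 0; i < workerCount; i++)
            {
                var worker = new Thread(() =>
                {
                    // Each worker loops forever: take a task, run it, go back for more.
                    foreach (var task in queue.GetConsumingEnumerable())
                        task();
                });
                worker.IsBackground = true;
                worker.Start();
            }
        }

        public void QueueUserWorkItem(Action task) => queue.Add(task);
    }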

With the above in mind, the answer to question 1 (why you saw a performance difference with finer-grained tasks) becomes clear: with fine granularity you get better load balancing (a very desirable property), i.e. your workers do roughly the same amount of work and all cores stay evenly busy. As you said, with a coarse-grained distribution there can be longer and shorter tasks, so one or more cores can lag behind, slowing down the overall computation while the others do nothing. With small tasks the problem goes away: each worker thread performs one small task at a time and then returns for more. A thread that draws shorter tasks simply visits the queue more often; a thread that draws a longer task visits it less often, so the load stays balanced.

Finally, when tasks are too small, and given that the pool can grow to more than 1,000 threads, there is very high contention on the queue when all the threads come back for more work (which happens very often), and this can explain the spikes you are seeing. If the underlying implementation uses a blocking lock to guard the queue, context switches become very frequent, which is very harmful to performance and makes it fairly erratic.

+2

Using threads has a price: you pay for context switches, and you pay for locking (the job queue is almost certainly locked while a thread fetches a new task). This price is usually low compared to the actual work your thread does, but if the work finishes quickly, the price becomes significant.

Your solution seems right. A reasonable rule of thumb is to have twice as many threads as cores.
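In code, that rule of thumb is just (Environment.ProcessorCount is the standard way to get the logical core count):

    // Rule of thumb: twice as many jobs as logical cores.
    int jobCount = Environment.ProcessorCount * 2;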

+4

As you probably suspect yourself, the spikes are most likely caused by the code that manages the thread pool and distributes tasks to it.

For parallel programming, there are more sophisticated approaches than manually distributing work across different threads (even when using ThreadPool).

See Parallel Programming in the .NET Framework , for example, for an overview of the various options. In your case, the "solution" may be as simple as this:

 Parallel.ForEach(physicObjects, physicObject => Process(physicObject)); 
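If you want to cap the parallelism explicitly, the same call also takes a ParallelOptions argument; a variant of the snippet above:

    var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };
    Parallel.ForEach(physicObjects, options, physicObject => Process(physicObject));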
+3

The answer to question 1: this is due to thread switching. Thread switching (or context switching, in OS terms) costs CPU cycles every time execution moves from one thread to another. In most cases multithreading increases the speed of a program, but when each task is very small, the context switching takes longer than the processing itself, so overall throughput decreases. You can find more about this in operating systems textbooks.

The answer to question 2: honestly, I do not have a complete understanding of ThreadPool, so I cannot explain its internal structure.

0

To learn more about ThreadPool, start here: ThreadPool Class .

Each version of the .NET Framework adds more features that use ThreadPool indirectly. For example, the Parallel.ForEach method, added in .NET 4 along with System.Threading.Tasks , makes the code more readable and neat. You can learn more about this here: Task Schedulers .
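For illustration, here is the batching idea expressed with tasks instead of raw ThreadPool calls; physicObjects, Process, and batchCount are assumed to be defined as in the question:

    using System.Linq;
    using System.Threading.Tasks;

    // One task per batch; Task.WaitAll replaces the manual wait handle.
    var tasks = Enumerable.Range(0, batchCount)
        .Select(b => Task.Factory.StartNew(() =>
        {
            // Strided partitioning: batch b handles indices b, b+batchCount, ...
            for (int i = b; i < physicObjects.Length; i += batchCount)
                Process(physicObjects[i]);
        }))
        .ToArray();

    Task.WaitAll(tasks);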

At the most basic level, what it does is: it creates, let's say, 20 threads and puts them in a list. Each time it receives a delegate to execute asynchronously, it takes an idle thread from the list and executes the delegate. If no threads are available, it puts the delegate in a queue. Every time a delegate completes, the pool checks whether there is anything in the queue, and if so, takes one item and executes it on the same thread.

0

Source: https://habr.com/ru/post/1433161/

