Why do eight processes with two threads create more load than one process with 16 threads?

I have a simple program that starts n threads and generates some load on each of them. If I start only one thread, one core gets about 100% load. If I start one process with 16 threads (one thread per core), I get only about 80% load. If I start 8 processes with 2 threads each (still one thread per core), I get about 99% load. I do not use any locking in this example.

What is the reason for this behavior? I understand that the load drops if 100 threads are running, because the OS has to schedule a lot. But in this case there are only as many threads as there are cores.

It gets even stranger (to me, at least): if I add a simple Thread.Sleep(0) to my loop, the load with one process and 16 threads increases to 95%.

Can someone answer this or provide a link with more information about this particular topic?

[Screenshot: one process, 16 threads]

[Screenshot: eight processes, 2 threads each]

[Screenshot: one process, 16 threads, with Thread.Sleep(0)]

    // Sample application which reads the number of threads to be started from Console.ReadLine
    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Text;
    using System.Threading;

    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Enter the number of threads to be started");
            int numberOfThreadsToStart;
            string input = Console.ReadLine();
            int.TryParse(input, out numberOfThreadsToStart);
            if (numberOfThreadsToStart < 1)
            {
                Console.WriteLine("No valid number of threads entered. Exit now");
                Thread.Sleep(1500);
                return;
            }

            List<Thread> threadList = new List<Thread>();
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < numberOfThreadsToStart; i++)
            {
                Thread workerThread = new Thread(MakeSomeLoad);
                workerThread.Start();
                threadList.Add(workerThread);
            }

            // Keep the main thread alive while the workers run
            while (true)
            {
                Console.WriteLine("I'm spinning... ");
                Thread.Sleep(2000);
            }
        }

        static void MakeSomeLoad()
        {
            for (int i = 0; i < 100000000; i++)
            {
                for (int j = 0; j < i; j++)
                {
                    // uncomment the following line to increase the load
                    // Thread.Sleep(0);
                    StringBuilder sb = new StringBuilder();
                    sb.Append("hello world" + j);
                }
            }
        }
    }
4 answers

Your test looks very heavy on the garbage collector. If you have 16 threads in one process, the GC has to run more often in that process, and since the client (workstation) GC is not parallel, this leads to a lower load. In other words, you have 16 garbage-producing threads per GC thread.

On the other hand, if you start 8 processes with two threads each, you get only two threads creating garbage for each GC thread, and the GC can run in parallel between these processes.

If you write a test that produces less garbage and uses more CPU, you are likely to get different results.

(Note that this is only a guess. I did not run your test, and since I only have a dual-core processor, my results would differ from yours anyway.)
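One way to check this guess is to count how many collections a piece of work triggers. The sketch below is only an illustration: the class and method names (`GcPressureProbe`, `Gen0CollectionsDuring`, `AllocatingWork`, `ArithmeticWork`) and the iteration count are made up for this example, and only `GC.CollectionCount` is a real framework call. It compares a loop that allocates like the question's inner loop against a loop that only does arithmetic.

```csharp
using System;
using System.Text;

static class GcPressureProbe
{
    // Runs the given work and returns how many generation-0 collections
    // happened in this process while it was running.
    public static int Gen0CollectionsDuring(Action work)
    {
        int before = GC.CollectionCount(0);
        work();
        return GC.CollectionCount(0) - before;
    }

    // Mirrors the inner loop from the question: a new StringBuilder and a
    // new concatenated string on every iteration.
    public static void AllocatingWork(int iterations)
    {
        for (int j = 0; j < iterations; j++)
        {
            StringBuilder sb = new StringBuilder();
            sb.Append("hello world" + j);
        }
    }

    // Pure arithmetic on locals: no heap allocations inside the loop.
    public static long ArithmeticWork(int iterations)
    {
        long acc = 0;
        for (int j = 0; j < iterations; j++)
        {
            acc += (long)j * j % 7;
        }
        return acc;
    }

    public static void Main()
    {
        const int iterations = 5000000; // arbitrary size chosen for the sketch
        Console.WriteLine("gen-0 collections, allocating loop: "
            + Gen0CollectionsDuring(() => AllocatingWork(iterations)));
        Console.WriteLine("gen-0 collections, arithmetic loop: "
            + Gen0CollectionsDuring(() => ArithmeticWork(iterations)));
    }
}
```

If the allocating loop reports many collections and the arithmetic loop reports none or very few, that would be consistent with the GC explanation, although it does not prove it on its own.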


Something else to consider is that the garbage collector has different modes:

  • Server GC
  • Workstation GC - Concurrent (default except for ASP.NET)
  • Workstation GC - Non Concurrent

You can find some of the gory details of each here.

Since you are running a lot of threads and allocating a whole bunch of memory, you should try Server GC.

Server GC is optimized for high throughput and high scalability in server applications where there is a consistent load and requests are allocating and freeing memory at a high rate. Server GC uses one heap and one GC thread per processor and tries to keep the heaps as balanced as possible. During a garbage collection, the GC threads work on their respective heaps and rendezvous at certain points. Since each of them works on its own heap, minimal locking is needed, which makes it very efficient in this type of situation.

You enable Server GC in your App.config:

    <configuration>
      <runtime>
        <gcServer enabled="true" />
      </runtime>
    </configuration>

Please note that this only works on a multiprocessor (or multi-core) system. If Windows reports only one processor, you get Workstation GC - Non Concurrent instead.
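To confirm which mode the process actually ended up with, the runtime can be asked directly. This is a minimal sketch: the class name `GcModeReport` and the method `Describe` are made up for illustration, while `GCSettings.IsServerGC`, `GCSettings.LatencyMode` and `Environment.ProcessorCount` are the framework properties it relies on.

```csharp
using System;
using System.Runtime;

static class GcModeReport
{
    // Builds a one-line summary of the GC configuration of the current process.
    public static string Describe()
    {
        string flavor = GCSettings.IsServerGC ? "Server GC" : "Workstation GC";
        return flavor
            + ", latency mode: " + GCSettings.LatencyMode
            + ", logical processors: " + Environment.ProcessorCount;
    }

    public static void Main()
    {
        Console.WriteLine(Describe());
    }
}
```

Running this once with and once without the `gcServer` setting should show whether the configuration was picked up.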


Use something like Thread.SpinWait(int.MaxValue) to load the processor, because your program mostly puts load on the memory subsystem, which can lead to effects such as false sharing. As CodeInChaos already noted, GC activity will also greatly affect performance.
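A minimal sketch of such a load generator, assuming one spinning thread per logical processor, could look like the following. The class name `SpinLoad` and the helper `StartSpinners` are hypothetical; `Thread.SpinWait` is the call suggested above.

```csharp
using System;
using System.Threading;

static class SpinLoad
{
    // Starts threadCount background threads that each busy-wait for
    // spinIterations iterations without allocating any memory.
    public static Thread[] StartSpinners(int threadCount, int spinIterations)
    {
        var threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++)
        {
            threads[i] = new Thread(() => Thread.SpinWait(spinIterations));
            threads[i].IsBackground = true;
            threads[i].Start();
        }
        return threads;
    }

    public static void Main()
    {
        // One spinner per logical processor; int.MaxValue iterations keeps
        // the cores busy long enough to read the load in a monitoring tool.
        Thread[] threads = StartSpinners(Environment.ProcessorCount, int.MaxValue);
        foreach (Thread t in threads)
        {
            t.Join();
        }
    }
}
```

Because the spinners allocate nothing, any remaining gap between the single-process and multi-process runs would point at something other than the GC.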


Like the others, I suspect this is due to the GC. The load example uses huge amounts of memory; by the end of the two loops, the StringBuilder objects will be requesting gigabyte-sized arrays to store their data.

There are several reasons why a GC thread can slow down processing.

One is that as soon as the VM runs out of memory, most threads will be paused, waiting for the GC to free memory before they can continue (because all threads will be requesting more memory at about the same time).

The second is thread context switching (and this is probably the bigger reason). If thread A is running on core X and runs out of memory, then either the GC has to be loaded onto core X, or all of thread A's memory has to be loaded from core X's cache into the cache of the core the GC is running on. Either way, the CPU has to wait until its cache is filled from RAM. RAM is fast compared to a hard drive, but compared to the processor it is painfully slow. And while the processor is waiting for RAM to respond, it cannot do any processing, which reduces the load.

When you have multiple virtual machines, each one can run on its own cores and does not need to care about what the other virtual machines are doing. And when the GC is invoked, no context switch is needed, since the GC can simply run on the same core as the other two threads in that virtual machine.


Source: https://habr.com/ru/post/1397045/

