Parallel code using dispatch_group_async runs much slower than the single-threaded version

Recently I have been experimenting with using large quantities of random numbers to generate "normal distribution" curves.

The approach is simple:

  • Create an array of integers and zero it out. (I use 2001 entries.)
  • Repeatedly calculate an index into this array and increment that entry, as follows:
    • Loop either 999 or 1000 times. On each iteration:
      • Start an array index at the center value (1000)
      • Generate a random +1/-1 and add it to the array index
    • At the end of the loop, increment the value at the calculated index of the array.

Since the random 0/1 values tend to occur about equally often, the final index value from the inner loop above usually ends up close to the center value. Index values significantly larger or smaller than the starting value become increasingly rare.

After a large number of repetitions, the values in the array take on the shape of a normal-distribution bell curve. However, the high-quality random function arc4random_uniform() that I use is rather slow, and a very large number of iterations is needed to produce a smooth curve.

I wanted to compute 1,000,000,000 (one billion) points. Running on the main thread, that takes about 16 hours.

I decided to rewrite the calculation code to use dispatch_async and run it on my 8-core Mac Pro.

I ended up using dispatch_group_async() to dispatch 8 blocks, with dispatch_group_notify() to notify the program when all the blocks finish processing.

For simplicity on a first pass, all 8 blocks write to the same array of NSUInteger values. There is a small chance of a read/modify/write race on one of the array entries, but in that case it would simply lose one count. I planned to add a lock around the array increment later (or perhaps even create separate arrays in each block and sum them afterward).

In any case, I restructured the code to use dispatch_group_async(), computing 1/8 of the total values in each block, and fired off my code. To my complete befuddlement, the parallel code, while it pegs all the cores on my Mac, runs MUCH slower than the single-threaded code.

Running single-threaded, I get about 17,800 points per second. Running with dispatch_group_async, performance drops to about 665 points per second, or about 1/26th as fast. I varied the number of blocks dispatched - 2, 4, or 8 - it makes no difference. Performance is terrible. I also tried simply dispatching all 8 blocks with dispatch_async, without a dispatch_group. That makes no difference either.

As it stands, there is no explicit locking or blocking: all the blocks run flat out. I am completely mystified as to why the parallel code is slower.

The code is a bit convoluted at the moment because I restructured it so it can run either single-threaded or concurrently, so I could compare.

Here is the code that performs the calculations:

 randCount = 2;

 #define K_USE_ASYNC 1

 #if K_USE_ASYNC
     //dispatch_group_async
     dispatch_queue_t highQ = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0);
     dispatch_group_t aGroup = dispatch_group_create();
     int totalJobs = 8;
     for (int i = 0; i < totalJobs; i++)
     {
         dispatch_group_async(aGroup, highQ, ^{
             [self calculateNArrayPoints: KNumberOfEntries / totalJobs
                          usingRandCount: randCount];
         });
     }
     dispatch_group_notify(aGroup, dispatch_get_main_queue(), allTasksDoneBlock);
 #else
     [self calculateNArrayPoints: KNumberOfEntries usingRandCount: randCount];
     allTasksDoneBlock();
 #endif

And here is the shared calculation method, used by both the single-threaded and parallel versions:

 + (void) calculateNArrayPoints: (NSInteger) pointCount usingRandCount: (int) randCount
 {
     int entry;
     int random_index;
     for (entry = 0; entry < pointCount; entry++)
     {
         static int processed = 0;
         if (entry != 0 && entry % 100000 == 0)
         {
             [self addTotTotalProcessed: processed];
             processed = 0;
         }
         //Start with a value of 1000 (center value)
         int value = 0;

         //For each entry, add +/- 1 to the value 1000 times.
         int limit = KPinCount;
         if (randCount == 2)
             if (arc4random_uniform(2) != 0)
                 limit--;
         for (random_index = 0; random_index < limit; random_index++)
         {
             int random_value = arc4random_uniform(randCount);
             /*
              if 0, value--
              if 1, no change
              if 2, value++
              */
             if (random_value == 0)
                 value--;
             else if (random_value == randCount - 1)
                 value++;
         }
         value += 1000;
         _bellCurveData[value] += 1;
         //printf("\n\nfinal value = %d\n", value);
         processed++;
     }
 }

This is a quick-and-dirty learning project. It runs on both Mac and iOS, so it uses a shared utility class. The utility class is nothing but class methods; no instance of the utility class is ever created. It has an embarrassing number of global variables. If I end up doing something useful with the code, I will refactor it into a utility singleton and convert all the globals to instance variables on the singleton.

For now it works, and the ugly use of globals doesn't affect the result, so I'm leaving it alone.

The code that uses the "processed" variable is just a way of finding out how many points have been calculated when running in parallel. I added that code after I discovered the terrible performance of the parallel version, so I'm confident it is not the cause of the slowdown.

So here I am. I have written a lot of multithreaded code, and this task is "embarrassingly parallel", so there is no reason it shouldn't run flat out on all available cores.

Anyone else see anything that might trigger this, or have any other ideas?

+4
3 answers

arc4random uses a critical section while updating its state. The critical section is super-fast in the uncontended case (when the lock goes from unlocked to locked), but in the contended case (when a thread tries to lock a mutex that is already locked) it has to call into the operating system and put the thread to sleep, which hurts performance badly.

 u_int32_t
 arc4random()
 {
     u_int32_t rnd;

     THREAD_LOCK();
     arc4_check_init();
     arc4_check_stir();
     rnd = arc4_getword(&rs);
     THREAD_UNLOCK();
     return (rnd);
 }

where THREAD_LOCK() is defined as

 #define THREAD_LOCK()                          \
     do {                                       \
         if (__isthreaded)                      \
             _pthread_mutex_lock(&arc4random_mtx); \
     } while (0)

Source: Arc4 Random Number Generator for OpenBSD

To make it faster, you can create an arc4random class that wraps the arc4_* static functions from arc4random.c. Then you have a random number generator that is no longer thread-safe, but you can create one generator per thread.

+5

This is speculation, and I can't confirm it either way without profiling the code (which is how this should really be investigated).

However, arc4random locks on every call, going by Apple's published source code. Since you are calling arc4random_uniform from multiple threads, you hit that lock at least once, if not several times, per call. So my best guess is that each task ends up waiting on the other tasks' calls to arc4random_uniform (or _uniform may in turn be waiting on its own internal calls to arc4random, if several calls are in flight in parallel).

The easiest way to fix this may be to pull in the existing arc4random.c source and modify it either to be wrapped in a class with the synchronization removed (as I suggested in chat, and as Michael suggested), or to use thread-local storage (that sidesteps the thread-safety issue, but could end up just as slow - I haven't tried it myself, so take that with a mountain of salt). Keep in mind that if you go that route, you will need an alternative to accessing /dev/random on iOS. For that I recommend SecRandomCopyBytes, since it should give results just as good as reading from /dev/random yourself.

So, while I'm fairly sure it is arc4random, I can't say for certain without profiling, since there could be other issues dragging performance down even before arc4random does its work.

+3

Ok, thanks to Michael and Noel for your thoughtful answers.

It does indeed appear that arc4random() and arc4random_uniform() use a spinlock variant, and the performance is terrible under multithreaded use.

It makes sense that a spinlock is a really bad choice when there is a lot of contention, because a spinning thread busy-waits until the lock is released, tying up that core.

The ideal fix would be to create my own version of arc4random that keeps its state array in instance variables and is not thread-safe. I would then refactor my app to create a separate instance of the random generator for each thread.

However, this is a side project for my own education, and that is more effort than I'm willing to put in unless somebody is paying me.

As an experiment, I replaced the calls with rand(), and the single-threaded case got quite a bit faster, since rand() is a simpler, faster algorithm. The random numbers are not as good, though. From what I've read, rand() has problems with cyclic patterns in its low-order bits, so instead of using the typical rand() % 2, I used rand() % 0x4000 so the decision comes from a higher-order bit instead.

However, performance still fell off a cliff when I tried to use rand() in the multithreaded code. It must take a lock internally as well.

I then switched to rand_r(), which takes a pointer to the seed value, on the assumption that since it keeps no internal state, it probably does not take a lock.

Bingo. Now I get 415,674 points per second on my 8-core Mac Pro.

+1

Source: https://habr.com/ru/post/1207593/

