Recently, I have been experimenting with using large quantities of random numbers to generate "normal distribution" curves.
The approach is simple:
- Create an array of integers and zero it. (I use 2001 integers.)
- Repeatedly compute an index into this array and increment the entry at that index, as follows:
  - Start an array index at the central value (1000).
  - Loop either 999 or 1000 times. At each iteration:
    - Generate a random +1 or -1 and add it to the array index.
  - At the end of the loop, increment the value at the computed index of the array.
Since the random +1 and -1 steps occur roughly equally often, the final index from the inner loop usually stays close to the central value; index values far above or below the starting value become increasingly rare.
After a large number of repetitions, the values in the array take on the shape of a normal-distribution bell curve. However, the high-quality random function arc4random_uniform() that I use is rather slow, and a great many iterations are required to generate a smooth curve.
I wanted to compute 1,000,000,000 (one billion) points. Running that on the main thread takes about 16 hours.
I decided to rewrite the calculation code to use dispatch_async and run it on my octa-core Mac Pro.
I ended up using dispatch_group_async() to submit 8 blocks, with dispatch_group_notify() to alert the program when all the blocks finish processing.
For simplicity on this first pass, all 8 blocks write to the same array of NSUInteger values. There is a small chance of a read/modify/write race on one of the array entries, but in that case it would simply lose one count. I planned to add a lock around the array increment later (or perhaps even give each block its own array and sum them all afterward.)
In any event, I restructured the code to use dispatch_group_async(), computing 1/8 of the total values in each block, and fired it off. To my utter befuddlement, the concurrent code, while it pegs all the cores on my Mac, runs MUCH slower than the single-threaded code.
When run on a single thread, I get about 17,800 points per second. When run using dispatch_group_async, performance drops to about 665 points per second, roughly 1/26th the speed. I varied the number of blocks I submit (2, 4, or 8) and it makes no difference. Performance is terrible. I also tried simply submitting all 8 blocks with dispatch_async, without a dispatch group. That makes no difference either.
As far as I can tell, no blocking or locking is going on: all the blocks run at full speed. I am utterly mystified as to why the concurrent code runs slower.
The code is a bit convoluted at the moment because I restructured it to run either single-threaded or concurrently, so that I could compare the two.
Here is the code that performs the calculations:
randCount = 2;

#define K_USE_ASYNC 1

#if K_USE_ASYNC
    dispatch_queue_t highQ = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0);
    //dispatch_group_async
    dispatch_group_t aGroup = dispatch_group_create();
    int totalJobs = 8;
    for (int i = 0; i < totalJobs; i++)
    {
        dispatch_group_async(aGroup, highQ, ^{
            [self calculateNArrayPoints: KNumberOfEntries / totalJobs
                         usingRandCount: randCount];
        });
    }
    dispatch_group_notify(aGroup, dispatch_get_main_queue(), allTasksDoneBlock);
#else
    [self calculateNArrayPoints: KNumberOfEntries usingRandCount: randCount];
    allTasksDoneBlock();
#endif
And here is the common calculation method, used by both the single-threaded and the concurrent version:
+ (void) calculateNArrayPoints: (NSInteger) pointCount usingRandCount: (int) randCount;
{
    int entry;
    int random_index;
    for (entry = 0; entry < pointCount; entry++)
    {
        static int processed = 0;
        if (entry != 0 && entry % 100000 == 0)
        {
            [self addTotTotalProcessed: processed];
            processed = 0;
        }

        //Start with a value of 1000 (center value)
        int value = 0;

        //For each entry, add +/- 1 to the value 1000 times.
        int limit = KPinCount;
        if (randCount == 2)
            if (arc4random_uniform(2) != 0)
                limit--;
        for (random_index = 0; random_index < limit; random_index++)
        {
            int random_value = arc4random_uniform(randCount);
            /*
             if 0, value--
             if 1, no change
             if 2, value++
             */
            if (random_value == 0)
                value--;
            else if (random_value == randCount - 1)
                value++;
        }
        value += 1000;
        _bellCurveData[value] += 1;
        //printf("\n\nfinal value = %d\n", value);
        processed++;
    }
}
This is a quick-and-dirty learning project. It runs on both Mac and iOS, so it uses a shared utility class. The utility class is nothing but class methods; no instance of the utility class is ever created. It has an embarrassing number of global variables. If I end up doing something useful with the code, I will refactor it into a utility singleton and convert the globals into instance variables on that singleton.
For now it works, and the ugly use of globals does not affect the result, so I'm leaving it alone.
The code that uses the "processed" variable is just a way to find out how many points have been calculated when running concurrently. I added that code after I discovered the terrible performance of the concurrent version, so I am confident it is not the cause of the slowdown.
So I'm stumped. I have written a lot of parallel code before, and this task is "embarrassingly parallel," so there is no reason it shouldn't run flat-out on all the available cores.
Does anyone see anything that might cause this, or have any other ideas?