Memory access bottleneck in parallel computing

The following algorithm runs iteratively in my program. Running it without the two lines indicated below takes 1.5X as long as not running it at all, which already surprises me. The worst part is that including those two lines increases the run time to 4.4X that of running without them (6.6X that of not running the algorithm at all). On top of that, it prevents my program from scaling beyond ~8 cores. In fact, on a single core the two lines only increase the time to 1.7X, which is still too high considering what they do. I have ruled out any effect of the modified data elsewhere in my program.

So I wonder what could be causing this. Could it be something cache-related?

    void NetClass::Age_Increment(vector<synapse>& synapses, int k)
    {
        int size = synapses.size();
        int target = -1;

        if (k > -1)
        {
            for (int q = 0, x = 0; q < size; q++)
            {
                if (synapses[q].active)
                    synapses[q].age++;
                else
                {
                    if (x == k) target = q;
                    x++;
                }
            }
            ///////////////////////// Causing bottleneck /////////////////////////
            synapses[target].active = true;
            synapses[target].weight = .04 + (float (rand_r(seedp) % 17) / 100);
            //////////////////////////////////////////////////////////////////////
        }
        else
        {
            for (int q = 0; q < size; q++)
                if (synapses[q].active)
                    synapses[q].age++;
        }
    }

Update: replacing the two problematic lines with

    bool x = true;
    float y = .04 + (float (rand_r(seedp) % 17) / 100);

removes the problem. Perhaps this suggests the issue is related to memory access?

+6
3 answers

Each thread is modifying memory that all the other threads read:

    for(int q=0, x=0 ; q < size; q++)
        if(synapses[q].active)          // ALL threads read EVERY synapse.active
            ...
    ...
    synapses[target].active = true;     // EVERY thread writes at least one synapse.active

These kinds of reads and writes of the same addresses from different threads cause a great deal of cache invalidation, which would produce exactly the symptoms you describe. The solution is to avoid writing inside the loop, and the fact that moving the writes into local variables removes the problem again confirms that cache invalidation is the culprit. Note that even if you did not write to the heavily read field (active), you would likely still see the same symptoms due to false sharing, since I suspect that active, age and weight share a cache line.

For more information, see the talk CPU Caches and Why You Care.

A final note: the assignments to active and weight, not to mention the age++ increment, all look extremely unsafe under concurrency. Atomic operations, or lock/mutex protection, would be mandatory for such updates.

+6

If size is relatively small, it doesn't surprise me at all that calling the PRNG, doing an integer division, and a float division and addition significantly increase the running time. You are simply doing more work, so it stands to reason that it takes longer. Also, since you told the compiler to do the math in float rather than double, this can be even slower on some systems (where the native floating-point type is double). Have you considered a fixed-point representation using ints?

I can't say why it would get worse with more cores, unless you are exceeding the number of cores the OS has granted your program (or unless your system's rand_r is implemented with a lock, or with per-thread data it maintains as extra state).

Also note that you never check whether target is valid before using it as an array index. If it ever makes it out of the for loop still set to -1, all bets are off for your program.

+2

Try rerunning these two lines, but without rand_r, to see whether you get the same performance degradation. If you don't, that is probably a sign that rand_r is serialized internally (for example with a mutex), and you will need to find a way to generate random numbers in a more parallel fashion.

Another potential problem is false sharing (if you have the time, take a look at Herb Sutter's video and slides that cover this subject, among others). Essentially, if your threads modify distinct memory locations that are close enough to fall on the same cache line, the cache-coherence hardware can effectively serialize the memory accesses and destroy scalability. What makes this hard to diagnose is that the memory locations can be logically independent, and it may not be intuitively obvious that they end up close together at run time. Try adding some padding to separate such memory locations from each other if you suspect false sharing.

+2

Source: https://habr.com/ru/post/902470/
