Loop tiling: how to choose the block size?

I am trying to learn loop optimization. I read that loop tiling helps speed up array loops, so I tried the two code blocks below, with and without blocking, and measured the time spent on both. In most cases I see no significant difference. I experimented with the block size, but I'm not sure how to choose one. Please tell me if my approach is wrong. In fact, the loop without blocking often runs faster.

a. With blocking

 int max = 1000000;
 int B = 100;
 for (i = 0; i < max; i += B) {
     for (j = i; j < (i + B); j++) {
         array[j] = 0;
     }
 }

b. Without blocking

 for (i = 0; i < max; i++) {
     array[i] = 0;
 }

Timings:

With blocking: 6,997,000 ns

Without blocking: 6,097,000 ns

1 answer

As stated here, tiling is a technique meant to keep your working set inside the caches while you work on it, so that you benefit from their low latency. If you stream over your data only once, you will see no benefit, since you never reuse anything from the cache, so tiling will not help.

In your example, the two loops do exactly the same work in the same order; the tiled one merely adds another branch and creates a slightly more complicated branch pattern (most predictors can cope with this, it just doesn't help).

Consider the following example:

 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>

 #define MAX (1024*1024*32)
 #define REP 100
 #define B   (16*1024)

 static char array[MAX];  /* static: a 32 MB local would overflow the stack */

 int main()
 {
     int i, j, r;

     for (i = 0; i < MAX; i++) {  /* warmup, to make things equal if the array happens to fit in your L3 */
         array[i] = 0;
     }

     clock_t t1 = clock();

     /* Tiled loop */
     for (i = 0; i < MAX; i += B) {
         for (r = 0; r < REP; r++) {
             for (j = i; j < (i + B); j += 64) {
                 array[j] = r;
             }
         }
     }

     clock_t t2 = clock();

     /* Untiled loop */
     for (r = 0; r < REP; r++) {
         for (i = 0; i < MAX; i += 64) {
             array[i] = r;
         }
     }

     clock_t t3 = clock();

     printf("Tiled:   %f sec\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
     printf("Untiled: %f sec\n", (double)(t3 - t2) / CLOCKS_PER_SEC);
     printf("array[0] = %d\n", array[0]);  /* prevent optimizing out all the writes */
     return 0;
 }

Both loops perform the same accesses (the 64-byte stride stresses the caches by touching each cache line only once, and avoids IPC and instruction-scheduling bottlenecks).

The tiled version reorders these accesses into blocks, so repeating a single block can hit the cache again and again. Since the block size is set to 16k, it will probably fit into most L1 caches and enjoy very low latency. On each iteration of the outer loop you will have one pass where you miss all the caches and go to memory (if your L3 is larger than 32M, just set MAX even higher to make sure), and REP-1 passes that fly from the L1.

The untiled version also repeats REP times in total, but each repetition streams all the data out of the caches, so every access goes to memory, accumulating a much higher overall latency.

Compiling with gcc 4.8.1 (-O3) on a Xeon 5670 @ 2.9 GHz gives me:

 Tiled:   0.110000 sec
 Untiled: 0.450000 sec
 array[0] = 99

over 4x :)

Note that the untiled version still has one advantage: it is a single sequential stream, so the HW prefetcher can fetch data ahead of you, which somewhat mitigates the memory latency. However, you can help the CPU do something similar in the tiled version if you add the following:

 for (i = 0; i < MAX; i += B) {
     for (r = 0; r < REP; r++) {
         for (j = i; j < (i + B); j += 64) {
             array[j] = r;
             if (r == REP - 2)  /* SW prefetching */
                 __builtin_prefetch(&array[j + B], 0);
         }
     }
 }

This tells the CPU to bring in the next block a little ahead of finishing the current one. For the price of a conditional branch (with a few mispredictions per block), you save the time of the first, cache-cold pass over the next block. I get a further reduction from this:

 Tiled: 0.070000 sec 

Source: https://habr.com/ru/post/959302/

