What memory access patterns are most effective for double loops like an outer product?

What access patterns are most cache-effective when writing code such as an outer product, so that data locality is maximized?

Consider a code block for processing all pairs of elements from two arrays, such as:

for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        out[i*M + j] = X[i] binary-op Y[j];

This is the standard vector-vector outer product when binary-op is scalar multiplication and X and Y are 1-d, but the same pattern is also matrix multiplication when X and Y are matrices and binary-op is the dot product between the i-th row of X and the j-th column of Y (sketched just below).
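For concreteness, here is a minimal sketch of the matrix case (my own illustration, not from the original post), assuming row-major storage with X of shape N x D and Y of shape D x M; expanding binary-op into the dot product gives the familiar naive triple loop:

/* Illustrative sketch: the same double loop with binary-op expanded into a
   dot product, i.e. naive matrix multiplication (row-major; X: NxD, Y: DxM). */
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++) {
        double acc = 0.0;
        for (int d = 0; d < D; d++)
            acc += X[i*D + d] * Y[d*M + j];
        out[i*M + j] = acc;
    }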

For matrix multiplication, I know that optimized BLAS libraries such as OpenBLAS and MKL get much better performance than the double-loop style code above, because they process elements in blocks and thereby make much heavier use of the processor cache. Unfortunately, the OpenBLAS kernels are written in assembly, so it's pretty hard to understand what's going on.

Are there any good tricks of the trade for reorganizing these kinds of double loops to improve cache performance?

Since each element of out is written exactly once, we can freely reorder the iterations. A straightforward linear walk over out is the easiest to write, but I don't think it is the most efficient order to execute, since it doesn't exploit any temporal locality in Y.

I'm particularly interested in the setting where M and N are large and each element (X[i] and Y[j]) is quite small (for example, O(1) bytes), i.e. something like a vector-vector outer product, or multiplying a tall, skinny matrix by a short, fat matrix (for example, N x D by D x M, where D is small).

1 answer

For large enough M, the vector Y will exceed the L1 cache size.* So on each new outer iteration you will reload Y from main memory (or at least from a slower cache level). In other words, you are not exploiting temporal locality in Y.
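To put a number on that (an illustrative calculation; the cache size and element type are assumptions, not from the answer): with 8-byte double elements and a typical 32 KiB L1 data cache, Y occupies 8*M bytes, so it stops fitting in L1 once M exceeds 32768 / 8 = 4096 elements.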

You should block your accesses to Y; something like this:

for (int jj = 0; jj < M; jj += CACHE_SIZE) {  // Iterate over blocks of Y
    int jmax = (jj + CACHE_SIZE < M) ? (jj + CACHE_SIZE) : M;  // Don't run past M
    for (int i = 0; i < N; i++) {
        for (int j = jj; j < jmax; j++) {  // Iterate within the current block
            out[i*M + j] = X[i] * Y[j];
        }
    }
}

The above doesn't do anything smart with the accesses to X, but a new X value is only fetched once per CACHE_SIZE inner iterations, so the impact is probably negligible.
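As a rough guide to choosing CACHE_SIZE (an illustrative sizing only; the 32 KiB figure and double element type are my assumptions), give the Y block just a fraction of L1, leaving room for the streaming stores to out and the reads of X:

/* Illustrative sizing (assumptions: 32 KiB L1 data cache, 8-byte doubles).
   Give the Y block roughly half of L1 so out and X lines don't evict it:
   (32 * 1024 / 2) / 8 = 2048 elements. */
#define CACHE_SIZE 2048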


* If everything is small enough to already fit in the cache, then you can't do much better than what you already have (apart from the possibility of vectorization).
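Putting it together, here is a hedged sketch of the whole blocked loop as a self-contained function (the function name, restrict qualifiers, and block size are illustrative choices, not from the answer):

#define CACHE_SIZE 2048  /* illustrative Y-block size in elements; tune for your L1 */

/* Blocked outer product: out[i*M + j] = X[i] * Y[j].
   The j loop is tiled so each CACHE_SIZE-element chunk of Y stays hot in L1
   while all N rows are produced against it. */
void outer_product_blocked(double *restrict out,
                           const double *restrict X,
                           const double *restrict Y,
                           int N, int M)
{
    for (int jj = 0; jj < M; jj += CACHE_SIZE) {
        int jmax = (jj + CACHE_SIZE < M) ? (jj + CACHE_SIZE) : M;
        for (int i = 0; i < N; i++) {
            double xi = X[i];                /* hoist the loop-invariant X value */
            for (int j = jj; j < jmax; j++)  /* contiguous, stride-1 inner loop */
                out[i*M + j] = xi * Y[j];
        }
    }
}

Keeping the inner loop in this simple stride-1 form is also what typically lets the compiler auto-vectorize it (e.g., GCC or Clang at -O3).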

Source: https://habr.com/ru/post/971790/

