What access patterns are most effective for writing cache-friendly code, such as an outer product, that maximizes data locality?
Consider a code block for processing all pairs of elements from two arrays, such as:
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        out[i*M + j] = X[i] binary-op Y[j];
This is the standard vector-vector outer product when binary-op is scalar multiplication and X and Y are 1-d arrays, but the same pattern is also matrix multiplication when X and Y are matrices and binary-op is the dot product of the i-th row of X with the j-th column of Y.
For matrix multiplication, I know that optimized BLAS implementations such as OpenBLAS and MKL get much better performance than the double-loop style code above, because they process elements in blocks and thereby make far better use of the processor cache. Unfortunately, the OpenBLAS kernels are written in assembly, so it's pretty hard to understand what's going on.
Are there any well-known tricks for reorganizing these kinds of double loops to improve cache performance?
Since each element of out is written exactly once, we are free to reorder the iterations. A straightforward linear walk over out is the easiest to write, but I don't think it is the most efficient order to execute, since it does not exploit any reuse of Y: every pass over the inner loop re-reads all of Y.
I'm particularly interested in the setting where M and N are large and each element ( X[i] and Y[j] ) is quite small (say, O(1) bytes), i.e. something like a vector-vector outer product, or multiplying a tall, skinny matrix by a short, fat matrix (for example, N x D times D x M , where D is small).