(Update: the earlier version had the indexing wrong. This version auto-vectorizes easily.)
Use a multidimensional C array (not an array of pointers to pointers), or a flat 1D array indexed with `i*cols + j`, so the memory is contiguous. This makes a big difference to how well hardware prefetching works, which is how you make good use of memory bandwidth. Loads whose address comes from another load really suck for performance; conversely, having predictable addresses known well ahead of time helps a lot, because loads can start long before their results are needed (thanks to out-of-order execution).
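To illustrate the difference, here's a minimal sketch of the same element access under the two layouts (identifiers here are mine, not from the question):

```c++
#include <cstddef>

int flat_get(const int *B, std::size_t cols, std::size_t i, std::size_t j) {
    // One contiguous block: the address depends only on i and j, so the
    // hardware prefetcher can stream through it and the load can issue early.
    return B[i * cols + j];
}

int ptr_get(const int *const *A, std::size_t i, std::size_t j) {
    // Pointer chasing: the data load can't even start until the load of the
    // row pointer A[i] completes -- exactly the load-feeding-a-load pattern.
    return A[i][j];
}
```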
Also, @user31264's answer is correct: you need to interchange the loops so the `j` loop is the outermost one. That's good, but nowhere near sufficient on its own.
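Schematically, the interchange looks like this (names and shapes only approximate the question's code; the contiguous-layout advice above applies on top of it):

```c++
// Sketch of the loop interchange only.
void add_rows(int n, int cols, const int *list, int **A, int **C) {
    // With j outermost, A[j][] is read sequentially and the 'cols' updates
    // to C[list[j]][] are contiguous, instead of revisiting every
    // destination row once per column as an i-outer loop nest would.
    for (int j = 0; j < n; j++) {
        int dst = list[j];                 // one index load per source row
        for (int i = 0; i < cols; i++)
            C[dst][i] += A[j][i];
    }
}
```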
It also lets the compiler auto-vectorize. Actually, it was surprisingly hard for me to get gcc to auto-vectorize. (But that's probably because I had the indexing wrong, since I only glanced at the code the first time, so the compiler didn't know we were looping over contiguous memory.)
I played around with it on the Godbolt compiler explorer.
I finally got good compiler output from this version, which takes A and C as flat 1D arrays and does the indexing itself:
```c++
void MatrixAddition_contiguous(int rows, int n, const vector<int>& list,
                               const int *__restrict__ A, int *__restrict__ C, int cluster)
  // still auto-vectorizes without __restrict__, but checks for overlap and
  // runs a scalar loop in that case
{
  const int cols = 4;  // or global constexpr or something
  int *__restrict__ Ccluster = C + ((long)cluster) * rows * cols;

  for(int i=0;i<rows;i++)
    // (remainder reconstructed -- the original snippet was truncated here:
    //  zero the destination block, then accumulate each block named in list)
    for(int k=0;k<cols;k++) Ccluster[cols*i + k] = 0;

  for(int j=0; j<n; j++) {
    int list_j = list[j];                               // hoisted index load
    const int *__restrict__ Arow = A + (long)list_j * rows * cols;
    for(int i=0; i<rows*cols; i++)
      Ccluster[i] += Arow[i];                           // contiguous accumulate
  }
}
```
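For reference, a hypothetical call site matching the layout the function assumes (sizes and names are mine, and this leans on the reconstructed body above): C holds one rows*cols block per cluster, A holds one rows*cols block per source matrix, and list names the blocks belonging to this cluster.

```c++
#include <vector>
using std::vector;

void example() {
    const int rows = 8, cols = 4, num_blocks = 16, num_clusters = 3;
    vector<int> A(num_blocks * rows * cols, 1);    // source blocks
    vector<int> C(num_clusters * rows * cols, 0);  // one block per cluster
    vector<int> list = {0, 2, 5};                  // blocks summed into cluster 1
    MatrixAddition_contiguous(rows, (int)list.size(), list,
                              A.data(), C.data(), /*cluster=*/1);
}
```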
Manually hoisting `list[j]` definitely helped the compiler see that stores to C can't affect the indices that will be loaded from `list[j]`. Manually hoisting the other stuff probably wasn't necessary.
Hoisting `A[list[j]]`, rather than just `list[j]`, is an artifact of a previous approach where I had the indexing wrong. As long as we hoist the load from `list[j]` as far as possible, the compiler can do a good job even if it doesn't know that `list` doesn't overlap C.
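To make the aliasing point concrete, here's the shape of the hoist (a sketch, not the exact code from above):

```c++
// Load the index once so the compiler knows later stores through C
// can't change it.
void hoisted_add(int j, int cols, const int *list, const int *A, int *C) {
    int list_j = list[j];                  // single load; a local can't be
                                           // aliased by stores through C
    const int *Arow = A + (long)list_j * cols;
    int *Crow = C + (long)list_j * cols;
    for (int i = 0; i < cols; i++)
        Crow[i] += Arow[i];                // without the hoist, each store
                                           // would force a reload of list[j]
}
```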
The inner loop, from gcc 5.3 targeting x86-64, with -O3 -Wall -march=haswell -fopenmp (and -fverbose-asm):
```asm
.L26:
        vmovdqu ymm0, YMMWORD PTR [r8+rax]        # MEM[base: Ccluster_20, index: ivtmp.91_26, offset: 0B]
        vpaddd  ymm0, ymm0, YMMWORD PTR [rdx+rax] # vect__71.75, MEM[base: vectp.73_90, index: ivtmp.91_26, offset: 0B]
        add     r12d, 1                           # ivtmp.88,
        vmovdqu YMMWORD PTR [r8+rax], ymm0        # MEM[base: Ccluster_20, index: ivtmp.91_26, offset: 0B], vect__71.75
        add     rax, 32                           # ivtmp.91,
        cmp     r12d, r9d                         # ivtmp.88, bnd.66
        jb      .L26                              #,
```
So it does eight adds at once, with AVX2 `vpaddd`, using unaligned loads and an unaligned store back into C.
Since it's auto-vectorized, it should also produce decent code for ARM NEON, or PPC Altivec, or anything else with a packed 32-bit integer add.
I couldn't get gcc to tell me anything useful with -ftree-vectorizer-verbose=2, but clang's -Rpass-analysis=loop-vectorize was slightly helpful.
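In case it helps, these are the reporting options I'd reach for now: -fopt-info-vec-missed is gcc's newer interface for the same information, and clang can also report the loops it did vectorize with -Rpass. (The file name below is just a placeholder.)

```
g++     -O3 -march=haswell -fopt-info-vec-missed matadd.cpp
clang++ -O3 -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -Rpass-missed=loop-vectorize matadd.cpp
```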