A faster way to do multidimensional matrix addition?

I have a matrix A of size (m * l * 4), where m is about 100,000 and l = 100. The list always has size n, with n <= m. I want to add together the entries of A selected by a list of indices. I wrote this function, and it has to be called many times.

 void MatrixAddition(int l, int n, vector<int>& list, int ***A, int ***C, int cluster) {
     for (int i = 0; i < l; i++) {
         for (int j = 0; j < 4; j++)
             C[cluster][i][j] = 0;
     }

     for (int i = 0; i < l; i++) {
         for (int j = 0; j < n; ++j) {
             for (int k = 0; k < 4; k++)
                 C[cluster][i][k] += A[list[j]][i][k];
         }
     }
 }

I used gprof to measure how much time each part of the code takes, and found that about 60% of the total time is spent in the MatrixAddition function. Is there an alternative way to write this function that reduces the execution time?

   %   cumulative   self              self     total
  time   seconds   seconds    calls  ms/call  ms/call  name
 52.00      7.85      7.85       20   392.60   405.49  MatrixAddition(int, int, std::vector<int>&, int***, int***, int)

c++ optimization gprof
May 14 '16 at 22:15
2 answers

Swap the loop over i and the loop over j in the second part. This will make the function more cache-friendly.

 for (int j = 0; j < n; ++j) {
     for (int i = 0; i < l; i++) {
         for (int k = 0; k < 4; k++)
             C[cluster][i][k] += A[list[j]][i][k];
     }
 }

Also, I hope you did not forget the -O3 flag.

May 14 '16 at 23:28

(Update: an earlier version had the indexing wrong. This version auto-vectorizes easily.)

Use a multidimensional C array (rather than an array of pointers to pointers), or a flat 1D array indexed with i*cols + j, so the memory is contiguous. That makes a huge difference to how effectively hardware prefetching can use the available memory bandwidth. Loads whose address comes from another load are really bad for performance; conversely, having predictable addresses known well in advance helps a lot, because the loads can start long before they are needed (thanks to out-of-order execution).
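As a rough illustration of that layout, here is a minimal sketch (the struct and member names are made up, not from the question) of a contiguous (m x l x 4) buffer with manual indexing:

 #include <cstddef>
 #include <vector>

 // Sketch: one contiguous allocation instead of chasing int*** pointers.
 struct Flat3D {
     std::vector<int> data;   // m * l * 4 ints in one contiguous block
     int l;                   // middle dimension (rows per outer index)
     Flat3D(int m, int l) : data((std::size_t)m * l * 4), l(l) {}
     int& at(int i, int j, int k) {                   // plays the role of A[i][j][k]
         return data[((std::size_t)i * l + j) * 4 + k];
     }
 };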

Also, @user31264's answer is right: you need to interchange the loops so that the j loop is the outer-most one. That's good, but nowhere near enough on its own.

It will also let the compiler auto-vectorize. Actually, I had a surprisingly hard time getting gcc to auto-vectorize. (But that's probably because I got the indexing wrong the first time I looked at the code, so the compiler didn't know we were looping over contiguous memory.)




I played around with it on the Godbolt compiler explorer.

I finally got good compiler output from this version, which takes A and C as flat 1D arrays and does the indexing itself:

 void MatrixAddition_contiguous(int rows, int n, const vector<int>& list,
                                const int *__restrict__ A, int *__restrict__ C, int cluster)
 // still auto-vectorizes without __restrict__, but checks for overlap and
 // runs a scalar loop in that case
 {
     const int cols = 4;  // or a global constexpr or something
     int *__restrict__ Ccluster = C + ((long)cluster) * rows * cols;

     for (int i = 0; i < rows; i++)
         //#pragma omp simd
         for (int k = 0; k < 4; k++)
             Ccluster[cols*i + k] = 0;

     for (int j = 0; j < n; ++j) {    // loop over the listed clusters of A in the outer-most loop
         const int *__restrict__ Alistj = A + ((long)list[j]) * rows * cols;
         // #pragma omp simd   // Doesn't work: only auto-vectorizes with -O3
         // probably only -O3 lets gcc see through the k=0..3 loop and treat it like one big loop
         for (int i = 0; i < rows; i++) {
             long row_offset = cols*i;
             //#pragma omp simd  // forces vectorization with 16B vectors, so it hurts AVX2
             for (int k = 0; k < 4; k++)
                 Ccluster[row_offset + k] += Alistj[row_offset + k];
         }
     }
 }
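A hypothetical call site might look like the following; the sizes, cluster count, and index list are made up purely to show the intended flat layout:

 #include <vector>
 using std::vector;

 int main() {
     const int m = 1000, rows = 100;               // m source clusters of (rows x 4) ints
     vector<int> A((std::size_t)m * rows * 4, 1);  // flat, contiguous source data
     vector<int> C((std::size_t)8 * rows * 4);     // room for 8 output clusters
     vector<int> list = {0, 5, 42};                // n = 3 indices into A
     MatrixAddition_contiguous(rows, (int)list.size(), list, A.data(), C.data(), /*cluster=*/0);
 }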

Manually hoisting list[j] definitely helped the compiler realize that the stores into C can't affect the indices that will be loaded from list[j]. Manually hoisting the other stuff probably wasn't necessary.

Hoisting A[list[j]], rather than just list[j], is an artifact of a previous approach where I had the indexing wrong. As long as we hoist the load from list[j] as early as possible, the compiler can do a good job even if it doesn't know that list doesn't overlap C.
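For illustration, a minimal sketch of the hoisting idea applied to the original int*** layout (this only shows the hoisting; the contiguous version above is still much better for memory bandwidth):

 #include <vector>

 // Sketch: hoist the list[j] load (and the A[list[j]] pointer) out of the
 // inner loops so the compiler can see that stores to C can't change them.
 void MatrixAddition_hoisted(int l, int n, const std::vector<int>& list,
                             int ***A, int ***C, int cluster) {
     for (int i = 0; i < l; i++)
         for (int k = 0; k < 4; k++)
             C[cluster][i][k] = 0;

     for (int j = 0; j < n; ++j) {    // list index in the outer-most loop
         int **Aj = A[list[j]];       // hoisted: loaded once per j
         for (int i = 0; i < l; i++)
             for (int k = 0; k < 4; k++)
                 C[cluster][i][k] += Aj[i][k];
     }
 }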

The inner loop, from gcc 5.3 targeting x86-64 with -O3 -Wall -march=haswell -fopenmp (and -fverbose-asm):

 .L26:
     vmovdqu ymm0, YMMWORD PTR [r8+rax]        # MEM[base: Ccluster_20, index: ivtmp.91_26, offset: 0B]
     vpaddd  ymm0, ymm0, YMMWORD PTR [rdx+rax] # vect__71.75, MEM[base: vectp.73_90, index: ivtmp.91_26, offset: 0B]
     add     r12d, 1                           # ivtmp.88,
     vmovdqu YMMWORD PTR [r8+rax], ymm0        # MEM[base: Ccluster_20, index: ivtmp.91_26, offset: 0B], vect__71.75
     add     rax, 32                           # ivtmp.91,
     cmp     r12d, r9d                         # ivtmp.88, bnd.66
     jb      .L26                              #,

So it does eight adds at once, with AVX2 vpaddd, using unaligned loads and an unaligned store back into C.

Since this auto-vectorizes, it should also produce good code for ARM NEON, PPC Altivec, or anything else that can do packed 32-bit adds.

I couldn't get gcc to tell me anything with -ftree-vectorizer-verbose=2, but clang's -Rpass-analysis=loop-vectorize was a bit helpful.

May 15 '16 at 7:08 a.m.


