The problem is the type of array used.
A vector is a container: a structure that stores several pieces of data (size, start, end, and so on) and provides several built-in functions, the [] operator among them, which is used to access the data. As a result, the cache lines loaded for, say, index i of the vector V contain the element V[i] plus some information that the code never uses.
Conversely, if you use classic arrays (dynamic or static), the [] operator loads only the data items. As a result, a cache line (usually 64 bytes) holds 8 elements of this double array (the size of a double is 8 bytes).
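The cache-line arithmetic above can be written out as a small sketch. It assumes the 64-byte line size mentioned in the text (real hardware may differ) and that the data starts on a line boundary; the names `kCacheLineBytes`, `elements_per_cache_line` and `cache_lines_for` are made up for illustration.

```cpp
#include <cstddef>

// Assumption: 64-byte cache lines, as stated in the text; real hardware may differ.
constexpr std::size_t kCacheLineBytes = 64;

// Number of whole elements of type T that fit in one cache line.
template <typename T>
constexpr std::size_t elements_per_cache_line() {
    return kCacheLineBytes / sizeof(T);
}

// Number of cache lines covered by n contiguous elements of type T,
// assuming the first element starts on a cache-line boundary.
template <typename T>
constexpr std::size_t cache_lines_for(std::size_t n) {
    return (n * sizeof(T) + kCacheLineBytes - 1) / kCacheLineBytes;
}
```

With 8-byte doubles, `elements_per_cache_line<double>()` gives 8, and an array of 10000000 doubles spans 1250000 lines under these assumptions.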
See the difference between _mm_malloc and malloc for better data alignment.
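To make the alignment point concrete, here is a minimal sketch of an aligned allocation. It uses the portable `std::aligned_alloc` (C++17) as a stand-in for `_mm_malloc`, which is specific to x86 compilers; the 64-byte alignment and the helper names `alloc_aligned_doubles` and `is_aligned` are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Assumption: align to 64 bytes, the typical cache-line size.
constexpr std::size_t kAlignment = 64;

// Allocate room for n doubles, aligned to kAlignment.
// std::aligned_alloc requires the size to be a multiple of the alignment,
// so the byte count is rounded up. Release the memory with std::free.
double* alloc_aligned_doubles(std::size_t n) {
    std::size_t bytes = n * sizeof(double);
    std::size_t rounded = (bytes + kAlignment - 1) / kAlignment * kAlignment;
    return static_cast<double*>(std::aligned_alloc(kAlignment, rounded));
}

// True if pointer p is a multiple of `alignment` bytes.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Plain `malloc` only guarantees alignment suitable for any fundamental type (commonly 16 bytes on 64-bit systems), so a buffer it returns may start in the middle of a cache line.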
@Mr Fooz I'm not sure about that. Let's compare the performance results for both cases:
4 threads on i7 processor
Array time: 0.122007 | Repeat: 4 | MFlops: 327.85
Vector time: 0.101006 | Repeat: 2 | MFlops: 188.669
I force the runtime to be greater than 0.1 s, so the code repeats. The main loop:
    const int N = 10000000;
    timing(&wcs);
    for(; runtime < 0.1; repeat*=2)
    {
        for(int r = 0; r < repeat; r++)
        {
            #pragma omp parallel for
            for(int i = 0; i < N; i++)
            {
                A[i] += B[i];
            }
            if(A[0]==0) dummy(A[0]);
        }
        timing(&wce);
        runtime = wce-wcs;
    }
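For anyone who wants to try this without the `timing` and `dummy` helpers, here is a self-contained sketch of a similar measurement. It is not the exact code used for the numbers above: it uses `std::chrono` instead of `timing`, restarts the timer for every batch, and the names `BenchResult` and `run_add_benchmark` are hypothetical. Compile with OpenMP enabled (for example `-fopenmp`); otherwise the pragma is ignored and the loop runs serially.

```cpp
#include <chrono>

// Hypothetical result type, not part of the original code.
struct BenchResult {
    double runtime;  // seconds taken by the last timed batch
    int repeat;      // number of sweeps over the data in that batch
};

// Runs batches of `repeat` sweeps of A[i] += B[i], doubling `repeat`
// until a single batch takes at least min_runtime seconds.
// A and B may be raw arrays or the pointers from std::vector<double>::data().
BenchResult run_add_benchmark(double* A, const double* B, int N, double min_runtime) {
    using clock = std::chrono::steady_clock;
    double runtime = 0.0;
    int repeat = 1;
    for (;;) {
        auto start = clock::now();
        for (int r = 0; r < repeat; r++) {
            #pragma omp parallel for
            for (int i = 0; i < N; i++) {
                A[i] += B[i];
            }
        }
        runtime = std::chrono::duration<double>(clock::now() - start).count();
        if (runtime >= min_runtime) break;
        repeat *= 2;
    }
    return {runtime, repeat};
}
```

Calling it once with array pointers and once with `V.data()` from two vectors of the same size exercises the same loop body on both kinds of storage.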
MFlops: ((N * repeat) / runtime) / 1000000