The ideal way to do this is to let the compiler auto-vectorize your code while keeping the code simple and straightforward. You do not need anything more than:
int sum = 0;
for(int i = 0; i < v.size(); i++) sum += v[i];
The author of the link you pointed to, http://fastcpp.blogspot.com.au/2011/04/how-to-process-stl-vector-using-sse.html, apparently does not know how to get the compiler to vectorize the code.
For the floating-point code used at that link, you need to know that floating-point arithmetic is not associative, so the result depends on the order in which you perform the reduction. GCC, MSVC, and Clang will not auto-vectorize a reduction unless you tell them to use a different floating-point model; otherwise your result could depend on your hardware. ICC, however, defaults to associative floating-point math, so it will vectorize the code with e.g. -O3.
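To see why the order matters, here is a minimal illustration (my own example, not from the link): reassociating a floating-point sum, which is exactly what a vectorized reduction does, can change the result.

#include <cstdio>

int main(void) {
    float big = 1e8f, small = 1.0f;
    // Standard left-to-right evaluation: small is lost in the rounding of big + small.
    float a = (big + small) - big;   // 0.0f
    // Reassociated order, the kind of reordering a vectorized reduction performs.
    float b = (big - big) + small;   // 1.0f
    printf("%f %f\n", a, b);         // prints 0.000000 1.000000
    return 0;
}

Under the default (strict) floating-point model the compiler must preserve the written order, which is why it refuses to vectorize the reduction on its own.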
Not only will GCC, MSVC, and Clang not vectorize the reduction unless associative math is allowed, they also will not unroll the loop into independent partial sums to overcome the latency of the summation. In this case only Clang and ICC unroll to partial sums anyway: Clang unrolls four times and ICC twice.
One way to enable associative floating-point arithmetic with GCC is the -Ofast flag; with MSVC it is /fp:fast.
I tested the code below with GCC 4.9.2 on a Xeon E5-1620 (IVB) @ 3.60 GHz running Ubuntu 15.04.
-O3 -mavx -fopenmp                    0.93 s
-Ofast -mavx -fopenmp                 0.19 s
-Ofast -mavx -fopenmp -funroll-loops  0.19 s
That is about a five-fold improvement. Note that although GCC unrolls the loop eight times, it does not use independent partial sums (see the assembly below). This is the reason the unrolled version is no better.
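For comparison, here is what a hand-unrolled reduction with independent partial sums looks like (my own sketch, not from the original benchmark); this is the transformation the compilers only perform automatically when associative math is allowed.

// Four independent accumulators hide the latency of the dependent
// floating-point add chain; the final combine reassociates the sum,
// which is why a strict FP model forbids the compiler from doing this itself.
float sumf_partial(const float *x, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i + 0];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++) s0 += x[i];   // remainder
    return (s0 + s1) + (s2 + s3);
}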
I only used OpenMP for its convenient cross-platform/cross-compiler timing function: omp_get_wtime().
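If you prefer not to pull in OpenMP just for the timer, a std::chrono equivalent would look like this (my addition, not part of the original test):

#include <chrono>

// Seconds since an arbitrary epoch, used the same way as omp_get_wtime().
double wtime(void) {
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
}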
Another advantage of auto-vectorization is that it works for AVX simply by enabling a compiler switch (e.g. -mavx). Otherwise, if you wanted AVX, you would have to rewrite your code to use the AVX intrinsics, and perhaps ask another question about how to do that.
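For a rough idea of what that rewrite involves, here is an intrinsics sketch of the same reduction (my own illustration, assuming AVX is available and n is a multiple of 8; it is not part of the benchmark below):

#include <immintrin.h>

// Sums n floats with AVX intrinsics; n is assumed to be a multiple of 8.
float sumf_avx(const float *x, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));
    // Horizontal sum: fold the upper 128-bit lane into the lower one, then reduce.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}

The point is that every new instruction set (SSE, AVX, AVX-512) would require another rewrite like this, whereas the auto-vectorized version only needs a different compiler flag.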
So at present, the only compiler that will both auto-vectorize your loop and unroll to four partial sums is Clang. See the code and assembly at the end of this answer.
Here is the code I used to test the performance
#include <stdio.h>
#include <omp.h>
#include <vector>

float sumf(float *x, int n) {
    float sum = 0;
    for(int i=0; i<n; i++) sum += x[i];
    return sum;
}

#define N 10000 // the link used this value

int main(void) {
    std::vector<float> x;
    for(int i=0; i<N; i++) x.push_back(1 - 2*(i%2==0));
    //float x[N]; for(int i=0; i<N; i++) x[i] = 1 - 2*(i%2==0);
    float sum = 0;
    sum += sumf(x.data(), N);
    double dtime = -omp_get_wtime();
    for(int r=0; r<100000; r++) {
        sum += sumf(x.data(), N);
    }
    dtime += omp_get_wtime();
    printf("sum %f time %f\n", sum, dtime);
}
Edit:
I had to take my own advice and look at the assembly.
The main loop for -O3. It clearly only does a scalar sum.
.L3:
    vaddss (%rdi), %xmm0, %xmm0
    addq   $4, %rdi
    cmpq   %rax, %rdi
    jne    .L3
The main loop for -Ofast. It does a vectorized sum, but is not unrolled.
.L8:
    addl   $1, %eax
    vaddps (%r8), %ymm1, %ymm1
    addq   $32, %r8
    cmpl   %eax, %ecx
    ja     .L8
The main loop for -Ofast -funroll-loops. It does a vectorized sum unrolled eight times.
.L8:
    vaddps (%rax), %ymm1, %ymm2
    addl   $8, %ebx
    addq   $256, %rax
    vaddps -224(%rax), %ymm2, %ymm3
    vaddps -192(%rax), %ymm3, %ymm4
    vaddps -160(%rax), %ymm4, %ymm5
    vaddps -128(%rax), %ymm5, %ymm6
    vaddps -96(%rax), %ymm6, %ymm7
    vaddps -64(%rax), %ymm7, %ymm8
    vaddps -32(%rax), %ymm8, %ymm1
    cmpl   %ebx, %r9d
    ja     .L8
Edit:
Compiling the following code with Clang 3.7 (-O3 -fverbose-asm -mavx)
float sumi(int *x) {
    x = (int*)__builtin_assume_aligned(x, 64);
    int sum = 0;
    for(int i=0; i<2048; i++) sum += x[i];
    return sum;
}
produces the following assembly. Note that it is vectorized to four independent partial sums.
sumi(int*):                             # @sumi(int*)
    vpxor   xmm0, xmm0, xmm0
    xor     eax, eax
    vpxor   xmm1, xmm1, xmm1
    vpxor   xmm2, xmm2, xmm2
    vpxor   xmm3, xmm3, xmm3
.LBB0_1:                                # %vector.body
    vpaddd  xmm0, xmm0, xmmword ptr [rdi + 4*rax]
    vpaddd  xmm1, xmm1, xmmword ptr [rdi + 4*rax + 16]
    vpaddd  xmm2, xmm2, xmmword ptr [rdi + 4*rax + 32]
    vpaddd  xmm3, xmm3, xmmword ptr [rdi + 4*rax + 48]
    vpaddd  xmm0, xmm0, xmmword ptr [rdi + 4*rax + 64]
    vpaddd  xmm1, xmm1, xmmword ptr [rdi + 4*rax + 80]
    vpaddd  xmm2, xmm2, xmmword ptr [rdi + 4*rax + 96]
    vpaddd  xmm3, xmm3, xmmword ptr [rdi + 4*rax + 112]
    add     rax, 32
    cmp     rax, 2048
    jne     .LBB0_1
    vpaddd  xmm0, xmm1, xmm0
    vpaddd  xmm0, xmm2, xmm0
    vpaddd  xmm0, xmm3, xmm0
    vpshufd xmm1, xmm0, 78              # xmm1 = xmm0[2,3,0,1]
    vpaddd  xmm0, xmm0, xmm1
    vphaddd xmm0, xmm0, xmm0
    vmovd   eax, xmm0
    vxorps  xmm0, xmm0, xmm0
    vcvtsi2ss xmm0, xmm0, eax
    ret