Here's a very simple SSE3 implementation:
#include <pmmintrin.h>  // SSE3 (needed for _mm_hadd_pd)

__m128d vsum = _mm_set1_pd(0.0);
for (int i = 0; i < n; i += 2)
{
    __m128d v = _mm_load_pd(&a[i]);
    vsum = _mm_add_pd(vsum, v);
}
vsum = _mm_hadd_pd(vsum, vsum);
double sum = _mm_cvtsd_f64(vsum);
You can unroll the loop to get much better performance, using multiple accumulators to hide the latency of the FP add (as suggested by @Mysticial). Unroll 3 or 4 times with several "vsum" vectors so you bottleneck on load and FP-add throughput (one or two per cycle) instead of FP-add latency (one result per 3 or 4 cycles):
__m128d vsum0 = _mm_setzero_pd();
__m128d vsum1 = _mm_setzero_pd();
for (int i = 0; i < n; i += 4)
{
    __m128d v0 = _mm_load_pd(&a[i]);
    __m128d v1 = _mm_load_pd(&a[i + 2]);
    vsum0 = _mm_add_pd(vsum0, v0);
    vsum1 = _mm_add_pd(vsum1, v1);
}
vsum0 = _mm_add_pd(vsum0, vsum1);   // vertical ops down to one accumulator
vsum0 = _mm_hadd_pd(vsum0, vsum0);  // horizontal add of the single register
double sum = _mm_cvtsd_f64(vsum0);
Note that the array a is assumed to be 16-byte aligned, and the number of elements n is assumed to be a multiple of 2 (or 4, in the case of the unrolled loop).
See also Fastest way to do horizontal float vector sum on x86 for alternative ways to do the horizontal sum outside the loop. Support for SSE3 is not completely universal (notably, AMD processors gained support for it later than Intel).
In addition, _mm_hadd_pd is usually not the fastest option even on CPUs that support it, so an SSE2-only version will be no worse on modern CPUs. It's outside the loop anyway, so it's not a big deal.
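For reference, here is a sketch of that SSE2-only horizontal add (the helper name `hsum_pd_sse2` is my own): broadcast the high lane down with a shuffle/unpack and do one scalar add, with no SSE3 instruction needed.

```c
#include <emmintrin.h>  /* SSE2 only */

/* SSE2-only horizontal add of the two lanes of a __m128d:
   duplicate the high element into both lanes, then add it to
   the low lane and extract the scalar result. */
static double hsum_pd_sse2(__m128d v)
{
    __m128d high = _mm_unpackhi_pd(v, v);        /* both lanes = high element */
    return _mm_cvtsd_f64(_mm_add_sd(v, high));   /* low + high */
}
```

Replacing the _mm_hadd_pd line in the loops above with this helper keeps the code SSE2-clean and avoids hadd's extra shuffle uop on most microarchitectures.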