Here's a very simple SSE3 implementation:
#include <pmmintrin.h>  // SSE3 (needed for _mm_hadd_pd)

__m128d vsum = _mm_set1_pd(0.0);
for (int i = 0; i < n; i += 2)
{
    __m128d v = _mm_load_pd(&a[i]);
    vsum = _mm_add_pd(vsum, v);
}
vsum = _mm_hadd_pd(vsum, vsum);
double sum = _mm_cvtsd_f64(vsum);
You can unroll the loop to get much better performance, using multiple accumulators to hide the latency of the FP add (as suggested by @Mysticial). Unroll 3 or 4 times with several "vsum" vectors so you bottleneck on load and FP-add throughput (one or two per cycle) instead of FP-add latency (one result per 3 or 4 cycles):
__m128d vsum0 = _mm_setzero_pd();
__m128d vsum1 = _mm_setzero_pd();
for (int i = 0; i < n; i += 4)
{
    __m128d v0 = _mm_load_pd(&a[i]);
    __m128d v1 = _mm_load_pd(&a[i + 2]);
    vsum0 = _mm_add_pd(vsum0, v0);
    vsum1 = _mm_add_pd(vsum1, v1);
}
vsum0 = _mm_add_pd(vsum0, vsum1);   // vertical ops down to one accumulator
vsum0 = _mm_hadd_pd(vsum0, vsum0);  // horizontal add of the single register
double sum = _mm_cvtsd_f64(vsum0);
Note that the array a is assumed to be 16-byte aligned, and the number of elements n is assumed to be a multiple of 2 (or 4, in the case of the unrolled loop).
See also Fastest way to do horizontal float vector sum on x86 for alternative ways to do the horizontal sum outside the loop. Support for SSE3 is not completely universal (notably, AMD processors gained support for it later than Intel).
In addition, _mm_hadd_pd is usually not the fastest option even on CPUs that support it, so an SSE2-only version will be no worse on modern CPUs. It's outside the loop anyway, so it's not a big deal.
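For reference, here is a sketch of that SSE2-only horizontal add (the helper name `hsum_pd_sse2` is my own): broadcast the high lane down with a shuffle/unpack and do one scalar add, with no SSE3 instruction needed.

```c
#include <emmintrin.h>  /* SSE2 only */

/* SSE2-only horizontal add of the two lanes of a __m128d:
   duplicate the high element into both lanes, then add it to
   the low lane and extract the scalar result. */
static double hsum_pd_sse2(__m128d v)
{
    __m128d high = _mm_unpackhi_pd(v, v);        /* both lanes = high element */
    return _mm_cvtsd_f64(_mm_add_sd(v, high));   /* low + high */
}
```

Replacing the _mm_hadd_pd line in the loops above with this helper keeps the code SSE2-clean and avoids hadd's extra shuffle uop on most microarchitectures.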