The ideal way to do this is to let the compiler auto-vectorize your code while keeping the code simple and straightforward. You do not need anything more than:
int sum = 0;
for(int i = 0; i < v.size(); i++) sum += v[i];
The author of the link you pointed to, http://fastcpp.blogspot.com.au/2011/04/how-to-process-stl-vector-using-sse.html, apparently does not know how to get the compiler to vectorize the code.
For the floating-point code used at that link, you need to know that floating-point arithmetic is not associative, so the result depends on the order in which you perform the reduction. GCC, MSVC, and Clang will not auto-vectorize a reduction unless you tell them to use a different floating-point model; otherwise your result could depend on your hardware. ICC, however, defaults to associative floating-point math, so it will vectorize the code with e.g. -O3.
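To see why the order matters, here is a minimal illustration (my own example, not from the link): reassociating a floating-point sum, which is exactly what a vectorized reduction does, can change the result.

#include <cstdio>

int main(void) {
    float big = 1e8f, small = 1.0f;
    // Standard left-to-right evaluation: small is lost in the rounding of big + small.
    float a = (big + small) - big;   // 0.0f
    // Reassociated order, the kind of reordering a vectorized reduction performs.
    float b = (big - big) + small;   // 1.0f
    printf("%f %f\n", a, b);         // prints 0.000000 1.000000
    return 0;
}

Under the default (strict) floating-point model the compiler must preserve the written order, which is why it refuses to vectorize the reduction on its own.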
Not only will GCC, MSVC, and Clang not vectorize the reduction unless associative math is allowed, they also will not unroll the loop into independent partial sums to overcome the latency of the summation. In this case only Clang and ICC unroll to partial sums anyway: Clang unrolls four times and ICC twice.
One way to enable associative floating-point arithmetic with GCC is the -Ofast flag; with MSVC it is /fp:fast.
I tested the code below with GCC 4.9.2 on a Xeon E5-1620 (IVB) @ 3.60 GHz running Ubuntu 15.04.
-O3 -mavx -fopenmp                    0.93 s
-Ofast -mavx -fopenmp                 0.19 s
-Ofast -mavx -fopenmp -funroll-loops  0.19 s
That is about a five-fold improvement. Note that although GCC unrolls the loop eight times, it does not use independent partial sums (see the assembly below). This is the reason the unrolled version is no better.
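For comparison, here is what a hand-unrolled reduction with independent partial sums looks like (my own sketch, not from the original benchmark); this is the transformation the compilers only perform automatically when associative math is allowed.

// Four independent accumulators hide the latency of the dependent
// floating-point add chain; the final combine reassociates the sum,
// which is why a strict FP model forbids the compiler from doing this itself.
float sumf_partial(const float *x, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i + 0];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++) s0 += x[i];   // remainder
    return (s0 + s1) + (s2 + s3);
}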
I only used OpenMP for its convenient cross-platform/cross-compiler timing function: omp_get_wtime().
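If you prefer not to pull in OpenMP just for the timer, a std::chrono equivalent would look like this (my addition, not part of the original test):

#include <chrono>

// Seconds since an arbitrary epoch, used the same way as omp_get_wtime().
double wtime(void) {
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
}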
Another advantage of auto-vectorization is that it works for AVX simply by enabling a compiler switch (e.g. -mavx). Otherwise, if you wanted AVX, you would have to rewrite your code to use the AVX intrinsics, and perhaps ask another question about how to do that.
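For a rough idea of what that rewrite involves, here is an intrinsics sketch of the same reduction (my own illustration, assuming AVX is available and n is a multiple of 8; it is not part of the benchmark below):

#include <immintrin.h>

// Sums n floats with AVX intrinsics; n is assumed to be a multiple of 8.
float sumf_avx(const float *x, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));
    // Horizontal sum: fold the upper 128-bit lane into the lower one, then reduce.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}

The point is that every new instruction set (SSE, AVX, AVX-512) would require another rewrite like this, whereas the auto-vectorized version only needs a different compiler flag.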
So at present, the only compiler that will both auto-vectorize your loop and unroll to four partial sums is Clang. See the code and assembly at the end of this answer.
Here is the code I used to test the performance
#include <stdio.h>
#include <omp.h>
#include <vector>

float sumf(float *x, int n) {
    float sum = 0;
    for(int i=0; i<n; i++) sum += x[i];
    return sum;
}

#define N 10000 // the link used this value

int main(void) {
    std::vector<float> x;
    for(int i=0; i<N; i++) x.push_back(1 - 2*(i%2==0));
    //float x[N]; for(int i=0; i<N; i++) x[i] = 1 - 2*(i%2==0);
    float sum = 0;
    sum += sumf(x.data(), N);
    double dtime = -omp_get_wtime();
    for(int r=0; r<100000; r++) {
        sum += sumf(x.data(), N);
    }
    dtime += omp_get_wtime();
    printf("sum %f time %f\n", sum, dtime);
}
Edit:
I had to take my own advice and look at the assembly.
The main loop for -O3. It clearly only does a scalar sum.
.L3:
    vaddss (%rdi), %xmm0, %xmm0
    addq   $4, %rdi
    cmpq   %rax, %rdi
    jne    .L3
The main loop for -Ofast. It does a vectorized sum, but is not unrolled.
.L8:
    addl   $1, %eax
    vaddps (%r8), %ymm1, %ymm1
    addq   $32, %r8
    cmpl   %eax, %ecx
    ja     .L8
The main loop for -Ofast -funroll-loops. It does a vectorized sum unrolled eight times.
.L8:
    vaddps (%rax), %ymm1, %ymm2
    addl   $8, %ebx
    addq   $256, %rax
    vaddps -224(%rax), %ymm2, %ymm3
    vaddps -192(%rax), %ymm3, %ymm4
    vaddps -160(%rax), %ymm4, %ymm5
    vaddps -128(%rax), %ymm5, %ymm6
    vaddps -96(%rax), %ymm6, %ymm7
    vaddps -64(%rax), %ymm7, %ymm8
    vaddps -32(%rax), %ymm8, %ymm1
    cmpl   %ebx, %r9d
    ja     .L8
Edit:
Compiling the following code with Clang 3.7 (-O3 -fverbose-asm -mavx)
float sumi(int *x) {
    x = (int*)__builtin_assume_aligned(x, 64);
    int sum = 0;
    for(int i=0; i<2048; i++) sum += x[i];
    return sum;
}
produces the following assembly. Note that it is vectorized to four independent partial sums.
sumi(int*):                             # @sumi(int*)
    vpxor   xmm0, xmm0, xmm0
    xor     eax, eax
    vpxor   xmm1, xmm1, xmm1
    vpxor   xmm2, xmm2, xmm2
    vpxor   xmm3, xmm3, xmm3
.LBB0_1:                                # %vector.body
    vpaddd  xmm0, xmm0, xmmword ptr [rdi + 4*rax]
    vpaddd  xmm1, xmm1, xmmword ptr [rdi + 4*rax + 16]
    vpaddd  xmm2, xmm2, xmmword ptr [rdi + 4*rax + 32]
    vpaddd  xmm3, xmm3, xmmword ptr [rdi + 4*rax + 48]
    vpaddd  xmm0, xmm0, xmmword ptr [rdi + 4*rax + 64]
    vpaddd  xmm1, xmm1, xmmword ptr [rdi + 4*rax + 80]
    vpaddd  xmm2, xmm2, xmmword ptr [rdi + 4*rax + 96]
    vpaddd  xmm3, xmm3, xmmword ptr [rdi + 4*rax + 112]
    add     rax, 32
    cmp     rax, 2048
    jne     .LBB0_1
    vpaddd  xmm0, xmm1, xmm0
    vpaddd  xmm0, xmm2, xmm0
    vpaddd  xmm0, xmm3, xmm0
    vpshufd xmm1, xmm0, 78              # xmm1 = xmm0[2,3,0,1]
    vpaddd  xmm0, xmm0, xmm1
    vphaddd xmm0, xmm0, xmm0
    vmovd   eax, xmm0
    vxorps  xmm0, xmm0, xmm0
    vcvtsi2ss xmm0, xmm0, eax
    ret