C, SSE2 dot product and gcc -O3 assembly

I need to write a dot product using SSE2 (no _mm_dp_ps or _mm_hadd_ps):

#include <xmmintrin.h>

inline __m128 sse_dot4(__m128 a, __m128 b)
{
    const __m128 mult  = _mm_mul_ps(a, b);
    const __m128 shuf1 = _mm_shuffle_ps(mult, mult, _MM_SHUFFLE(0, 3, 2, 1));
    const __m128 shuf2 = _mm_shuffle_ps(mult, mult, _MM_SHUFFLE(1, 0, 3, 2));
    const __m128 shuf3 = _mm_shuffle_ps(mult, mult, _MM_SHUFFLE(2, 1, 0, 3));

    return _mm_add_ss(_mm_add_ss(_mm_add_ss(mult, shuf1), shuf2), shuf3);
}

but when I looked at the assembly generated by gcc 4.9 (experimental) with -O3, I got:

    mulps   %xmm1, %xmm0
    movaps  %xmm0, %xmm3    // these three movaps
    movaps  %xmm0, %xmm2    // have no use,
    movaps  %xmm0, %xmm1    // don't they?
    shufps  $57, %xmm0, %xmm3
    shufps  $78, %xmm0, %xmm2
    shufps  $147, %xmm0, %xmm1
    addss   %xmm3, %xmm0
    addss   %xmm2, %xmm0
    addss   %xmm1, %xmm0
    ret

I am wondering why gcc copies xmm0 into xmm1, xmm2, and xmm3... Here is the output I get with the flag -march=native (it looks better):

    vmulps  %xmm1, %xmm0, %xmm1
    vshufps $78, %xmm1, %xmm1, %xmm2
    vshufps $57, %xmm1, %xmm1, %xmm3
    vshufps $147, %xmm1, %xmm1, %xmm0
    vaddss  %xmm3, %xmm1, %xmm1
    vaddss  %xmm2, %xmm1, %xmm1
    vaddss  %xmm0, %xmm1, %xmm0
    ret
+4
4 answers

(In fact, despite everything written above, none of the answers posted at the time I asked this question matched what I was expecting. Here is the answer I was waiting for.)

The SSE instruction

 shufps $IMM, xmmA, xmmB 

doesn't work like

 xmmB = f($IMM, xmmA) //set xmmB with xmmA words shuffled according to $IMM 

but as

 xmmB = f($IMM, xmmA, xmmB) //set xmmB with 2 words of xmmA and 2 words of xmmB according to $IMM 

and that is why you need the copies of the mulps result from xmm0 into xmm1..xmm3.
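For instance, here is a small self-contained sketch (my own, with arbitrary values) showing the two-source behaviour straight from the intrinsic: the two low elements of the result are picked from the first operand and the two high elements from the second, which in the shufps encoding is the destination register itself:

#include <stdio.h>
#include <xmmintrin.h>

int main(void)
{
    __m128 a = _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f);   /* a = {0, 1, 2, 3} */
    __m128 b = _mm_setr_ps(4.0f, 5.0f, 6.0f, 7.0f);   /* b = {4, 5, 6, 7} */

    /* low two lanes come from a, high two lanes from b */
    __m128 r = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 2, 1, 0));

    float out[4];
    _mm_storeu_ps(out, r);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);  /* 0 1 6 7 */
    return 0;
}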

+1

Here's a dot product that uses only original SSE instructions and also broadcasts the result to every element:

inline __m128 sse_dot4(__m128 v0, __m128 v1)
{
    v0 = _mm_mul_ps(v0, v1);

    v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(2, 3, 0, 1));
    v0 = _mm_add_ps(v0, v1);

    v1 = _mm_shuffle_ps(v0, v0, _MM_SHUFFLE(0, 1, 2, 3));
    v0 = _mm_add_ps(v0, v1);

    return v0;
}

That's 5 SIMD instructions (as opposed to 7), though with less opportunity to hide latency. Any element will hold the result, e.g., float f = _mm_cvtss_f32(sse_dot4(a, b));
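For instance, a quick sanity check (a sketch of my own with arbitrary inputs, relying on the sse_dot4 defined just above) that every lane really holds the same sum:

#include <stdio.h>
#include <xmmintrin.h>

/* sse_dot4 from above goes here */

int main(void)
{
    __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 b = _mm_setr_ps(5.0f, 6.0f, 7.0f, 8.0f);

    float out[4];
    _mm_storeu_ps(out, sse_dot4(a, b));

    /* prints 70 70 70 70: the dot product is in every lane */
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}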

The haddps instruction has pretty terrible latency. With SSE3:

#include <pmmintrin.h>  /* for _mm_hadd_ps (SSE3) */

inline __m128 sse_dot4(__m128 v0, __m128 v1)
{
    v0 = _mm_mul_ps(v0, v1);
    v0 = _mm_hadd_ps(v0, v0);
    v0 = _mm_hadd_ps(v0, v0);
    return v0;
}

This is probably slower even though it is only 3 SIMD instructions. If you can do more than one dot product at a time, you can interleave the instructions of the first version. Shuffle runs very fast on more recent microarchitectures.
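As a sketch of what that interleaving could look like (the helper name and signature are mine, built from the 5-instruction version above; the two dependency chains are independent, so the shuffles and adds can overlap):

#include <xmmintrin.h>

static inline void sse_dot4_x2(__m128 a0, __m128 b0,
                               __m128 a1, __m128 b1,
                               float *r0, float *r1)
{
    __m128 m0 = _mm_mul_ps(a0, b0);
    __m128 m1 = _mm_mul_ps(a1, b1);

    /* pairwise sums of the two independent products */
    __m128 s0 = _mm_add_ps(m0, _mm_shuffle_ps(m0, m0, _MM_SHUFFLE(2, 3, 0, 1)));
    __m128 s1 = _mm_add_ps(m1, _mm_shuffle_ps(m1, m1, _MM_SHUFFLE(2, 3, 0, 1)));

    /* final reduction: every lane now holds the full dot product */
    s0 = _mm_add_ps(s0, _mm_shuffle_ps(s0, s0, _MM_SHUFFLE(0, 1, 2, 3)));
    s1 = _mm_add_ps(s1, _mm_shuffle_ps(s1, s1, _MM_SHUFFLE(0, 1, 2, 3)));

    *r0 = _mm_cvtss_f32(s0);
    *r1 = _mm_cvtss_f32(s1);
}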

+5

The first listing you posted targets plain SSE. Most SSE instructions only support a two-operand syntax: instructions are of the form a = a OP b.

In your code, a is mult. So if no copy were made and mult (xmm0 in your example) were used directly, its value would be overwritten and lost for the remaining _mm_shuffle_ps calls.

By passing -march=native for the second listing, you enabled AVX instructions. AVX gives the SSE instructions a three-operand syntax: c = a OP b. In that case neither source operand has to be overwritten, so you do not need the extra copies.

+4

Let me assume that if you want to use SIMD for dot products, you are looking for a way to operate on many vectors at once. For example, with SSE, if you have four vectors and you want the dot product of each with a fixed vector, you arrange the data as (xxxx), (yyyy), (zzzz), (wwww), then multiply and add the SSE vectors and get the results of four dot products at once. That gives you 100% efficiency (a four-times speedup), and it is not limited to 4-component vectors: it is just as efficient for n-component vectors. Here is an example that uses SSE.

#include <xmmintrin.h>
#include <stdio.h>

void dot4x4(float *aosoa, float *b, float *out)
{
    __m128 vx = _mm_load_ps(&aosoa[0]);
    __m128 vy = _mm_load_ps(&aosoa[4]);
    __m128 vz = _mm_load_ps(&aosoa[8]);
    __m128 vw = _mm_load_ps(&aosoa[12]);

    __m128 brod1 = _mm_set1_ps(b[0]);
    __m128 brod2 = _mm_set1_ps(b[1]);
    __m128 brod3 = _mm_set1_ps(b[2]);
    __m128 brod4 = _mm_set1_ps(b[3]);

    __m128 dot4 = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(brod1, vx), _mm_mul_ps(brod2, vy)),
        _mm_add_ps(_mm_mul_ps(brod3, vz), _mm_mul_ps(brod4, vw)));

    _mm_store_ps(out, dot4);
}

int main()
{
    float *aosoa = (float*)_mm_malloc(sizeof(float)*16, 16);
    /* initialize array to AoSoA vectors
       v1 = (0,1,2,3), v2 = (4,5,6,7), v3 = (8,9,10,11), v4 = (12,13,14,15) */
    float a[] = {
        0, 4,  8, 12,
        1, 5,  9, 13,
        2, 6, 10, 14,
        3, 7, 11, 15,
    };
    for (int i = 0; i < 16; i++) aosoa[i] = a[i];

    float *out = (float*)_mm_malloc(sizeof(float)*4, 16);
    float b[] = {1, 1, 1, 1};

    dot4x4(aosoa, b, out);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);

    _mm_free(aosoa);
    _mm_free(out);
}
+4

Source: https://habr.com/ru/post/1485184/

