Load and add with SSE

Suppose I have two vectors, represented by two arrays of type double, each of length 2. I would like to add the corresponding positions. So, given the vectors i0 and i1, I would like to compute i0[0] + i1[0] and i0[1] + i1[1].

Since the type is double, I would need two registers. The trick would be to put i0[0] and i1[0] into one register, and i0[1] and i1[1] into another, and then just add each register with itself.

My question is: if I call _mm_load_ps(i0[0]) and then _mm_load_ps(i1[0]), will it place them in the lower and upper 64 bits separately, or will the second load overwrite the register? How would I put both doubles in the same register, so I can call add_ps afterwards?

Thanks,

+4
2 answers

I think you want:

    double i0[2];   // both arrays must be 16-byte aligned for _mm_load_pd
    double i1[2];

    __m128d x1 = _mm_load_pd(i0);
    __m128d x2 = _mm_load_pd(i1);
    __m128d sum = _mm_add_pd(x1, x2);
    // do whatever you want to with "sum" now

When you do _mm_load_pd, it puts the first double in the lower 64 bits of the register and the second in the upper 64 bits. Thus, after the loads above, x1 contains the two double values i0[0] and i0[1] (and similarly for x2). Calling _mm_add_pd vertically adds the corresponding elements of x1 and x2, so after the addition sum contains i0[0] + i1[0] in its lower 64 bits and i0[1] + i1[1] in its upper 64 bits.
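To make the lane layout concrete, here is a minimal sketch that wraps the load, add and store steps in a function. The helper name add2_aligned and the alignment assertion are illustrative assumptions, not part of the original answer; only the intrinsics themselves come from the SSE2 header.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: __m128d, _mm_load_pd, _mm_add_pd, _mm_store_pd */
#include <stdint.h>

/* Hypothetical helper: out[k] = a[k] + b[k] for k = 0, 1.
   All three pointers are assumed to be 16-byte aligned. */
static void add2_aligned(const double *a, const double *b, double *out)
{
    /* The aligned load/store intrinsics require 16-byte alignment. */
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)b % 16 == 0);
    assert((uintptr_t)out % 16 == 0);

    __m128d x1 = _mm_load_pd(a);       /* low lane = a[0], high lane = a[1] */
    __m128d x2 = _mm_load_pd(b);       /* low lane = b[0], high lane = b[1] */
    __m128d sum = _mm_add_pd(x1, x2);  /* lane-wise (vertical) addition     */

    _mm_store_pd(out, sum);            /* out[0] = low lane, out[1] = high  */
}
```

A caller might declare the inputs as `_Alignas(16) double i0[2]` (C11) so that the alignment assertions hold; plain `double i0[2]` is not guaranteed to be 16-byte aligned.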

Edit: I should point out that there is no benefit in using _mm_load_pd instead of _mm_load_ps. As the function names indicate, the pd variant explicitly loads two packed doubles, and the ps version loads four packed single-precision floats. Since these are purely bit-for-bit memory moves, and both use the SSE floating-point unit, there is no penalty for using _mm_load_ps to load double data. And _mm_load_ps has an advantage: its instruction encoding is one byte shorter than that of _mm_load_pd, so it is more efficient from the point of view of the instruction cache (and, possibly, instruction decoding; I am not an expert on all the intricacies of modern x86 processors). The above code using _mm_load_ps would look like this:

    double i0[2];   // both arrays must be 16-byte aligned for _mm_load_ps
    double i1[2];

    __m128d x1 = (__m128d) _mm_load_ps((float *) i0);
    __m128d x2 = (__m128d) _mm_load_ps((float *) i1);
    __m128d sum = _mm_add_pd(x1, x2);
    // do whatever you want to with "sum" now

The casts do not imply any function call; they simply force the compiler to reinterpret the contents of the SSE register as holding doubles instead of floats, so that it can be passed to the double-precision arithmetic function _mm_add_pd.
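A direct C-style cast between __m128 and __m128d is accepted by GCC and Clang but, as far as I know, not by every compiler. The sketch below shows the same reinterpretation written with the _mm_castps_pd intrinsic, which exists for exactly this purpose and generates no instruction. The helper names load2_via_ps and add2_via_ps are hypothetical.

```c
#include <emmintrin.h>  /* SSE2; also pulls in the SSE header for _mm_load_ps */

/* Hypothetical helper: load two doubles using the single-precision load.
   p is assumed to be 16-byte aligned, just as for _mm_load_pd. */
static __m128d load2_via_ps(const double *p)
{
    /* _mm_load_ps moves 128 bits; _mm_castps_pd only changes the type. */
    return _mm_castps_pd(_mm_load_ps((const float *)p));
}

/* Hypothetical helper: out[k] = a[k] + b[k] for k = 0, 1. */
static void add2_via_ps(const double *a, const double *b, double *out)
{
    __m128d sum = _mm_add_pd(load2_via_ps(a), load2_via_ps(b));
    _mm_store_pd(out, sum);  /* out is also assumed 16-byte aligned */
}
```

The result is the same as with _mm_load_pd; only the load instruction that gets emitted differs.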

+7

The _ps suffix is an abbreviation for "packed single"; that is, it is intended for use with single-precision floating-point values rather than double-precision ones.

Instead, you want _mm_load_pd(). This function takes a 16-byte aligned pointer to the first element of an array of two doubles and loads them both. So you would use it like this:

    __m128d v0 = _mm_load_pd(i0);
    __m128d v1 = _mm_load_pd(i1);
    v0 = _mm_add_pd(v0, v1);
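If 16-byte alignment cannot be guaranteed, one option not mentioned in the answer is the unaligned pair _mm_loadu_pd and _mm_storeu_pd. This is only a sketch; the helper name add2_unaligned is hypothetical.

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_pd, _mm_add_pd, _mm_storeu_pd */

/* Hypothetical helper: out[k] = a[k] + b[k] for k = 0, 1,
   with no alignment requirement on any of the pointers. */
static void add2_unaligned(const double *a, const double *b, double *out)
{
    __m128d v0 = _mm_loadu_pd(a);            /* unaligned 128-bit load  */
    __m128d v1 = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(v0, v1));  /* unaligned 128-bit store */
}
```

This accepts, for example, a pointer into the middle of a larger double array, where the address is only guaranteed to be 8-byte aligned.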
+3

Source: https://habr.com/ru/post/1396106/

