Consequences of using _mm_shuffle_ps for an integer vector

The SSE intrinsics include _mm_shuffle_ps xmm1 xmm2 immx, which lets you select two elements from xmm1 combined with two elements from xmm2. However, it is defined for floats (that is what _ps, packed single, means). If you cast your packed __m128i integers, though, you can use _mm_shuffle_ps on them as well:

 #include <iostream>
 #include <immintrin.h>
 #include <sstream>

 using namespace std;

 template <typename T>
 std::string __m128i_toString(const __m128i var) {
     std::stringstream sstr;
     const T* values = (const T*) &var;
     if (sizeof(T) == 1) {
         for (unsigned int i = 0; i < sizeof(__m128i); i++) {
             sstr << (int) values[i] << " ";
         }
     } else {
         for (unsigned int i = 0; i < sizeof(__m128i) / sizeof(T); i++) {
             sstr << values[i] << " ";
         }
     }
     return sstr.str();
 }

 int main(){
     cout << "Starting SSE test" << endl;
     cout << "integer shuffle" << endl;

     int A[] = {1, -2147483648, 3, 5};
     int B[] = {4, 6, 7, 8};
     __m128i pC;

     __m128i* pA = (__m128i*) A;
     __m128i* pB = (__m128i*) B;

     *pA = (__m128i)_mm_shuffle_ps((__m128)*pA, (__m128)*pB, _MM_SHUFFLE(3, 2, 1 ,0));
     pC = _mm_add_epi32(*pA,*pB);

     cout << "A[0] = " << A[0] << endl;
     cout << "A[1] = " << A[1] << endl;
     cout << "A[2] = " << A[2] << endl;
     cout << "A[3] = " << A[3] << endl;

     cout << "B[0] = " << B[0] << endl;
     cout << "B[1] = " << B[1] << endl;
     cout << "B[2] = " << B[2] << endl;
     cout << "B[3] = " << B[3] << endl;

     cout << "pA = " << __m128i_toString<int>(*pA) << endl;
     cout << "pC = " << __m128i_toString<int>(pC) << endl;
 }

A fragment of the corresponding assembly (Mac OS X, MacPorts gcc 4.8, -march=native on an Ivy Bridge processor):

 vshufps $228, 16(%rsp), %xmm1, %xmm0
 vpaddd 16(%rsp), %xmm0, %xmm2
 vmovdqa %xmm0, 32(%rsp)
 vmovaps %xmm0, (%rsp)
 vmovdqa %xmm2, 16(%rsp)
 call __ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
 ....

So it seems to work just fine on integers, as I expected, since the registers are agnostic to types. However, there must be a reason why the docs say that this instruction is for floats only. Does anyone know of any downsides or implications that I have missed?
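To make the element selection concrete, here is a minimal scalar sketch of what the shuffle does to the raw 32-bit lanes. The helper names (shufps_model, hardware_matches_model, check_question_example) are hypothetical and exist only for this illustration. The sketch assumes an x86 target with SSE2 and uses the cast intrinsics instead of C-style vector casts. It also shows that _MM_SHUFFLE(3, 2, 1, 0) encodes to 228, the immediate visible in the vshufps line of the listing above.

```cpp
#include <emmintrin.h>  // SSE2 integer loads/stores and casts; also provides _mm_shuffle_ps
#include <cassert>
#include <cstdint>

// Hypothetical helper: scalar model of SHUFPS acting on raw 32-bit lanes.
// Result lanes 0-1 are picked from a, lanes 2-3 from b; each 2-bit field
// of imm selects the source lane. The bits are copied, never interpreted.
void shufps_model(const std::int32_t a[4], const std::int32_t b[4],
                  unsigned imm, std::int32_t out[4]) {
    out[0] = a[imm & 3];         // imm bits 1:0
    out[1] = a[(imm >> 2) & 3];  // imm bits 3:2
    out[2] = b[(imm >> 4) & 3];  // imm bits 5:4
    out[3] = b[(imm >> 6) & 3];  // imm bits 7:6
}

// The immediate from the question: (3 << 6) | (2 << 4) | (1 << 2) | 0.
static_assert(_MM_SHUFFLE(3, 2, 1, 0) == 228, "same immediate as in the listing");

// Hypothetical helper: run the real instruction on integer data via casts
// and compare the result with the scalar model.
template <int Imm>
bool hardware_matches_model(const std::int32_t a[4], const std::int32_t b[4]) {
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i vr = _mm_castps_si128(
        _mm_shuffle_ps(_mm_castsi128_ps(va), _mm_castsi128_ps(vb), Imm));

    std::int32_t got[4], want[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(got), vr);
    shufps_model(a, b, Imm, want);
    for (int i = 0; i < 4; ++i) {
        if (got[i] != want[i]) return false;
    }
    return true;
}

// Uses the values from the question, including INT32_MIN (bit pattern 0x80000000).
void check_question_example() {
    const std::int32_t a[4] = {1, INT32_MIN, 3, 5};
    const std::int32_t b[4] = {4, 6, 7, 8};
    assert((hardware_matches_model<_MM_SHUFFLE(3, 2, 1, 0)>(a, b)));
}
```

Calling check_question_example() from main should pass silently. This only confirms that the bit patterns come through unchanged; it says nothing about performance.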

1 answer

There is no integer equivalent of _mm_shuffle_ps. To achieve the same effect in this case, you can use

SSE2

 *pA = _mm_shuffle_epi32(_mm_unpacklo_epi32(*pA, _mm_shuffle_epi32(*pB, 0xe)),0xd8); 

SSE4.1

 *pA = _mm_blend_epi16(*pA, *pB, 0xf0); 

or switch to the floating-point domain like this

 *pA = _mm_castps_si128( _mm_shuffle_ps(_mm_castsi128_ps(*pA), _mm_castsi128_ps(*pB), _MM_SHUFFLE(3, 2, 1 ,0))); 

But switching domains can incur bypass delays on some processors. Keep in mind that, according to Agner Fog:

The bypass delay is important in long dependency chains where latency is a bottleneck, but not where it is throughput rather than latency that matters.

You need to benchmark your code to find out which of the methods above is more efficient.
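Before timing anything, it may help to confirm that the variants really produce the same bits. The following is a minimal sketch, assuming an x86 target with SSE2; the function names (lo_a_hi_b_sse2, lo_a_hi_b_shufps, same_bits, check_variants) are hypothetical. The SSE4.1 blend variant is left out only because it needs -msse4.1 or an equivalent compiler option.

```cpp
#include <emmintrin.h>  // SSE2; also provides _mm_shuffle_ps and _MM_SHUFFLE
#include <cassert>
#include <cstdint>

// Hypothetical name: SSE2-only integer sequence yielding (a0, a1, b2, b3).
inline __m128i lo_a_hi_b_sse2(__m128i a, __m128i b) {
    __m128i b_hi  = _mm_shuffle_epi32(b, 0xe);     // (b2, b3, b0, b0)
    __m128i mixed = _mm_unpacklo_epi32(a, b_hi);   // (a0, b2, a1, b3)
    return _mm_shuffle_epi32(mixed, 0xd8);         // (a0, a1, b2, b3)
}

// Hypothetical name: same result with a single shufps in the FP domain.
inline __m128i lo_a_hi_b_shufps(__m128i a, __m128i b) {
    return _mm_castps_si128(_mm_shuffle_ps(_mm_castsi128_ps(a),
                                           _mm_castsi128_ps(b),
                                           _MM_SHUFFLE(3, 2, 1, 0)));
}

// True when all four 32-bit lanes of x and y are bitwise equal.
inline bool same_bits(__m128i x, __m128i y) {
    return _mm_movemask_epi8(_mm_cmpeq_epi32(x, y)) == 0xFFFF;
}

// Checks both variants against the expected lanes for the question's data.
void check_variants() {
    __m128i a    = _mm_setr_epi32(1, INT32_MIN, 3, 5);
    __m128i b    = _mm_setr_epi32(4, 6, 7, 8);
    __m128i want = _mm_setr_epi32(1, INT32_MIN, 7, 8);
    assert(same_bits(lo_a_hi_b_sse2(a, b), want));
    assert(same_bits(lo_a_hi_b_shufps(a, b), want));
}
```

The integer version costs three shuffle instructions against one for shufps, so the two should be timed inside the real dependency chain rather than in isolation.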

Fortunately, on most Intel/AMD processors there is usually no penalty for using shufps between most integer-vector instructions. Agner says:

For example, I did not find any delays when mixing PADDD and shufps [on Sandybridge].

Nehalem has 2 cycles of bypass delay to/from shufps, but even then a single shufps is often still faster than several other instructions. Extra instructions have latency too, as well as throughput costs.


The reverse (integer shuffles between FP math instructions) is not as safe:

In Agner Fog's microarchitecture manual, on page 112 in Example 8.3a, he shows that using PSHUFD (_mm_shuffle_epi32) instead of SHUFPS (_mm_shuffle_ps) in the floating-point domain causes a bypass delay of four clock cycles. In Example 8.3b he uses SHUFPS to remove the delay (which works in his example).

Nehalem actually has five domains. Nehalem seems to be the most affected (bypass delays did not exist before Nehalem). On Sandy Bridge the delays are less significant. This is even more true for Haswell. In fact, Agner says that on Haswell he found no delays between SHUFPS and PSHUFD (see page 140).
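As an illustration of the reverse case, the sketch below reverses the lanes of a float vector in two ways: with shufps, which stays in the FP domain, and with pshufd through casts, which on some of the microarchitectures mentioned above may add a bypass delay when placed between FP math instructions. This is not Agner's example; the function names are hypothetical, and the code assumes an x86 target with SSE2.

```cpp
#include <emmintrin.h>  // SSE2; also provides the SSE float intrinsics
#include <cassert>

// Hypothetical name: lane reversal with SHUFPS (FP-domain shuffle).
inline __m128 reverse_fp_domain(__m128 x) {
    return _mm_shuffle_ps(x, x, _MM_SHUFFLE(0, 1, 2, 3));
}

// Hypothetical name: the same reversal with PSHUFD (integer-domain shuffle).
// The casts are free; only the shuffle instruction itself changes domain.
inline __m128 reverse_int_domain(__m128 x) {
    return _mm_castsi128_ps(
        _mm_shuffle_epi32(_mm_castps_si128(x), _MM_SHUFFLE(0, 1, 2, 3)));
}

// FP math fed by a shuffle: this is the pattern where the choice of
// shuffle can matter for latency.
inline __m128 add_reversed(__m128 x) {
    return _mm_add_ps(x, reverse_fp_domain(x));
}

// Both reversals must give identical lanes; only their timing may differ.
void check_reversals() {
    __m128 x = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    float r1[4], r2[4], s[4];
    _mm_storeu_ps(r1, reverse_fp_domain(x));
    _mm_storeu_ps(r2, reverse_int_domain(x));
    _mm_storeu_ps(s, add_reversed(x));
    for (int i = 0; i < 4; ++i) {
        assert(r1[i] == r2[i]);
        assert(r1[i] == static_cast<float>(4 - i));
        assert(s[i] == 5.0f);
    }
}
```

Compilers may pick a different shuffle instruction than the intrinsic suggests, so it is worth inspecting the generated assembly before drawing conclusions from a benchmark.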


Source: https://habr.com/ru/post/1270330/

