For integers, there is no direct equivalent of _mm_shuffle_ps. To get the same effect in this case (low 64 bits taken from *pA, high 64 bits from *pB), you can use one of the following.
SSE2
*pA = _mm_shuffle_epi32(_mm_unpacklo_epi32(*pA, _mm_shuffle_epi32(*pB, 0xe)),0xd8);
SSE4.1
*pA = _mm_blend_epi16(*pA, *pB, 0xf0);
or switch to the floating-point domain like this:
*pA = _mm_castps_si128(_mm_shuffle_ps(_mm_castsi128_ps(*pA), _mm_castsi128_ps(*pB), _MM_SHUFFLE(3, 2, 1, 0)));
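To check that these variants really do the same thing, here is a minimal sketch that wraps the SSE2 version and the cast-to-float version in functions. The function names (merge_lo_hi_sse2, merge_lo_hi_shufps, store_lanes) are made up for illustration, and the SSE4.1 blend is left out so that it builds with plain SSE2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in SSE (_mm_shuffle_ps, _MM_SHUFFLE) */
#include <stdint.h>

/* Hypothetical helper: SSE2-only integer version, returns { a0, a1, b2, b3 }. */
static __m128i merge_lo_hi_sse2(__m128i a, __m128i b)
{
    __m128i b_hi  = _mm_shuffle_epi32(b, 0x0e);    /* { b2, b3, b0, b0 } */
    __m128i mixed = _mm_unpacklo_epi32(a, b_hi);   /* { a0, b2, a1, b3 } */
    return _mm_shuffle_epi32(mixed, 0xd8);         /* { a0, a1, b2, b3 } */
}

/* Hypothetical helper: same result via the floating-point domain (shufps).
   The casts are free; they only change the type seen by the compiler. */
static __m128i merge_lo_hi_shufps(__m128i a, __m128i b)
{
    return _mm_castps_si128(
        _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b),
                       _MM_SHUFFLE(3, 2, 1, 0)));  /* { a0, a1, b2, b3 } */
}

/* Hypothetical helper: copy the four 32-bit lanes to memory for inspection. */
static void store_lanes(__m128i v, int32_t out[4])
{
    _mm_storeu_si128((__m128i *)out, v);
}
```

With a = { 10, 11, 12, 13 } and b = { 20, 21, 22, 23 } (lowest element first, as built by _mm_setr_epi32), both functions give { 10, 11, 22, 23 }.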
But changing domains may incur bypass delays (extra latency) on some processors. Keep in mind that, according to Agner:
The bypass delay is important in long dependency chains where latency is a bottleneck, but not where it is throughput rather than latency that matters.
You need to test your code to find out which of the methods above is more efficient.
Fortunately, on most Intel / AMD processors there is usually no penalty for using shufps between most integer-vector instructions. Agner says:
For example, I did not find any delays when mixing PADDD and shufps [on Sandybridge].
Nehalem has 2 cycles of bypass delay going to or from shufps, but even then a single shufps is often still faster than several other instructions. The extra instructions have latency of their own, as well as throughput costs.
The reverse (integer shuffles between FP math instructions) is not as safe:
In Agner Fog's microarchitecture manual, Example 8.3a on page 112 shows that using PSHUFD (_mm_shuffle_epi32) instead of SHUFPS (_mm_shuffle_ps) in the floating-point domain causes a bypass delay of four clock cycles. In Example 8.3b, he uses SHUFPS to remove the delay (which works in his example).
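To make the FP-domain case concrete, here is a rough sketch (not Agner's example) of the same add, shuffle, add sequence written both ways. The function names are made up for illustration. Both versions compute the same values; the only intended difference is which shuffle instruction sits between the two FP adds. Note that the compiler is free to pick a different instruction than the intrinsic suggests, so check the generated assembly before drawing conclusions.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in SSE (_mm_add_ps, _mm_shuffle_ps) */

/* Hypothetical helper: shuffle done with SHUFPS, so the data is meant to
   stay in the FP domain between the two adds. */
static __m128 swap_add_shufps(__m128 x, __m128 y)
{
    __m128 s = _mm_add_ps(x, y);
    __m128 t = _mm_shuffle_ps(s, s, _MM_SHUFFLE(2, 3, 0, 1)); /* { s1, s0, s3, s2 } */
    return _mm_add_ps(s, t);
}

/* Hypothetical helper: same shuffle written as PSHUFD, an integer-domain
   instruction placed between two FP adds. This is the pattern that can
   pay a bypass delay on processors such as Nehalem. */
static __m128 swap_add_pshufd(__m128 x, __m128 y)
{
    __m128 s = _mm_add_ps(x, y);
    __m128 t = _mm_castsi128_ps(
        _mm_shuffle_epi32(_mm_castps_si128(s), _MM_SHUFFLE(2, 3, 0, 1))); /* { s1, s0, s3, s2 } */
    return _mm_add_ps(s, t);
}
```

With x = { 1, 2, 3, 4 } and y = { 10, 20, 30, 40 } (lowest element first), both functions return { 33, 33, 77, 77 }. Any difference would only show up in timing, and only when this sits in a latency-bound dependency chain.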
Nehalem actually has five domains. Nehalem seems to be the most affected (bypass delays did not exist before Nehalem). On Sandy Bridge the delays are less significant, and this is even more true on Haswell. In fact, Agner says that on Haswell he found no delays when using either SHUFPS or PSHUFD (see page 140).