I am trying to understand the possible bypass delays when data moves between the domains of the execution units.
For example, the following two lines of code give exactly the same result.
_mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
_mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
Which line of code is better to use?
The build output for the first line gives:
vpslldq xmm1, xmm0, 8
vaddps xmm0, xmm1, xmm0
The build output for the second line gives:
vshufps xmm1, xmm0, XMMWORD PTR [rcx], 64 ; 00000040H
vaddps xmm2, xmm1, XMMWORD PTR [rcx]
Now, Agner Fog's microarchitecture manual gives an example on page 112 that compares an integer shuffle (pshufd) on float values with a float shuffle (shufps) on float values. According to that example, switching domains adds 4 additional clock cycles, so the solution using shufps is better.
The first line of code, with _mm_slli_si128, has to switch domains between integer and float vectors. The second, with _mm_shuffle_ps, stays in the same domain. Doesn't that mean the second line of code is the better solution?
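For reference, here is a minimal sketch that checks the claim that both lines produce the same result. It only tests correctness and says nothing about latency. The helper names (step_int_shift, step_float_shuffle, same_result, check_expected) are made up for this illustration, and it assumes an x86 target with SSE2.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_slli_si128 and the cast intrinsics */
#include <string.h>

/* Variant 1: byte shift in the integer domain, then a float add. */
static __m128 step_int_shift(__m128 x) {
    return _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
}

/* Variant 2: float shuffle against zero, then a float add. */
static __m128 step_float_shuffle(__m128 x) {
    return _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
}

/* Returns 1 if both variants give bit-identical output for the 4 input floats. */
static int same_result(const float in[4]) {
    float a[4], b[4];
    __m128 x = _mm_loadu_ps(in);
    _mm_storeu_ps(a, step_int_shift(x));
    _mm_storeu_ps(b, step_float_shuffle(x));
    return memcmp(a, b, sizeof a) == 0;
}

/* Both variants compute x + {0, 0, x[0], x[1]}. */
static void check_expected(void) {
    const float in[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float out[4];
    _mm_storeu_ps(out, step_float_shuffle(_mm_loadu_ps(in)));
    assert(out[0] == 1.0f && out[1] == 2.0f && out[2] == 4.0f && out[3] == 6.0f);
    assert(same_result(in));
}
```

Calling check_expected() from main should pass silently if the two variants really are equivalent for that input.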