Bypass delays when switching execution unit domains

I am trying to understand the possible bypass delays when switching between execution unit domains.

For example, the following two lines of code give exactly the same result.

    _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
    _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
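As a quick check of the claim that both lines give the same result, here is a minimal sketch that wraps each line in a function and compares the outputs bit for bit. The function names (add_shifted_int, add_shifted_fp, variants_agree, check_example) are made up for illustration and are not part of any library; only the intrinsics come from the question.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE float intrinsics */
#include <string.h>

/* Variant 1: byte shift in the integer domain (pslldq), then a float add. */
static inline __m128 add_shifted_int(__m128 x) {
    return _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
}

/* Variant 2: shuffle in the floating-point domain (shufps), then a float add.
   Immediate 0x40 selects {zero[0], zero[0], x[0], x[1]}. */
static inline __m128 add_shifted_fp(__m128 x) {
    return _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
}

/* Returns 1 if both variants give bit-identical results for {a, b, c, d}. */
static inline int variants_agree(float a, float b, float c, float d) {
    float r1[4], r2[4];
    _mm_storeu_ps(r1, add_shifted_int(_mm_setr_ps(a, b, c, d)));
    _mm_storeu_ps(r2, add_shifted_fp(_mm_setr_ps(a, b, c, d)));
    return memcmp(r1, r2, sizeof r1) == 0;
}

/* Sanity check on one sample input; aborts if the variants disagree. */
static inline void check_example(void) {
    assert(variants_agree(1.0f, 2.0f, 3.0f, 4.0f));
}
```

For the input {1, 2, 3, 4}, both variants add {0, 0, 1, 2} to x and so produce {1, 2, 4, 6}.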

Which line of code is better to use?

The assembly output for the first line is:

    vpslldq xmm1, xmm0, 8
    vaddps xmm0, xmm1, xmm0

The assembly output for the second line is:

    vshufps xmm1, xmm0, XMMWORD PTR [rcx], 64 ; 00000040H
    vaddps xmm2, xmm1, XMMWORD PTR [rcx]

Now, if I look at Agner Fog's microarchitecture manual, he gives an example on page 112 that compares using an integer shuffle (pshufd) on float values with using a float shuffle (shufps) on float values. Switching domains adds 4 extra clock cycles, so the solution using shufps is better.

In the first line of code I listed, which uses _mm_slli_si128, the data has to cross between the integer and floating-point vector domains. The second, which uses _mm_shuffle_ps, stays in the same domain. Doesn't that mean the second line of code is the better solution?

1 answer

Section 2.1.4 of Intel's optimization guide indicates that you (and Agner) are absolutely right about this:

When a source of a micro-op executed in one stack comes from a micro-op executed in another stack, a one- or two-cycle delay can occur. The delay also occurs for transitions between Intel SSE integer and Intel SSE floating-point operations.


In general, it seems that you are better off staying in the same stack / domain as much as possible.

Of course, benchmarking is always preferable, and all this should be considered only if it really is a bottleneck in your code.
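One rough way to benchmark this is to time a long dependent chain of each variant, so that every add has to wait for the preceding shift or shuffle and any bypass delay shows up in the total. The following is only a sketch: step_int, step_fp and time_chain are hypothetical names, clock() is a coarse timer, and the numbers depend on the CPU and compiler. Some compilers may also pick a different instruction than the intrinsic suggests, so it is worth checking the generated assembly before trusting the result.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <time.h>

/* One step of each variant from the question. */
static inline __m128 step_int(__m128 x) {
    return _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
}

static inline __m128 step_fp(__m128 x) {
    return _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40));
}

/* Hypothetical helper: runs `iters` dependent steps of one variant
   (use_fp != 0 selects the shufps version), stores the final vector
   in out[4], and returns the elapsed CPU time in seconds. */
static inline double time_chain(int use_fp, long iters, float out[4]) {
    assert(iters >= 0);
    /* Lanes 0 and 1 stay at 1; lanes 2 and 3 grow by 1 per step. */
    __m128 x = _mm_setr_ps(1.0f, 1.0f, 0.0f, 0.0f);
    clock_t t0 = clock();
    if (use_fp) {
        for (long i = 0; i < iters; ++i) x = step_fp(x);
    } else {
        for (long i = 0; i < iters; ++i) x = step_int(x);
    }
    clock_t t1 = clock();
    _mm_storeu_ps(out, x);  /* storing the result keeps the loop from being discarded */
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Calling time_chain(0, n, out) and time_chain(1, n, out) with the same large n and comparing the two return values gives a first impression of the latency difference on a given machine. Both calls should leave the same values in out.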


Source: https://habr.com/ru/post/956564/
