My code relies on blendv, checking only the most significant bit.
You have two good options:
Broadcast the high bit to every bit within each element using an arithmetic right shift by 31, to set up for VPBLENDVB ( _mm256_blendv_epi8 ), i.e. VPSRAD: mask = _mm256_srai_epi32(mask, 31).
VPSRAD is 1 uop on Intel Haswell, for port 0. (More throughput on Skylake: p01.) If your algorithm bottlenecks on port 0 (e.g. integer multiply and shift), this is not great.
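A minimal sketch of this option (the function name is mine), assuming AVX2 and that only the sign bit of each 32-bit element of mask is meaningful:

```c
#include <immintrin.h>

static inline __m256i blendv_epi32_signbit(__m256i a, __m256i b, __m256i mask)
{
    // VPSRAD: copy bit 31 into all 32 bits of each element, so every byte of
    // an element carries the right sign bit for the byte-granularity blend.
    __m256i bytemask = _mm256_srai_epi32(mask, 31);
    // VPBLENDVB: take b where the mask byte's high bit is set, else a.
    return _mm256_blendv_epi8(a, b, bytemask);
}
```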
Use VBLENDVPS. You are right that all the casts are just to keep the compiler happy, and that VBLENDVPS will do exactly what you want in one instruction.
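A minimal sketch of this option (again the function name is mine): do the select in the FP domain, using casts that compile to no instructions:

```c
#include <immintrin.h>

static inline __m256i blendv_epi32_ps(__m256i a, __m256i b, __m256i mask)
{
    // VBLENDVPS keys only on bit 31 of each 32-bit mask element, which is
    // exactly the sign bit we already have; the casts are just for the types.
    __m256 res = _mm256_blendv_ps(_mm256_castsi256_ps(a),
                                  _mm256_castsi256_ps(b),
                                  _mm256_castsi256_ps(mask));
    return _mm256_castps_si256(res);
}
```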
However, Intel SnB-family CPUs have a bypass-delay latency of 1 cycle when forwarding integer results to the FP blend unit, and another 1c when forwarding the blend result to other integer instructions.
For more about bypass-delay latency, see Agner Fog's microarchitecture guide. This is the reason they don't provide __m256i intrinsics for FP instructions, and vice versa.

Note that since Sandybridge, FP shuffles don't have extra latency to forward from/to instructions like PADDD. So SHUFPS is a great way to combine data from two integer vectors if PUNPCK* or PALIGNR don't do exactly what you want. (SHUFPS on integer data can be worth it even on Nehalem, where it does have a 2c penalty each way.)
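For illustration, a hedged sketch of the SHUFPS-on-integer-data idea (128-bit for brevity; the shuffle pattern here is an arbitrary example, not from the original):

```c
#include <immintrin.h>

static inline __m128i combine_ints_with_shufps(__m128i a, __m128i b)
{
    // SHUFPS with _MM_SHUFFLE(3,1,2,0) produces { a[0], a[2], b[1], b[3] },
    // a mix of two integer vectors that PUNPCK*/PALIGNR can't do directly.
    __m128 r = _mm_shuffle_ps(_mm_castsi128_ps(a),
                              _mm_castsi128_ps(b),
                              _MM_SHUFFLE(3, 1, 2, 0));
    return _mm_castps_si128(r);
}
```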
Try it both ways and benchmark. Either one could be better, depending on the surrounding code.
Latency may not matter compared to uop throughput / instruction count. Also note that if you just store the result to memory, store instructions don't care which domain the data came from.
But if you're using this as part of a long dependency chain, the extra instruction may be worth it to avoid the extra 2 cycles of latency for the data being blended.
Note that if mask generation is on the critical path, then VPSRAD's 1 cycle of latency is equivalent to the bypass-delay latency, so using an FP blend is only 1 extra cycle of latency for the mask->result chain, versus 2 extra cycles for the data->result chain.
For example, what happens when the integer bit pattern is a floating point NaN?
BLENDVPS doesn't care. Intel's insn ref manual fully documents everything an instruction can/can't do, and "SIMD Floating-Point Exceptions: None" means that this is not a problem. See also the x86 tag wiki for links to the docs.
FP blend/shuffle/bitwise-boolean/load/store instructions don't care about NaN. Only instructions that do actual FP math (including CMPPS, MINPS, and so on) raise FP exceptions or can slow down with denormals.
Did I understand correctly that there are 8-bit, 32-bit and 64-bit blendv, but no 16-bit blendv?
Yes. But there are 32-bit and 16-bit arithmetic shifts, so it costs at most one extra instruction to use the 8-bit granularity blend. (There is no PSRAQ, so a blendv of 64-bit integers is best done with BLENDVPD, unless perhaps mask generation is off the critical path and/or the same mask will be reused on the critical path.)
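A hedged sketch of that (my own function name), assuming AVX2 and that only the sign bit of each 16-bit element of mask matters:

```c
#include <immintrin.h>

static inline __m256i blendv_epi16_signbit(__m256i a, __m256i b, __m256i mask)
{
    // VPSRAW: broadcast each word's sign bit to both of its bytes...
    __m256i bytemask = _mm256_srai_epi16(mask, 15);
    // ...so VPBLENDVB effectively selects with 16-bit granularity.
    return _mm256_blendv_epi8(a, b, bytemask);
}
```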
The most common use case is compare masks, where each element is already all-ones or all-zeros, so you could blend with PAND/PANDN => POR. Of course, clever tricks that leave only the sign bit of your mask holding the truth value can save instructions and latency, especially since variable blends are somewhat faster than three bitwise-boolean instructions. (e.g. ORPS two float vectors to see if they're both non-negative, instead of 2x CMPPS and ORing the masks. That can work great if you don't care about negative zero.)
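For reference, a sketch of the all-ones/all-zeros case, where the select can be done with three bitwise ops instead of a variable blend (names are mine):

```c
#include <immintrin.h>

static inline __m256i select_epi32(__m256i a, __m256i b, __m256i fullmask)
{
    // fullmask elements are assumed to be all-ones (take b) or all-zeros
    // (take a), e.g. the result of _mm256_cmpgt_epi32.
    return _mm256_or_si256(_mm256_and_si256(fullmask, b),      // PAND
                           _mm256_andnot_si256(fullmask, a));  // PANDN => POR
}
```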