How to vblend 32-bit integers? or: Why is there no _mm256_blendv_epi32?

I am using the AVX2 x86 256-bit SIMD extensions. I want to do a component-wise if-then-else on 32-bit integers. In the Intel docs, this kind of instruction is called vblend.

Intel's intrinsics guide lists the _mm256_blendv_epi8 function, which does almost what I need. The only problem is that it works on 8-bit integers, and unfortunately there is no _mm256_blendv_epi32 in the docs. My first question: why does this intrinsic not exist? My second question: how can I emulate it?

After some searching, I found _mm256_blendv_ps, which does what I want for 32-bit floats. I also found the casting intrinsics _mm256_castsi256_ps and _mm256_castps_si256, which cast from integers to 32-bit floats and back. Putting these together gives:

inline __m256i _mm256_blendv_epi32(__m256i a, __m256i b, __m256i mask) {
    return _mm256_castps_si256(
        _mm256_blendv_ps(_mm256_castsi256_ps(a),
                         _mm256_castsi256_ps(b),
                         _mm256_castsi256_ps(mask)));
}

At first glance this looks like 5 intrinsics, but 4 of them are just glorified casts, and one maps directly to a processor instruction. So the whole function boils down to a single processor instruction.

So the only really inconvenient part is that a 32-bit blendv instruction does exist, but there is no corresponding intrinsic for it.

Are there any edge cases where this fails? For example, what happens when the integer bit pattern happens to be a floating-point NaN? Does blendv simply ignore that, or does it raise a signal?

In case this works: am I correct that there is an 8-bit, 32-bit, and 64-bit blendv, but no 16-bit blendv?

1 answer

Your code relies on blendv checking only the most significant bit of each element.

You have two good options:

  • Broadcast the high bit within each element using an arithmetic right shift by 31, i.e. VPSRAD: mask = _mm256_srai_epi32(mask, 31), to set up for VPBLENDVB ( _mm256_blendv_epi8 ).

    VPSRAD is 1 uop on Intel Haswell, for port 0 only. (More throughput on Skylake: p01.) If your algorithm bottlenecks on port 0 (e.g. integer multiply and shift), this is not great.

  • Use VBLENDVPS. You are right that all the casts are just to keep the compiler happy, and that VBLENDVPS will do exactly what you want in a single instruction.

    However, Intel SnB-family CPUs have a bypass delay of 1 cycle when forwarding integer results to the FP blend unit, and another 1 cycle when forwarding the blend result to other integer instructions.

For more information on bypass delays, see Agner Fog's microarchitecture guide. This is the reason they don't make __m256i intrinsics for FP instructions, and vice versa. Note that since Sandybridge, FP shuffles don't have extra latency forwarding from/to instructions like PADDD. So SHUFPS is a great way to combine data from two integer vectors if PUNPCK* or PALIGNR don't do exactly what you want. (SHUFPS on integer data can be worth it even on Nehalem, where it has a 2-cycle penalty in each direction.)

Try it both ways and benchmark. Either one could be better, depending on the surrounding code.

Latency may not matter compared to uop throughput / instruction count. Also note that if you just store the result to memory, store instructions don't care which domain the data was coming from.

But if you use this as part of a long dependency chain, then it might be worth an extra instruction to avoid the extra 2 cycles of latency for the data being blended.

Note that if mask generation is on the critical path, then VPSRAD's 1 cycle of latency is equivalent to the bypass delay, so using an FP blend is only 1 extra cycle of latency for the mask->result chain, versus 2 extra cycles for the data->result chain.


For example, what happens when the integer bit pattern is a floating-point NaN?

BLENDVPS doesn't care. Intel's insn ref manual fully documents everything an instruction can/can't do, and "SIMD Floating-Point Exceptions: None" means this isn't a problem. See also the x86 tag wiki for links to docs.

FP blend/shuffle/bitwise-boolean/load/store instructions don't care about NaN. Only instructions that do actual FP math (including CMPPS, MINPS, and so on) raise FP exceptions or can slow down with denormals.


Am I correct that there is an 8-bit, 32-bit and 64-bit blendv, but no 16-bit blendv?

Yes. But there are 32-bit and 16-bit arithmetic shifts, so it costs at most one extra instruction to use the 8-bit-granularity blend. (There is no PSRAQ, so a blendv of 64-bit integers is best done with BLENDVPD, unless perhaps mask generation is off the critical path and/or the same mask will be reused many times on the critical path.)

The most common use case is for compare masks, where each element is already all-ones or all-zeros, so you could blend with PAND/PANDN => POR. Of course, clever tricks that leave only the sign bit of your mask holding the truth value can save instructions and latency, especially since variable blends are somewhat faster than three bitwise boolean instructions. (e.g. ORPS two float vectors to see if they're both non-negative, instead of 2x CMPPS and ORing the masks. This can work great if you don't care about negative zero.)


Source: https://habr.com/ru/post/1270331/
