What is the inverse of "_mm256_cvtepi16_epi32"

Question

What is the inverse of "_mm256_cvtepi16_epi32"

I need an AVX2 (or earlier) intrinsic that converts an 8-wide vector of 32-bit integers (256 bits total) into an 8-wide vector of 16-bit integers (128 bits total), discarding the upper 16 bits of each element. This should be the inverse of "_mm256_cvtepi16_epi32". If there is no direct instruction, what is the best way to do this with a sequence of instructions?

+4

x86 avx avx2 g++ intrinsics

Steve burns Apr 08 '18 at 19:19

1 answer

Peter Cordes · Accepted Answer · 2018-04-08T23:20:15+0000

There is no single-instruction inverse until AVX512F, which provides __m128i _mm256_cvtepi32_epi16(__m256i a) (VPMOVDW). It is also available for 512 -> 256 or 128 -> low_half_of_128. (Versions with inputs smaller than a 512-bit ZMM register also require AVX512VL, so only Skylake-X, not Xeon Phi KNL.)

There are signed/unsigned saturating versions of that AVX512 instruction, but only AVX512 has a pack instruction that truncates (discarding the upper bytes of each element) instead of saturating.
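As a minimal sketch of the AVX-512 route, the code below wraps _mm256_cvtepi32_epi16 behind a runtime CPU check. The function names trunc8_avx512 and trunc8_dispatch are hypothetical, and the target attribute and __builtin_cpu_supports are GCC/clang extensions, assumed here only so the example builds without extra compiler flags.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: requires AVX512F + AVX512VL at run time. */
__attribute__((target("avx512f,avx512vl")))
static void trunc8_avx512(const int32_t src[8], uint16_t dst[8])
{
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    __m128i r = _mm256_cvtepi32_epi16(v);  /* vpmovdw: truncates, no saturation */
    _mm_storeu_si128((__m128i *)dst, r);
}

/* Hypothetical wrapper: scalar fallback on CPUs without AVX-512. */
void trunc8_dispatch(const int32_t src[8], uint16_t dst[8])
{
    if (__builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512vl")) {
        trunc8_avx512(src, dst);
    } else {
        for (int i = 0; i < 8; i++)
            dst[i] = (uint16_t)src[i];     /* keep the low 16 bits */
    }
}
```

Both paths keep only the low 16 bits of each element, so 0x12345678 becomes 0x5678 and -1 becomes 0xFFFF.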

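Alternatively, with AVX512BW you can emulate a 2-input lane-crossing pack instruction using vpermi2w, producing a 512-bit result from two 512-bit input vectors. On Skylake-AVX512 it decodes to multiple shuffle uops, but so does VPMOVDW, which is also a lane-crossing shuffle with granularity smaller than a dword (32 bits). http://instlatx64.atw.hu/ lists SKX uops/ports.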
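A rough sketch of the two-input word permute, using _mm512_permutex2var_epi16 (which the compiler may emit as vpermi2w or vpermt2w). The names trunc32_avx512bw, trunc32_dispatch and even_words are hypothetical. The index vector selects every even-numbered 16-bit word: bit 5 of each index picks the second source, so index 2*i covers both inputs.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: 32 x int32 -> 32 x uint16, requires AVX512BW. */
__attribute__((target("avx512f,avx512bw")))
static void trunc32_avx512bw(const int32_t src[32], uint16_t dst[32])
{
    /* Word index 2*i: 0..30 select from a, 32..62 select from b. */
    static const uint16_t even_words[32] = {
         0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30,
        32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62
    };
    __m512i a   = _mm512_loadu_si512(src);
    __m512i b   = _mm512_loadu_si512(src + 16);
    __m512i idx = _mm512_loadu_si512(even_words);
    __m512i r   = _mm512_permutex2var_epi16(a, idx, b);
    _mm512_storeu_si512(dst, r);
}

/* Hypothetical wrapper: scalar fallback on CPUs without AVX512BW. */
void trunc32_dispatch(const int32_t src[32], uint16_t dst[32])
{
    if (__builtin_cpu_supports("avx512f") && __builtin_cpu_supports("avx512bw")) {
        trunc32_avx512bw(src, dst);
    } else {
        for (int i = 0; i < 32; i++)
            dst[i] = (uint16_t)src[i];
    }
}
```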

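The SSE2/AVX2 pack instructions such as _mm256_packus_epi32 (vpackusdw) saturate (signed or unsigned) and operate within each 128-bit lane. This is unlike the lane-crossing behavior of vpmovzxwd.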

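You can use _mm256_and_si256 to clear the upper bytes before packing, though. That is probably best if you have multiple input vectors, because _mm256_packus_epi32 takes 2 input vectors and produces one 256-bit output.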

a = H G F E | D C B A    32-bit signed elements, shown from high element to low element, low 128-bit lane on the right
b = P O N M | L K J I

_mm256_packus_epi32(a, b)   16-bit unsigned elements
    P O N M H G F E  |  L K J I D C B A
      elements from first operand go to the low half of each lane

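2x vpand / vpackusdw ymm / vpermq ymm, producing a 256-bit vector with all the elements in the right order, is probably best on Intel CPUs. That is only 2 shuffle uops (4 total uops) per 256-bit vector of results, and you get a single wide result.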
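The mask / pack / permute sequence can be sketched as follows. The names trunc16_avx2_pack and trunc16_dispatch are hypothetical, and the runtime check is an assumption added so the example degrades to scalar code on CPUs without AVX2. The final permute swaps the two middle 64-bit chunks to undo the in-lane behavior of the pack.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: 16 x int32 -> 16 x uint16 in source order. */
__attribute__((target("avx2")))
static void trunc16_avx2_pack(const int32_t src[16], uint16_t dst[16])
{
    const __m256i low16 = _mm256_set1_epi32(0x0000FFFF);
    __m256i a = _mm256_loadu_si256((const __m256i *)src);
    __m256i b = _mm256_loadu_si256((const __m256i *)(src + 8));

    /* Clear the upper 16 bits so unsigned saturation never triggers. */
    a = _mm256_and_si256(a, low16);
    b = _mm256_and_si256(b, low16);

    /* In-lane pack: 64-bit chunks are now a_lo, b_lo, a_hi, b_hi. */
    __m256i packed = _mm256_packus_epi32(a, b);

    /* Reorder chunks to a_lo, a_hi, b_lo, b_hi (immediate 0xD8). */
    __m256i ordered = _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));
    _mm256_storeu_si256((__m256i *)dst, ordered);
}

/* Hypothetical wrapper: scalar fallback on CPUs without AVX2. */
void trunc16_dispatch(const int32_t src[16], uint16_t dst[16])
{
    if (__builtin_cpu_supports("avx2")) {
        trunc16_avx2_pack(src, dst);
    } else {
        for (int i = 0; i < 16; i++)
            dst[i] = (uint16_t)src[i];
    }
}
```

Without the AND step, any element outside 0..65535 would be clamped by the pack instead of truncated.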

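Alternatively, use SSSE3/AVX2 vpshufb (_mm256_shuffle_epi8) to extract the bytes you want from a single input and zero the other half of each 128-bit lane (by setting the sign bit of the shuffle-control byte for those elements). Then use AVX2 vpermq to combine the data from the two lanes into the low 128 bits.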

__m256i trunc_elements = _mm256_shuffle_epi8(res256, shuffle_mask_32_to_16);
__m256i ordered = _mm256_permute4x64_epi64(trunc_elements, 0x58);
__m128i result  = _mm256_castsi256_si128(ordered);   // no asm instructions
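The snippet above leaves the shuffle control vector undefined. One possible definition, as a sketch: within each 128-bit lane, pick bytes 0,1,4,5,8,9,12,13 (the low halves of the four dwords) and use -1 (sign bit set) to zero the rest. The function names trunc8_avx2_pshufb and trunc8_pshufb_dispatch are hypothetical, and the runtime check is an added assumption.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: 8 x int32 -> 8 x uint16 from a single input. */
__attribute__((target("avx2")))
static void trunc8_avx2_pshufb(const int32_t src[8], uint16_t dst[8])
{
    /* Per lane: low 2 bytes of each dword, then 8 zeroed bytes. */
    const __m256i shuffle_mask_32_to_16 = _mm256_setr_epi8(
        0, 1, 4, 5, 8, 9, 12, 13, -1, -1, -1, -1, -1, -1, -1, -1,
        0, 1, 4, 5, 8, 9, 12, 13, -1, -1, -1, -1, -1, -1, -1, -1);

    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    __m256i trunc_elements = _mm256_shuffle_epi8(v, shuffle_mask_32_to_16);

    /* 0x58 puts 64-bit chunks 0 and 2 into the low 128 bits. */
    __m256i ordered = _mm256_permute4x64_epi64(trunc_elements, 0x58);
    _mm_storeu_si128((__m128i *)dst, _mm256_castsi256_si128(ordered));
}

/* Hypothetical wrapper: scalar fallback on CPUs without AVX2. */
void trunc8_pshufb_dispatch(const int32_t src[8], uint16_t dst[8])
{
    if (__builtin_cpu_supports("avx2")) {
        trunc8_avx2_pshufb(src, dst);
    } else {
        for (int i = 0; i < 8; i++)
            dst[i] = (uint16_t)src[i];
    }
}
```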

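Overall this costs 2 shuffles per 128 bits of results, which bottlenecks on port 5 on Intel CPUs that support AVX2. That may be acceptable if the surrounding code mostly keeps port0/port1 busy and you only need 128-bit results.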

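On Ryzen/Excavator, vpermq is expensive (256-bit instructions are split into multiple 128-bit uops there: http://agner.org/optimize/). On those CPUs you would want vextracti128 / vpor to combine the two lanes instead. Or possibly vpunpcklqdq, so that the same shuffle mask can be loaded with set1_epi64 instead of needing a full 256-bit vector constant to shuffle the upper lane's elements into the upper 64 bits of that lane.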
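A sketch of the vextracti128 / vpunpcklqdq variant, which avoids vpermq. The same 8-byte shuffle pattern is broadcast with _mm256_set1_epi64x, so each lane holds its four truncated words in both 64-bit halves. The names trunc8_avx2_unpack and trunc8_unpack_dispatch are hypothetical; no performance claim is made beyond what the text above states.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: 8 x int32 -> 8 x uint16 without a lane-crossing permute. */
__attribute__((target("avx2")))
static void trunc8_avx2_unpack(const int32_t src[8], uint16_t dst[8])
{
    /* Bytes 0,1,4,5,8,9,12,13 (little-endian), repeated in every qword. */
    const __m256i mask = _mm256_set1_epi64x(0x0D0C090805040100LL);

    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    __m256i t = _mm256_shuffle_epi8(v, mask);

    __m128i lo = _mm256_castsi256_si128(t);       /* elements 0..3 in low qword */
    __m128i hi = _mm256_extracti128_si256(t, 1);  /* elements 4..7 in low qword */
    __m128i r  = _mm_unpacklo_epi64(lo, hi);      /* vpunpcklqdq */
    _mm_storeu_si128((__m128i *)dst, r);
}

/* Hypothetical wrapper: scalar fallback on CPUs without AVX2. */
void trunc8_unpack_dispatch(const int32_t src[8], uint16_t dst[8])
{
    if (__builtin_cpu_supports("avx2")) {
        trunc8_avx2_unpack(src, dst);
    } else {
        for (int i = 0; i < 8; i++)
            dst[i] = (uint16_t)src[i];
    }
}
```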
