There is no single-instruction inverse until AVX512F: __m128i _mm256_cvtepi32_epi16(__m256i a) (VPMOVDW), which is also available for 512 -> 256 or 128 -> low_half_of_128. (Versions with inputs smaller than a 512-bit ZMM register also require AVX512VL, so only Skylake-X, not Xeon Phi KNL.)
There are signed / unsigned saturation versions of this AVX512 instruction, but only AVX512 has a pack instruction that truncates (discarding the upper bytes of each element) instead of saturating.
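As a minimal sketch of the AVX512 route, the function below truncates eight 32-bit elements with one VPMOVDW. The function name and the pointer-based interface are hypothetical, chosen for illustration. The `target` attribute assumes GCC or Clang; with other compilers, drop it and enable AVX512F + AVX512VL through compiler options instead.

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical helper: truncate 8 x int32 to 8 x int16 (needs AVX512F + AVX512VL).
// Assumption: GCC/Clang, so the target attribute can enable the ISA per function.
__attribute__((target("avx512f,avx512vl")))
void trunc8_epi32_to_epi16_avx512(const int32_t *src, int16_t *dst)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)src);
    __m128i t = _mm256_cvtepi32_epi16(v);   // VPMOVDW: keep the low 16 bits of each element
    _mm_storeu_si128((__m128i *)dst, t);
}
```

Only call this after checking for AVX512VL at run time (for example with `__builtin_cpu_supports("avx512vl")` on GCC/Clang), since the instruction faults on CPUs without it.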
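Or with AVX512BW, you can emulate a lane-crossing 2-input pack using vpermi2w to produce a 512-bit result from two 512-bit input vectors. On Skylake-AVX512 it decodes to multiple shuffle uops, but so does VPMOVDW, which is also a lane-crossing shuffle with granularity smaller than a dword (32-bit). http://instlatx64.atw.hu/ has SKX uops/ports data.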
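With SSE2/AVX2, the pack instructions such as _mm256_packus_epi32 (vpackusdw) do signed or unsigned saturation, and they operate within each 128-bit lane. This is unlike the lane-crossing behaviour of vpmovzxwd.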
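You can use _mm256_and_si256 to clear the high bytes before packing, though. That is probably best if you have multiple input vectors, because the pack instructions (packs_epi32 / packus_epi32) take 2 input vectors and produce one 256-bit output.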
a = H G F E | D C B A 32-bit signed elements, shown from high element to low element, low 128-bit lane on the right
b = P O N M | L K J I
_mm256_packus_epi32(a, b) 16-bit unsigned elements
P O N M H G F E | L K J I D C B A
elements from first operand go to the low half of each lane
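If you can make efficient use of 2x vpand / vpackusdw ymm / vpermq ymm to produce a 256-bit vector with all the elements in the right order, that is probably best on Intel CPUs: only 2 shuffle uops (4 total uops) per 256-bit vector of results, and you get them in a single vector.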
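The mask / pack / permute sequence can be sketched as follows. The function name and pointer interface are hypothetical, and the `target` attribute assumes GCC or Clang (otherwise compile with AVX2 enabled). After masking, every element fits in 0..0xFFFF, so the unsigned-saturating pack never actually saturates.

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical helper: truncate 16 x int32 (two 256-bit vectors) to 16 x uint16,
// keeping the elements in source order.
__attribute__((target("avx2")))
void trunc16_epi32_to_epi16_pack(const int32_t *src, uint16_t *dst)
{
    const __m256i low16 = _mm256_set1_epi32(0x0000FFFF);
    __m256i a = _mm256_loadu_si256((const __m256i *)src);
    __m256i b = _mm256_loadu_si256((const __m256i *)(src + 8));

    a = _mm256_and_si256(a, low16);              // vpand: clear the upper 16 bits
    b = _mm256_and_si256(b, low16);

    // vpackusdw works per 128-bit lane; qwords (low to high) are:
    //   a[0..3], b[0..3] | a[4..7], b[4..7]
    __m256i packed = _mm256_packus_epi32(a, b);

    // vpermq: reorder qwords 0,2,1,3 to get a[0..7], b[0..7]
    __m256i ordered = _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));
    _mm256_storeu_si256((__m256i *)dst, ordered);
}
```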
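Alternatively, use SSSE3/AVX2 vpshufb (_mm256_shuffle_epi8) to extract the bytes you want from a single input and zero the other half of each 128-bit lane (by setting the sign bit in the shuffle-control byte for those elements). Then use AVX2 vpermq to shuffle the data from the two lanes into the low 128 bits.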
__m256i trunc_elements = _mm256_shuffle_epi8(res256, shuffle_mask_32_to_16);
__m256i ordered = _mm256_permute4x64_epi64(trunc_elements, 0x58);
__m128i result = _mm256_castsi256_si128(ordered); // no asm instructions
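The snippet above leaves shuffle_mask_32_to_16 undefined. One possible definition, wrapped in a complete function, might look like the sketch below. The function name and pointer interface are hypothetical, and the `target` attribute assumes GCC or Clang.

```c
#include <immintrin.h>
#include <stdint.h>

// Hypothetical helper: truncate 8 x int32 to 8 x uint16 with vpshufb + vpermq.
__attribute__((target("avx2")))
void trunc8_epi32_to_epi16_shuffle(const int32_t *src, uint16_t *dst)
{
    // Per 128-bit lane: gather the low 2 bytes of each dword into the low 8 bytes.
    // A control byte of -1 has its sign bit set, which zeroes that output byte.
    const __m256i shuffle_mask_32_to_16 = _mm256_setr_epi8(
        0, 1, 4, 5, 8, 9, 12, 13, -1, -1, -1, -1, -1, -1, -1, -1,
        0, 1, 4, 5, 8, 9, 12, 13, -1, -1, -1, -1, -1, -1, -1, -1);

    __m256i res256 = _mm256_loadu_si256((const __m256i *)src);
    __m256i trunc_elements = _mm256_shuffle_epi8(res256, shuffle_mask_32_to_16);
    // 0x58 selects qword 0 then qword 2 for the low 128 bits (upper half is unused).
    __m256i ordered = _mm256_permute4x64_epi64(trunc_elements, 0x58);
    __m128i result = _mm256_castsi256_si128(ordered);   // no asm instructions
    _mm_storeu_si128((__m128i *)dst, result);
}
```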
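So this is 2 uops per 128 bits of results, but both uops are shuffles that run only on port 5 on mainstream Intel CPUs that support AVX2. That is fine as part of a loop that does enough other work to keep port0/port1 busy, or if you need to process each 128-bit chunk separately anyway.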
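On Ryzen/Excavator, lane-crossing vpermq is expensive (they split 256-bit instructions into multiple 128-bit uops and have no real lane-crossing shuffle unit: http://agner.org/optimize/). There you would want vextracti128 / vpor to combine the halves. Or maybe vpunpcklqdq, so you can load the same shuffle mask with set1_epi64 instead of needing a full 256-bit vector constant to shuffle the upper lane's elements into the upper 64 bits of that lane.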
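Both ways of combining the halves without vpermq can be sketched as below. The function names and pointer interface are hypothetical, and the `target` attribute assumes GCC or Clang. The intrinsic for a broadcast 64-bit constant is `_mm256_set1_epi64x`.

```c
#include <immintrin.h>
#include <stdint.h>

// Variant 1 (hypothetical name): vpshufb, then vextracti128 + vpor.
// The upper lane's mask places its results in the upper 64 bits of that lane.
__attribute__((target("avx2")))
void trunc8_epi32_to_epi16_por(const int32_t *src, uint16_t *dst)
{
    const __m256i mask = _mm256_setr_epi8(
        0, 1, 4, 5, 8, 9, 12, 13, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 4, 5, 8, 9, 12, 13);
    __m256i t = _mm256_shuffle_epi8(_mm256_loadu_si256((const __m256i *)src), mask);
    __m128i lo = _mm256_castsi256_si128(t);        // no asm instructions
    __m128i hi = _mm256_extracti128_si256(t, 1);   // vextracti128
    _mm_storeu_si128((__m128i *)dst, _mm_or_si128(lo, hi));
}

// Variant 2 (hypothetical name): same 8-byte mask in every qword, then vpunpcklqdq.
// The upper 8 bytes of each lane hold a duplicate that the unpack ignores.
__attribute__((target("avx2")))
void trunc8_epi32_to_epi16_unpack(const int32_t *src, uint16_t *dst)
{
    const __m256i mask = _mm256_set1_epi64x(0x0D0C090805040100LL);
    __m256i t = _mm256_shuffle_epi8(_mm256_loadu_si256((const __m256i *)src), mask);
    __m128i lo = _mm256_castsi256_si128(t);
    __m128i hi = _mm256_extracti128_si256(t, 1);
    _mm_storeu_si128((__m128i *)dst, _mm_unpacklo_epi64(lo, hi));  // vpunpcklqdq
}
```

Which variant is faster on a given AMD core is something to measure rather than assume.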