Shuffle __m256i vector elements

I want to shuffle the elements of a __m256i vector. There is an intrinsic, _mm256_shuffle_epi8, that does something similar, but it does not shuffle across 128-bit lanes.

How can I do this using AVX2 instructions?

2 answers

There is a way to emulate this operation, but it is not very pretty:

    #include <immintrin.h>

    // Biases added to the 0..31 byte indices before each in-lane shuffle.
    // K0: pass 1 keeps only bytes whose source is in the same lane as the destination.
    const __m256i K0 = _mm256_setr_epi8(
        0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70,
        0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70,
        0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0,
        0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0);

    // K1: pass 2 runs on the lane-swapped input and keeps only bytes whose source is in the other lane.
    const __m256i K1 = _mm256_setr_epi8(
        0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0,
        0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0, 0xF0,
        0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70,
        0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70, 0x70);

    inline __m256i Shuffle(const __m256i & value, const __m256i & shuffle)
    {
        // 0x4E swaps the two 128-bit halves of value.
        return _mm256_or_si256(
            _mm256_shuffle_epi8(value, _mm256_add_epi8(shuffle, K0)),
            _mm256_shuffle_epi8(_mm256_permute4x64_epi64(value, 0x4E),
                                _mm256_add_epi8(shuffle, K1)));
    }
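To see why the 0x70 and 0xF0 biases work, here is a scalar sketch (plain C++, not AVX2 code) that models each `_mm256_shuffle_epi8` byte by byte and applies the same two-pass scheme. The names `Bytes32`, `pshufb256_model` and `shuffle_model` are hypothetical and exist only for this illustration, and the indices are assumed to lie in 0..31.

```cpp
#include <array>
#include <cstdint>

using Bytes32 = std::array<std::uint8_t, 32>;

// Scalar model of _mm256_shuffle_epi8: each 128-bit lane is shuffled on its own.
// A control byte with bit 7 set yields 0; otherwise bits 0-3 pick a byte
// from the same lane as the destination byte.
Bytes32 pshufb256_model(const Bytes32& src, const Bytes32& ctrl) {
    Bytes32 out{};
    for (int i = 0; i < 32; ++i) {
        const int lane_base = i & 16;  // 0 for the low lane, 16 for the high lane
        out[i] = (ctrl[i] & 0x80) ? 0 : src[lane_base + (ctrl[i] & 0x0F)];
    }
    return out;
}

// Scalar model of the two-pass trick: bias the indices so that each pass keeps
// only the bytes it can reach and zeroes the rest, then OR the two passes.
Bytes32 shuffle_model(const Bytes32& value, const Bytes32& idx) {
    Bytes32 swapped{}, c0{}, c1{};
    for (int i = 0; i < 32; ++i) {
        swapped[i] = value[i ^ 16];  // effect of _mm256_permute4x64_epi64(value, 0x4E)
        const bool high = i >= 16;
        c0[i] = static_cast<std::uint8_t>(idx[i] + (high ? 0xF0 : 0x70));  // idx + K0
        c1[i] = static_cast<std::uint8_t>(idx[i] + (high ? 0x70 : 0xF0));  // idx + K1
    }
    const Bytes32 a = pshufb256_model(value, c0);
    const Bytes32 b = pshufb256_model(swapped, c1);
    Bytes32 out{};
    for (int i = 0; i < 32; ++i) out[i] = a[i] | b[i];
    return out;
}
```

Adding 0x70 leaves bit 7 clear for indices 0..15 and sets it for 16..31, while adding 0xF0 sets bit 7 for 0..15 and wraps 16..31 around to 0..15. For every destination byte exactly one of the two passes produces the selected byte and the other produces zero, so the OR yields value[idx[i]].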

First, an explanation: per the Intel specification, VPSHUFB takes the source index from bits 0-3 of each byte of the shuffle pattern, and writes zero to the destination byte when bit 7 of that pattern byte is set. Since you want to shuffle across lanes, your shuffle pattern also uses bit 4 to address bytes at indices above 15 in the YMM register.
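As a small illustration of that index layout, the sketch below splits a cross-lane byte index into a lane number (bit 4) and an offset within the lane (bits 0-3). `ByteIndex` and `split_index` are hypothetical names used only here, and the index is assumed to lie in 0..31.

```cpp
#include <cstdint>

// Position of a byte inside a 256-bit register, seen as two 128-bit lanes.
struct ByteIndex {
    int lane;    // 0 = lower 128 bits, 1 = upper 128 bits
    int offset;  // byte position inside that lane, 0..15
};

// Split a 0..31 byte index: bit 4 selects the lane, bits 0-3 the byte in it.
ByteIndex split_index(std::uint8_t idx) {
    return ByteIndex{(idx >> 4) & 1, idx & 0x0F};
}
```

For example, index 21 (binary 10101) refers to byte 5 of the upper lane, which an in-lane shuffle cannot reach from the lower half of the destination.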

Assumptions: the data you want to shuffle is in YMM0, and the shuffle pattern (byte indices 0-31) is in YMM1.

The code is as follows:

    mask_pattern_0 db 0FH
    mask_pattern_1 db 10H

    vpbroadcastb ymm2, byte ptr mask_pattern_0  ; YMM2 = 0FH in every byte
    vmovdqu      ymm5, ymm2                     ; Keep a copy of the 0FH mask in YMM5
    vpcmpgtb     ymm3, ymm1, ymm2               ; YMM3 = FFH where the index is above 15, 00H elsewhere
    vpor         ymm4, ymm1, ymm3               ; YMM4 = shuffle pattern with bit 7 set wherever the index is above 15
    vperm2i128   ymm2, ymm0, ymm0, 00010001b    ; Copy the upper 128 bits of YMM0 to both halves of YMM2
    vperm2i128   ymm0, ymm0, ymm0, 00100000b    ; Replicate the lower 128 bits of YMM0 into its upper 128 bits
    vpshufb      ymm0, ymm0, ymm4               ; Place all bytes with index below 16, and zero the other bytes
    ; We now process the entries in the shuffle pattern with index above 15
    vpsubb       ymm3, ymm1, ymm5               ; Index minus 0FH: positive only for indices above 15
    vpsignb      ymm4, ymm1, ymm3               ; Negate indices below 15, zero index 15, keep indices above 15
    vpbroadcastb ymm5, byte ptr mask_pattern_1  ; Load the mask value 10H
    vpsubb       ymm4, ymm4, ymm5               ; Indices above 15 become 0-15; all others end up with bit 7 set
    vpshufb      ymm2, ymm2, ymm4               ; Place all bytes with index above 15, and zero the other bytes
    vpaddb       ymm0, ymm0, ymm2               ; Combine the two partial results

This also leaves the shuffle pattern in YMM1 unmodified, just as the VPSHUFB instruction does.

Hope this helps.


Source: https://habr.com/ru/post/988587/

