Shuffle even and odd values in SSE registers

I load two 128-bit SSE registers with 16-bit values. The values are in the following order (leftmost element is the highest word):

    src[0] = [E_3, O_3, E_2, O_2, E_1, O_1, E_0, O_0]
    src[1] = [E_7, O_7, E_6, O_6, E_5, O_5, E_4, O_4]

I want to achieve this order:

    src[0] = [E_7, E_6, E_5, E_4, E_3, E_2, E_1, E_0]
    src[1] = [O_7, O_6, O_5, O_4, O_3, O_2, O_1, O_0]

Do you know a good way to do this (using SSE intrinsics up to SSE 4.2)?

I am stuck at the moment because I cannot shuffle 16-bit values between the upper and lower halves of a 128-bit register; I have only found _mm_shufflelo_epi16 and _mm_shufflehi_epi16.
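For example (a minimal illustration; the values are just the word indices):

    #include <emmintrin.h>  /* SSE2 */

    static __m128i shufflelo_demo(void)
    {
        __m128i v = _mm_set_epi16(7, 6, 5, 4, 3, 2, 1, 0);
        /* Reverses only the low four words; the high four pass through
           unchanged: result = [7, 6, 5, 4, 0, 1, 2, 3] (highest word
           first). No word crosses the 64-bit boundary. */
        return _mm_shufflelo_epi16(v, _MM_SHUFFLE(0, 1, 2, 3));
    }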

Update:

Thanks to Paul, I hit on the idea of using the epi8 intrinsics on the 16-bit values.

My solution is as follows:

    __m128i xtmp[2], xsrc[2];
    /* Within one register, gather the words from even positions into the
       low 64 bits and those from odd positions into the high 64 bits. */
    const __m128i shuffle_split = _mm_set_epi8(15, 14, 11, 10, 7, 6, 3, 2,
                                               13, 12, 9, 8, 5, 4, 1, 0);
    xtmp[0] = _mm_load_si128(src_vec);
    xtmp[1] = _mm_load_si128(src_vec + 1);
    xtmp[0] = _mm_shuffle_epi8(xtmp[0], shuffle_split);
    xtmp[1] = _mm_shuffle_epi8(xtmp[1], shuffle_split);
    /* Interleave the matching halves of the two registers, then regroup. */
    xsrc[0] = _mm_unpacklo_epi16(xtmp[0], xtmp[1]);
    xsrc[0] = _mm_shuffle_epi8(xsrc[0], shuffle_split);
    xsrc[1] = _mm_unpackhi_epi16(xtmp[0], xtmp[1]);
    xsrc[1] = _mm_shuffle_epi8(xsrc[1], shuffle_split);
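For reference, a minimal self-contained harness to sanity-check the sequence (the buffer contents and the names buf/out are just illustrative):

    #include <stdio.h>
    #include <stdint.h>
    #include <tmmintrin.h>  /* SSSE3 for _mm_shuffle_epi8 */

    int main(void)
    {
        /* 16 interleaved words: 100+i at even positions, 200+i at odd. */
        __attribute__((aligned(16))) int16_t buf[16];
        for (int i = 0; i < 8; ++i) {
            buf[2 * i]     = (int16_t)(100 + i);
            buf[2 * i + 1] = (int16_t)(200 + i);
        }
        const __m128i *src_vec = (const __m128i *)buf;

        __m128i xtmp[2], xsrc[2];
        const __m128i shuffle_split = _mm_set_epi8(15, 14, 11, 10, 7, 6, 3, 2,
                                                   13, 12, 9, 8, 5, 4, 1, 0);
        xtmp[0] = _mm_load_si128(src_vec);
        xtmp[1] = _mm_load_si128(src_vec + 1);
        xtmp[0] = _mm_shuffle_epi8(xtmp[0], shuffle_split);
        xtmp[1] = _mm_shuffle_epi8(xtmp[1], shuffle_split);
        xsrc[0] = _mm_unpacklo_epi16(xtmp[0], xtmp[1]);
        xsrc[0] = _mm_shuffle_epi8(xsrc[0], shuffle_split);
        xsrc[1] = _mm_unpackhi_epi16(xtmp[0], xtmp[1]);
        xsrc[1] = _mm_shuffle_epi8(xsrc[1], shuffle_split);

        int16_t out[16];
        _mm_storeu_si128((__m128i *)out, xsrc[0]);
        _mm_storeu_si128((__m128i *)(out + 8), xsrc[1]);
        for (int i = 0; i < 16; ++i)
            printf("%d ", out[i]);
        printf("\n");  /* prints 100..107 then 200..207 */
        return 0;
    }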

Is there an even better solution?

1 answer

Shuffles in SSE are a pain. There are many ways to achieve the same result with different combinations of instructions, and the combinations can differ in instruction count, register pressure, and memory accesses. Rather than wrestle with puzzles like this by hand, I prefer to see what the LLVM compiler does, so I wrote a simple version of your desired permutation in the LLVM intermediate representation, which has an extremely flexible vector-permutation instruction, shufflevector:

    define void @shuffle_even_odd(<8 x i16>* %src0) {
      %src1 = getelementptr <8 x i16>* %src0, i64 1
      %a = load <8 x i16>* %src0, align 16
      %b = load <8 x i16>* %src1, align 16
      %x = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
      %y = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
      store <8 x i16> %x, <8 x i16>* %src0, align 16
      store <8 x i16> %y, <8 x i16>* %src1, align 16
      ret void
    }

Compile this with llc, the LLVM IR-to-assembly compiler (llc shuffle_even_odd.ll -o shuffle_even_odd.s), and you get something like the following x86 assembly:

    movdqa  (%rdi), %xmm0
    movdqa  16(%rdi), %xmm1
    movdqa  %xmm1, %xmm2
    pshufb  LCPI0_0(%rip), %xmm2
    movdqa  %xmm0, %xmm3
    pshufb  LCPI0_1(%rip), %xmm3
    por     %xmm2, %xmm3
    movdqa  %xmm3, (%rdi)
    pshufb  LCPI0_2(%rip), %xmm1
    pshufb  LCPI0_3(%rip), %xmm0
    por     %xmm1, %xmm0
    movdqa  %xmm0, 16(%rdi)

I have omitted the constant data sections referenced by the LCPI0_* labels above, but the assembly corresponds roughly to the following C code:

    void shuffle_even_odd(__m128i * src)
    {
        /* Selector bytes with the high bit set (128 == 0x80) make pshufb
           write a zero into that result byte. */
        __m128i shuffle0 = _mm_setr_epi8(128, 128, 128, 128, 128, 128, 128, 128,
                                         2, 3, 6, 7, 10, 11, 14, 15);
        __m128i shuffle1 = _mm_setr_epi8(2, 3, 6, 7, 10, 11, 14, 15,
                                         128, 128, 128, 128, 128, 128, 128, 128);
        __m128i shuffle2 = _mm_setr_epi8(128, 128, 128, 128, 128, 128, 128, 128,
                                         0, 1, 4, 5, 8, 9, 12, 13);
        __m128i shuffle3 = _mm_setr_epi8(0, 1, 4, 5, 8, 9, 12, 13,
                                         128, 128, 128, 128, 128, 128, 128, 128);
        __m128i a = src[0];
        __m128i b = src[1];
        src[0] = _mm_or_si128(_mm_shuffle_epi8(b, shuffle0), _mm_shuffle_epi8(a, shuffle1));
        src[1] = _mm_or_si128(_mm_shuffle_epi8(b, shuffle2), _mm_shuffle_epi8(a, shuffle3));
    }

That is a total of 4 shuffle and 2 bitwise-OR instructions. I suspect those ORs can be scheduled more efficiently in the CPU pipeline than the unpack instructions in your proposed sequence.
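If you need to stay below SSSE3 (pshufb) altogether, the two word shuffles from the question can be combined with _mm_shuffle_epi32, which does move 32-bit lanes across the 64-bit halves, plus a final 64-bit unpack. This is a sketch of my own, not part of the original answer; it assumes the memory layout from the question:

    #include <emmintrin.h>  /* SSE2 only */

    /* Splits two interleaved registers in place: afterwards src[0] holds the
       words from odd positions (E_0..E_7 in the question's notation) and
       src[1] the words from even positions (O_0..O_7). */
    void shuffle_even_odd_sse2(__m128i *src)
    {
        /* Within each 64-bit half, reorder words (0,1,2,3) -> (0,2,1,3). */
        __m128i a = _mm_shufflelo_epi16(src[0], _MM_SHUFFLE(3, 1, 2, 0));
        a = _mm_shufflehi_epi16(a, _MM_SHUFFLE(3, 1, 2, 0));
        /* Regroup the 32-bit pairs: low half <- even-position words,
           high half <- odd-position words. */
        a = _mm_shuffle_epi32(a, _MM_SHUFFLE(3, 1, 2, 0));

        __m128i b = _mm_shufflelo_epi16(src[1], _MM_SHUFFLE(3, 1, 2, 0));
        b = _mm_shufflehi_epi16(b, _MM_SHUFFLE(3, 1, 2, 0));
        b = _mm_shuffle_epi32(b, _MM_SHUFFLE(3, 1, 2, 0));

        /* Merge the matching 64-bit halves of the two registers. */
        src[0] = _mm_unpackhi_epi64(a, b);  /* odd-position words  */
        src[1] = _mm_unpacklo_epi64(a, b);  /* even-position words */
    }

That is 6 shuffles plus 2 unpacks, with no shuffle constants to load; whether it beats the pshufb version depends on the target, so it is worth benchmarking both.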

You can find the "llc" compiler in the "Clang Binaries" package from the LLVM download page: http://www.llvm.org/releases/download.html


Source: https://habr.com/ru/post/1270333/
