Permutations in SSE are not easy. There are many ways to achieve the same result with different combinations of instructions, and each combination may need a different number of instructions, registers, or memory accesses. Rather than solve puzzles like this by hand, I prefer to see what the LLVM compiler does, so I wrote a simple version of your desired permutation in LLVM's intermediate language, which has an extremely flexible vector shuffle instruction:

    define void @shuffle_even_odd(<8 x i16>* %src0) {
      %src1 = getelementptr <8 x i16>* %src0, i64 1
      %a = load <8 x i16>* %src0, align 16
      %b = load <8 x i16>* %src1, align 16
      %x = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
      %y = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
      store <8 x i16> %x, <8 x i16>* %src0, align 16
      store <8 x i16> %y, <8 x i16>* %src1, align 16
      ret void
    }

Compile this with llc, the LLVM IR-to-assembly compiler:

    llc shuffle_even_odd.ll -o shuffle_even_odd.s

and you get something like the following x86 assembly:

    movdqa  (%rdi), %xmm0
    movdqa  16(%rdi), %xmm1
    movdqa  %xmm1, %xmm2
    pshufb  LCPI0_0(%rip), %xmm2
    movdqa  %xmm0, %xmm3
    pshufb  LCPI0_1(%rip), %xmm3
    por     %xmm2, %xmm3
    movdqa  %xmm3, (%rdi)
    pshufb  LCPI0_2(%rip), %xmm1
    pshufb  LCPI0_3(%rip), %xmm0
    por     %xmm1, %xmm0
    movdqa  %xmm0, 16(%rdi)

I left out the constant data sections referenced by LCPI0_* above, but this translates roughly to the following C code:
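
    #include <tmmintrin.h>  /* SSSE3, needed for _mm_shuffle_epi8 (pshufb) */

    void shuffle_even_odd(__m128i *src) {
        /* A mask byte of 128 (high bit set) makes pshufb write zero to that
           byte; any other value is the index of the source byte to copy. */
        __m128i shuffle0 = _mm_setr_epi8(128, 128, 128, 128, 128, 128, 128, 128,
                                         2, 3, 6, 7, 10, 11, 14, 15);
        __m128i shuffle1 = _mm_setr_epi8(2, 3, 6, 7, 10, 11, 14, 15,
                                         128, 128, 128, 128, 128, 128, 128, 128);
        __m128i shuffle2 = _mm_setr_epi8(128, 128, 128, 128, 128, 128, 128, 128,
                                         0, 1, 4, 5, 8, 9, 12, 13);
        __m128i shuffle3 = _mm_setr_epi8(0, 1, 4, 5, 8, 9, 12, 13,
                                         128, 128, 128, 128, 128, 128, 128, 128);
        __m128i a = src[0];
        __m128i b = src[1];
        /* Odd-indexed 16-bit elements: those of a in the low half, those of b in the high half. */
        src[0] = _mm_or_si128(_mm_shuffle_epi8(b, shuffle0),
                              _mm_shuffle_epi8(a, shuffle1));
        /* Even-indexed 16-bit elements, laid out the same way. */
        src[1] = _mm_or_si128(_mm_shuffle_epi8(b, shuffle2),
                              _mm_shuffle_epi8(a, shuffle3));
    }
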
That is a total of 4 shuffle and 2 bitwise-OR instructions. I suspect that these bitwise instructions may be scheduled more efficiently in the CPU pipeline than the proposed unpack instructions.
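If you want to convince yourself that the four masks really produce the permutation, a scalar model is easier to step through than the intrinsics. The following is only a sketch in portable C, not the real instruction: `pshufb_model` and `shuffle_even_odd_model` are hypothetical names introduced here for illustration, and the model assumes the usual byte-shuffle rule (a mask byte with its high bit set produces zero, otherwise the low four bits select a source byte).

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of a 16-byte pshufb: a mask byte with the high bit set
   yields 0, otherwise its low four bits index into src. */
void pshufb_model(const uint8_t src[16], const uint8_t mask[16], uint8_t dst[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
}

/* Hypothetical helper with the same data flow as shuffle_even_odd, on plain
   arrays: v[0..7] plays the role of src[0], v[8..15] the role of src[1]. */
void shuffle_even_odd_model(uint16_t v[16])
{
    static const uint8_t shuffle0[16] = {128, 128, 128, 128, 128, 128, 128, 128,
                                         2, 3, 6, 7, 10, 11, 14, 15};
    static const uint8_t shuffle1[16] = {2, 3, 6, 7, 10, 11, 14, 15,
                                         128, 128, 128, 128, 128, 128, 128, 128};
    static const uint8_t shuffle2[16] = {128, 128, 128, 128, 128, 128, 128, 128,
                                         0, 1, 4, 5, 8, 9, 12, 13};
    static const uint8_t shuffle3[16] = {0, 1, 4, 5, 8, 9, 12, 13,
                                         128, 128, 128, 128, 128, 128, 128, 128};
    uint8_t a[16], b[16], t0[16], t1[16], odd[16], even[16];

    memcpy(a, v, 16);      /* first vector  */
    memcpy(b, v + 8, 16);  /* second vector */

    /* Odd elements: each shuffle fills one half and zeroes the other,
       so OR-ing the two halves merges them without interference. */
    pshufb_model(b, shuffle0, t0);
    pshufb_model(a, shuffle1, t1);
    for (int i = 0; i < 16; i++)
        odd[i] = t0[i] | t1[i];

    /* Even elements, built the same way. */
    pshufb_model(b, shuffle2, t0);
    pshufb_model(a, shuffle3, t1);
    for (int i = 0; i < 16; i++)
        even[i] = t0[i] | t1[i];

    memcpy(v, odd, 16);
    memcpy(v + 8, even, 16);
}
```

Filling `v` with the values 0 through 15 and calling `shuffle_even_odd_model(v)` should leave 1, 3, ..., 15 in `v[0..7]` and 0, 2, ..., 14 in `v[8..15]`, which matches the two `shufflevector` index lists in the IR.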
You can find the "llc" compiler in the "Clang Binaries" package from the LLVM download page: http://www.llvm.org/releases/download.html