it should be possible to get the de-interleave code by simply inverting the operations
Get used to disappointment and frustration with the non-orthogonality of Intel's SIMD instruction sets. There is no direct inverse for punpck. The SSE/AVX pack instructions are for narrowing the element size. (So one packusdw is the inverse of punpck[lh]wd against zero, but not when used with two arbitrary vectors.) Also, pack instructions are only available for 32->16 (dword to word) and 16->8 (word to byte). There is no packusqd (64->32).
PACK instructions are only available with saturation, not truncation (until AVX512 vpmovqd), so for this use case we would need to prepare 4 different input vectors for 2 PACK instructions. This turns out to be horrible, much worse than your 3-shuffle solution (see unzip32_pack() in the Godbolt link below).
There is a 2-input shuffle that will do what you want for 32-bit elements: shufps. The low 2 elements of the result can be any 2 elements of the first vector, and the high 2 elements can be any 2 elements of the second vector. The shuffle we want fits those constraints, so we can use it.
We can solve the whole problem in 2 instructions (plus a movdqa for the non-AVX version, because shufps destroys the left input register):
    inputs: a = (A0 A1 A2 A3)   b = (B0 B1 B2 B3)
    _mm_shuffle_ps(a, b, _MM_SHUFFLE(2,0,2,0));
_MM_SHUFFLE() uses most-significant-element-first notation, like all of Intel's documentation. Your notation is the opposite.
The only intrinsic for shufps uses __m128 / __m256 vectors (float, not integer), so you have to cast to use it. _mm_castsi128_ps is a reinterpret_cast: it compiles to zero instructions.
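Putting that together, here is a sketch of what the two wrapper functions could look like (my reconstruction, not the exact Godbolt source; it is consistent with the asm output below):

    #include <immintrin.h>  // SSE intrinsics

    // De-interleave 32-bit elements of a=(A0 A1 A2 A3) and b=(B0 B1 B2 B3)
    // with one shufps each, casting to __m128 and back.
    static inline __m128i unziplo(__m128i a, __m128i b)
    {
        __m128 aps = _mm_castsi128_ps(a);
        __m128 bps = _mm_castsi128_ps(b);
        __m128 lo  = _mm_shuffle_ps(aps, bps, _MM_SHUFFLE(2, 0, 2, 0)); // (A0 A2 B0 B2)
        return _mm_castps_si128(lo);
    }

    static inline __m128i unziphi(__m128i a, __m128i b)
    {
        __m128 aps = _mm_castsi128_ps(a);
        __m128 bps = _mm_castsi128_ps(b);
        __m128 hi  = _mm_shuffle_ps(aps, bps, _MM_SHUFFLE(3, 1, 3, 1)); // (A1 A3 B1 B3)
        return _mm_castps_si128(hi);
    }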
gcc will inline these to a single instruction each. With the static inline removed, we can see how they'd compile as stand-alone functions. I put them on the Godbolt compiler explorer:
    unziplo(long long __vector(2), long long __vector(2)):
        shufps  xmm0, xmm1, 136
        ret
    unziphi(long long __vector(2), long long __vector(2)):
        shufps  xmm0, xmm1, 221
        ret
Using FP shuffles on integer data is fine on recent Intel/AMD CPUs. There is no extra bypass-delay latency (see this answer, which summarizes what Agner Fog's microarch guide says about it). It costs an extra cycle of latency on Intel Nehalem, but may still be the best choice there. FP loads/shuffles won't fault on or corrupt integer bit patterns that happen to represent a NaN; only actual FP math instructions care about that.
Fun fact: on AMD Bulldozer-family (and Intel Core 2) CPUs, FP shuffles like shufps still run in the ivec domain, so they actually have extra latency when used between FP instructions, but not between integer instructions!
Unlike ARM NEON / ARMv8 SIMD, x86 SSE doesn't have any 2-output-register instructions, and they're rare in x86 generally. (They exist, e.g. mul r64, but are always decoded to multiple uops on current CPUs.)
It always takes at least 2 instructions to create 2 vectors of results. It would be ideal if they didn't both need to run on the shuffle port, since recent Intel CPUs have a shuffle throughput of only 1 per clock. Instruction-level parallelism doesn't help much when all your instructions are shuffles.
For throughput, 1 shuffle + 2 non-shuffles could be more efficient than 2 shuffles, and have the same latency. Or even 2 shuffles and 2 blends could be more efficient than 3 shuffles, depending on what the bottleneck is in the surrounding code. But I don't think we can replace 2x shufps with that few instructions.
Without shufps :
Your shuffle + unpacklo/hi is pretty good. It would be 4 shuffles total: 2 pshufd to prepare the inputs, then 2 punpckl/h. That's likely to be worse than any bypass latency, except on Nehalem, in cases where latency matters but throughput doesn't.
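For reference, here is one way that 4-shuffle version could be written (a sketch, not necessarily the exact code from the question; I'm using the qword unpacks so the output layout matches the shufps version):

    #include <immintrin.h>  // SSE2

    // Group even/odd dwords within each input, then combine 64-bit halves.
    static inline void unzip32_pshufd(__m128i a, __m128i b,
                                      __m128i *even, __m128i *odd)
    {
        __m128i a_grp = _mm_shuffle_epi32(a, _MM_SHUFFLE(3, 1, 2, 0)); // (A0 A2 A1 A3)
        __m128i b_grp = _mm_shuffle_epi32(b, _MM_SHUFFLE(3, 1, 2, 0)); // (B0 B2 B1 B3)
        *even = _mm_unpacklo_epi64(a_grp, b_grp);  // (A0 A2 B0 B2)
        *odd  = _mm_unpackhi_epi64(a_grp, b_grp);  // (A1 A3 B1 B3)
    }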
Any other option would seem to require preparing 4 input vectors for either a blend or packss. See @Mysticial's answer to _mm_shuffle_ps() equivalent for integer vectors (__m128i)? for the blend option. For two outputs, it would take a total of 4 shuffles to prepare the inputs, and then 2x pblendw (fast) or vpblendd (even faster).
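A sketch of the blend option for one of the two outputs (assuming AVX2 for vpblendd via _mm_blend_epi32; the function name is made up):

    #include <immintrin.h>  // AVX2 for _mm_blend_epi32 (vpblendd)

    // 2 pshufd to line up the wanted elements, then 1 blend per output vector.
    static inline __m128i unzip32_lo_blend(__m128i a, __m128i b)
    {
        __m128i a_lo = _mm_shuffle_epi32(a, _MM_SHUFFLE(3, 2, 2, 0)); // (A0 A2 A2 A3): only low half matters
        __m128i b_lo = _mm_shuffle_epi32(b, _MM_SHUFFLE(2, 0, 1, 0)); // (B0 B1 B0 B2): only high half matters
        return _mm_blend_epi32(a_lo, b_lo, 0xC);  // low 2 dwords from a_lo, high 2 from b_lo = (A0 A2 B0 B2)
    }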
Using packssdw or wb for 16- or 8-bit elements would also work: 2x pand to mask off the odd elements of a and b, and 2x psrld to shift the odd elements down to even positions. That sets you up for 2x packssdw to create the two output vectors. 6 total instructions, plus a lot of movdqa because those all destroy their inputs (unlike pshufd, which is a copy+shuffle).
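A sketch of that mask/shift/pack idea for 16-bit elements. I'm using _mm_packus_epi32 (SSE4.1 packusdw) so that arbitrary 16-bit patterns survive the zero-extension; with plain SSE2 packssdw you'd need sign-extension instead, since values >= 0x8000 would saturate:

    #include <immintrin.h>  // SSE4.1 for _mm_packus_epi32

    // De-interleave 16-bit elements of a=(A0..A7) and b=(B0..B7).
    static inline __m128i unzip16_even(__m128i a, __m128i b)
    {
        const __m128i lo16 = _mm_set1_epi32(0x0000FFFF);
        __m128i a_even = _mm_and_si128(a, lo16);   // zero-extend even words to dwords
        __m128i b_even = _mm_and_si128(b, lo16);
        return _mm_packus_epi32(a_even, b_even);   // (A0 A2 A4 A6 B0 B2 B4 B6)
    }

    static inline __m128i unzip16_odd(__m128i a, __m128i b)
    {
        __m128i a_odd = _mm_srli_epi32(a, 16);     // shift odd words down to even positions
        __m128i b_odd = _mm_srli_epi32(b, 16);
        return _mm_packus_epi32(a_odd, b_odd);     // (A1 A3 A5 A7 B1 B3 B5 B7)
    }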
Nehalem is the only CPU where it might be worth using something other than 2x shufps, because of its high (2c) bypass delay. It has 2-per-clock shuffle throughput, and pshufd is a copy+shuffle, so 2x pshufd to prepare copies of a and b would only need one extra movdqa after that to get the punpckldq and punpckhdq results into separate registers. (movdqa isn't free: it has 1c latency and needs a vector execution port on Nehalem. It's only cheaper than a shuffle if you're bottlenecked on shuffle throughput, rather than on overall front-end bandwidth (uop throughput) or something else.)
I recommend just using 2x shufps. It will be good on the average CPU, and not horrible anywhere.
AVX512
AVX512 introduced a lane-crossing pack-with-truncation instruction that narrows a single vector (instead of being a 2-input shuffle). It's the inverse of pmovzx, and can narrow 64b->8b or any other combination, not just by a factor of 2.
For this case, __m256i _mm512_cvtepi64_epi32 (__m512i a) (vpmovqd) will take the even 32-bit elements from a vector and pack them together (i.e. the low half of each 64-bit element). It's not a good building block for de-interleaving, though, since you need something else to get the odd elements into place.
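A minimal sketch of what vpmovqd gives you, and the extra shift you'd need for the odd elements (illustrative helper names):

    #include <immintrin.h>  // AVX-512F

    static inline __m256i even_dwords(__m512i v)
    {
        return _mm512_cvtepi64_epi32(v);                        // vpmovqd: low dword of each qword
    }

    static inline __m256i odd_dwords(__m512i v)
    {
        return _mm512_cvtepi64_epi32(_mm512_srli_epi64(v, 32)); // extra vpsrlq to bring odd dwords down
    }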
It also comes in signed/unsigned saturation versions. The instructions even have a memory-destination form that the intrinsics expose, so you can do a masked store.
But for this problem, as Mysticial points out, AVX512 provides 2-input lane-crossing shuffles which you can use like shufps to solve the whole problem in just two shuffles: vpermi2d/vpermt2d.
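For example, a sketch using the 128-bit form (requires AVX-512VL; in the index vector, elements 4-7 select from the second source):

    #include <immintrin.h>  // AVX-512F + AVX-512VL

    static inline __m128i unziplo_avx512(__m128i a, __m128i b)
    {
        const __m128i even_idx = _mm_setr_epi32(0, 2, 4, 6);  // a0, a2, b0, b2
        return _mm_permutex2var_epi32(a, even_idx, b);        // vpermt2d
    }

    static inline __m128i unziphi_avx512(__m128i a, __m128i b)
    {
        const __m128i odd_idx = _mm_setr_epi32(1, 3, 5, 7);   // a1, a3, b1, b3
        return _mm_permutex2var_epi32(a, odd_idx, b);         // vpermt2d
    }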