SSE/AVX equivalent for NEON vuzp

Common vector extensions (SSE, AVX, etc.) provide two unpack/interleave operations for each element size, e.g. the SSE intrinsics _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, they do the following:

 inputs:      (A0 A1 A2 A3) (B0 B1 B2 B3)
 unpacklo/hi: (A0 B0 A1 B1) (A2 B2 A3 B3)

The equivalent of this unpacking in the ARM NEON instruction set is vzip. However, NEON also provides the vuzp operation, which is the inverse of vzip. For 4 elements in a vector, it does the following:

 inputs: (A0 A1 A2 A3) (B0 B1 B2 B3)
 vuzp:   (A0 A2 B0 B2) (A1 A3 B1 B3)
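For reference, the operation being asked about is directly available as a NEON intrinsic. A minimal sketch using arm_neon.h (not part of the original post):

 #include <arm_neon.h>

 // vuzpq_u32 de-interleaves two vectors of 4 x 32-bit elements:
 //   result.val[0] = (A0 A2 B0 B2), result.val[1] = (A1 A3 B1 B3)
 uint32x4x2_t unzip32_neon(uint32x4_t a, uint32x4_t b) {
     return vuzpq_u32(a, b);
 }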

How can vuzp be implemented efficiently using SSE or AVX intrinsics? There doesn't seem to be an instruction for it. For 4 elements, I assume it can be done with a shuffle and a subsequent unpack moving 2 elements:

 inputs:        (A0 A1 A2 A3) (B0 B1 B2 B3)
 shuffle:       (A0 A2 A1 A3) (B0 B2 B1 B3)
 unpacklo/hi 2: (A0 A2 B0 B2) (A1 A3 B1 B3)
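In SSE2 intrinsics, this two-step idea might look roughly like the following sketch (the helper name is illustrative, not from the post):

 #include <immintrin.h>

 // Shuffle each input so its even elements come first, then recombine
 // the two vectors with 64-bit unpacks (SSE2).
 static inline void unzip32_shuffle_unpack(__m128i a, __m128i b,
                                           __m128i *lo, __m128i *hi) {
     __m128i a_sh = _mm_shuffle_epi32(a, _MM_SHUFFLE(3,1,2,0)); // (A0 A2 A1 A3)
     __m128i b_sh = _mm_shuffle_epi32(b, _MM_SHUFFLE(3,1,2,0)); // (B0 B2 B1 B3)
     *lo = _mm_unpacklo_epi64(a_sh, b_sh);                      // (A0 A2 B0 B2)
     *hi = _mm_unpackhi_epi64(a_sh, b_sh);                      // (A1 A3 B1 B3)
 }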

Is there a more efficient solution using a single instruction? (Maybe for SSE first; I know that for AVX we may have the additional problem that shuffles and unpacks don't cross lanes.)

Knowing this can be useful for writing code for swizzling and deswizzling data (it should be possible to derive the deswizzle code simply by inverting the swizzle code based on the unpack operations).

Edit: Here is the 8-element version. This is the effect of NEON vuzp:

 input: (A0 A1 A2 A3 A4 A5 A6 A7) (B0 B1 B2 B3 B4 B5 B6 B7)
 vuzp:  (A0 A2 A4 A6 B0 B2 B4 B6) (A1 A3 A5 A7 B1 B3 B5 B7)

This is my version, with one shuffle and one unpack for each output vector (it seems to generalize to larger element counts):

 input:         (A0 A1 A2 A3 A4 A5 A6 A7) (B0 B1 B2 B3 B4 B5 B6 B7)
 shuffle:       (A0 A2 A4 A6 A1 A3 A5 A7) (B0 B2 B4 B6 B1 B3 B5 B7)
 unpacklo/hi 4: (A0 A2 A4 A6 B0 B2 B4 B6) (A1 A3 A5 A7 B1 B3 B5 B7)
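A sketch of the same idea for 8 x 16-bit elements, using a pshufb byte shuffle (SSSE3) for the grouping step (helper name and mask constant are illustrative):

 #include <immintrin.h>

 // Group even words into the low half and odd words into the high half,
 // then combine the two vectors with 64-bit unpacks.
 static inline void unzip16_shuffle_unpack(__m128i a, __m128i b,
                                           __m128i *lo, __m128i *hi) {
     // byte indices selecting words 0,2,4,6 then 1,3,5,7
     const __m128i grp = _mm_setr_epi8(0, 1, 4, 5, 8, 9, 12, 13,
                                       2, 3, 6, 7, 10, 11, 14, 15);
     __m128i a_sh = _mm_shuffle_epi8(a, grp); // (A0 A2 A4 A6 A1 A3 A5 A7)
     __m128i b_sh = _mm_shuffle_epi8(b, grp); // (B0 B2 B4 B6 B1 B3 B5 B7)
     *lo = _mm_unpacklo_epi64(a_sh, b_sh);    // (A0 A2 A4 A6 B0 B2 B4 B6)
     *hi = _mm_unpackhi_epi64(a_sh, b_sh);    // (A1 A3 A5 A7 B1 B3 B5 B7)
 }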

The method proposed by EOF is correct, but it requires log2(8) = 3 unpack operations for each output:

 input:         (A0 A1 A2 A3 A4 A5 A6 A7) (B0 B1 B2 B3 B4 B5 B6 B7)
 unpacklo/hi 1: (A0 B0 A1 B1 A2 B2 A3 B3) (A4 B4 A5 B5 A6 B6 A7 B7)
 unpacklo/hi 1: (A0 A4 B0 B4 A1 A5 B1 B5) (A2 A6 B2 B6 A3 A7 B3 B7)
 unpacklo/hi 1: (A0 A2 A4 A6 B0 B2 B4 B6) (A1 A3 A5 A7 B1 B3 B5 B7)
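Transcribed into intrinsics for 16-bit elements, that pure-unpack chain might look like this sketch (my own transcription of the diagram, not EOF's code):

 #include <immintrin.h>

 // log2(8) = 3 rounds of unpacklo/hi, producing both outputs (SSE2).
 static inline void unzip16_unpack_only(__m128i a, __m128i b,
                                        __m128i *lo, __m128i *hi) {
     __m128i t0 = _mm_unpacklo_epi16(a, b);   // (A0 B0 A1 B1 A2 B2 A3 B3)
     __m128i t1 = _mm_unpackhi_epi16(a, b);   // (A4 B4 A5 B5 A6 B6 A7 B7)
     __m128i u0 = _mm_unpacklo_epi16(t0, t1); // (A0 A4 B0 B4 A1 A5 B1 B5)
     __m128i u1 = _mm_unpackhi_epi16(t0, t1); // (A2 A6 B2 B6 A3 A7 B3 B7)
     *lo = _mm_unpacklo_epi16(u0, u1);        // (A0 A2 A4 A6 B0 B2 B4 B6)
     *hi = _mm_unpackhi_epi16(u0, u1);        // (A1 A3 A5 A7 B1 B3 B5 B7)
 }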
1 answer

it should be possible to get the deswizzle code by simply inverting the operations

Get used to disappointment and frustration with Intel's non-orthogonal vector ISA. There is no direct inverse for punpck. The SSE/AVX pack instructions are for narrowing the element size. (So a single packusdw is the inverse of punpck[lh]wd against zero, but not when used with two arbitrary vectors.) Also, pack instructions are only available for 32->16 (dword to word) and 16->8 (word to byte). There is no packusqd (64->32).

PACK instructions are only available with saturation, not truncation (until AVX512 vpmovqd), so for this use case we would need to prepare 4 different input vectors for 2 PACK instructions. This turns out to be horrible, much worse than your 3-shuffle solution (see unzip32_pack() in the Godbolt link below).


There is a 2-input shuffle that will do what you want for 32-bit elements: shufps. The low 2 elements of the result can be any 2 elements of the first vector, and the high 2 elements can be any 2 elements of the second vector. The shuffle we want fits those constraints, so we can use it.

We can solve the whole problem in 2 instructions (plus a movdqa for the non-AVX version, because shufps destroys the left input register):

 inputs: a=(A0 A1 A2 A3) b=(B0 B1 B2 B3)
 _mm_shuffle_ps(a,b,_MM_SHUFFLE(2,0,2,0)); // (A0 A2 B0 B2)
 _mm_shuffle_ps(a,b,_MM_SHUFFLE(3,1,3,1)); // (A1 A3 B1 B3)

_MM_SHUFFLE() uses most-significant-element-first notation, like all of Intel's documentation. Your notation is the opposite.

The only intrinsic for shufps takes __m128/__m256 vectors (float, not integer), so you have to cast to use it. _mm_castsi128_ps is a reinterpret_cast: it compiles to zero instructions.

 #include <immintrin.h>

 static inline __m128i unziplo(__m128i a, __m128i b) {
     __m128 aps = _mm_castsi128_ps(a);
     __m128 bps = _mm_castsi128_ps(b);
     __m128 lo = _mm_shuffle_ps(aps, bps, _MM_SHUFFLE(2,0,2,0));
     return _mm_castps_si128(lo);
 }

 static inline __m128i unziphi(__m128i a, __m128i b) {
     __m128 aps = _mm_castsi128_ps(a);
     __m128 bps = _mm_castsi128_ps(b);
     __m128 hi = _mm_shuffle_ps(aps, bps, _MM_SHUFFLE(3,1,3,1));
     return _mm_castps_si128(hi);
 }

gcc will inline these to a single instruction each. With the static inline removed, we can see how they would compile as stand-alone functions. I put them on the Godbolt compiler explorer:

 unziplo(long long __vector(2), long long __vector(2)):
     shufps  xmm0, xmm1, 136
     ret
 unziphi(long long __vector(2), long long __vector(2)):
     shufps  xmm0, xmm1, 221
     ret
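A quick usage sketch (the test values and printing are mine, assuming the unziplo/unziphi definitions above):

 #include <stdio.h>

 int main(void) {
     __m128i a = _mm_setr_epi32(0, 1, 2, 3);     // A0..A3
     __m128i b = _mm_setr_epi32(10, 11, 12, 13); // B0..B3
     int lo[4], hi[4];
     _mm_storeu_si128((__m128i*)lo, unziplo(a, b)); // expect 0 2 10 12
     _mm_storeu_si128((__m128i*)hi, unziphi(a, b)); // expect 1 3 11 13
     printf("lo: %d %d %d %d\n", lo[0], lo[1], lo[2], lo[3]);
     printf("hi: %d %d %d %d\n", hi[0], hi[1], hi[2], hi[3]);
     return 0;
 }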

Using FP shuffles for integer data is fine on recent Intel/AMD CPUs. There is no extra bypass-delay latency (see this answer, which summarizes what Agner Fog's microarchitecture guide says about it). It has extra latency on Intel Nehalem, but may still be the best choice there. FP loads/shuffles won't fault on or corrupt integer bit patterns that happen to represent a NaN; only actual FP math instructions care about that.

Fun fact: on AMD Bulldozer-family CPUs (and Intel Core 2), FP shuffles like shufps still run in the ivec domain, so they actually have extra latency when used between FP instructions, but not between integer instructions!


Unlike ARM NEON / ARMv8 SIMD, x86 SSE has no instructions with 2 output registers, and they are rare in x86 in general. (They exist, e.g. mul r64, but always decode to multiple uops on current CPUs.)

So it will always take at least 2 instructions to create 2 result vectors. Ideally they wouldn't both need to run on the shuffle port, since recent Intel CPUs have a shuffle throughput of only 1 per clock. Instruction-level parallelism doesn't help much when all your instructions are shuffles.

For throughput, 1 shuffle + 2 non-shuffle instructions could be more efficient than 2 shuffles and have the same latency. Or even 2 shuffles and 2 blends could be more efficient than 3 shuffles, depending on what the bottleneck is in the surrounding code. But I don't think we can replace 2x shufps with that few instructions.


Without shufps:

Your shuffle + unpacklo/hi is pretty good. It would be 4 shuffles total: 2 pshufd to prepare the inputs, then 2 punpckl/h. That is likely to be worse than any bypass latency, except on Nehalem in cases where latency matters but throughput doesn't.

Any other option would seem to require preparing 4 input vectors, for either a blend or packss. See @Mysticial's answer to _mm_shuffle_ps() equivalent for integer vectors (__m128i)? for the blend option. For two outputs, that takes a total of 4 shuffles to prepare the inputs, and then 2x pblendw (fast) or vpblendd (even faster).
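A sketch of that blend variant for 32-bit elements (my own illustration; it assumes SSE4.1 for pblendw, and with AVX2 the blends could be _mm_blend_epi32 / vpblendd instead):

 #include <immintrin.h>

 // 4 shuffles + 2 blends: put the wanted elements of a in the low half,
 // the wanted elements of b in the high half, then blend the halves.
 static inline void unzip32_blend(__m128i a, __m128i b,
                                  __m128i *lo, __m128i *hi) {
     __m128i a_even = _mm_shuffle_epi32(a, _MM_SHUFFLE(3,1,2,0)); // (A0 A2 A1 A3)
     __m128i b_even = _mm_shuffle_epi32(b, _MM_SHUFFLE(2,0,3,1)); // (B1 B3 B0 B2)
     __m128i a_odd  = _mm_shuffle_epi32(a, _MM_SHUFFLE(2,0,3,1)); // (A1 A3 A0 A2)
     __m128i b_odd  = _mm_shuffle_epi32(b, _MM_SHUFFLE(3,1,2,0)); // (B0 B2 B1 B3)
     *lo = _mm_blend_epi16(a_even, b_even, 0xF0); // (A0 A2 B0 B2)
     *hi = _mm_blend_epi16(a_odd,  b_odd,  0xF0); // (A1 A3 B1 B3)
 }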

Using packssdw or packsswb for 16 or 8 bit elements would also work. It would take 2x pand to mask off the odd elements of a and b, and 2x psrld to shift the odd elements down to the even positions. That sets you up for 2x packssdw to create the two output vectors. 6 instructions in total, plus many movdqa, because those all destroy their inputs (unlike pshufd, which is a copy+shuffle).

 // don't use this, it's not optimal for any CPU
 void unzip32_pack(__m128i &a, __m128i &b) {
     __m128i a_even = _mm_and_si128(a, _mm_setr_epi32(-1, 0, -1, 0)); // keep even dwords
     __m128i a_odd  = _mm_srli_epi64(a, 32);                          // odd dwords down to even positions
     __m128i b_even = _mm_and_si128(b, _mm_setr_epi32(-1, 0, -1, 0));
     __m128i b_odd  = _mm_srli_epi64(b, 32);
     __m128i lo = _mm_packs_epi16(a_even, b_even);
     __m128i hi = _mm_packs_epi16(a_odd, b_odd);
     a = lo;
     b = hi;
 }

Nehalem is the only CPU where it might be worth using something other than 2x shufps, because of its high (2c) bypass delay. It has a shuffle throughput of 2 per clock, and pshufd is a copy+shuffle, so 2x pshufd to prepare copies of a and b would only need one extra movdqa after that to get the punpckldq and punpckhdq results into separate registers. (movdqa isn't free; it has 1c latency and needs an execution port on Nehalem. It is only cheaper than a shuffle if you are bottlenecked on shuffle throughput, rather than on overall front-end bandwidth (uop throughput) or something else.)

I recommend just using 2x shufps. It will be good on the average CPU and not horrible anywhere.


AVX512

AVX512 introduced a lane-crossing pack-with-truncation instruction that narrows a single vector (instead of being a 2-input shuffle). It is the inverse of pmovzx and can narrow 64b->8b or any other combination, not just by a factor of 2.

In this case, __m256i _mm512_cvtepi64_epi32(__m512i a) (vpmovqd) will take the even 32-bit elements from a vector and pack them together (i.e. the low half of each 64-bit element). However, it is still not a very good building block for the de-interleave, since you need something else to get the odd elements into place.

It also comes in signed/unsigned saturation versions. The instructions even have a memory-destination form that the intrinsics expose, so you can do a masked store.
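For illustration, pulling the even and odd 32-bit elements out of a single vector this way might look like the following sketch (assuming AVX512F; the helper names are mine, not code from the answer):

 #include <immintrin.h>

 // Even elements are the low half of each 64-bit lane; odd elements need a
 // shift first. Each vpmovqd narrows one 512-bit vector to 256 bits.
 static inline __m256i even32(__m512i v) {
     return _mm512_cvtepi64_epi32(v);                        // vpmovqd
 }
 static inline __m256i odd32(__m512i v) {
     return _mm512_cvtepi64_epi32(_mm512_srli_epi64(v, 32)); // shift, then vpmovqd
 }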

But for this problem, as Mysticial points out, AVX512 has 2-input lane-crossing shuffles which you can use like shufps to solve the whole problem in just two shuffles: vpermi2d/vpermt2d.
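A sketch of that 2-shuffle AVX512 solution (the index constants and helper name are mine, not Mysticial's code):

 #include <immintrin.h>

 // Two vpermt2d/vpermi2d shuffles de-interleave 2 x 16 dword elements.
 // Index values 0..15 select from a, 16..31 select from b.
 static inline void unzip32_avx512(__m512i a, __m512i b,
                                   __m512i *lo, __m512i *hi) {
     const __m512i even_idx = _mm512_setr_epi32(0, 2, 4, 6, 8, 10, 12, 14,
                                                16, 18, 20, 22, 24, 26, 28, 30);
     const __m512i odd_idx  = _mm512_setr_epi32(1, 3, 5, 7, 9, 11, 13, 15,
                                                17, 19, 21, 23, 25, 27, 29, 31);
     *lo = _mm512_permutex2var_epi32(a, even_idx, b); // (A0 A2 ... A14 B0 B2 ... B14)
     *hi = _mm512_permutex2var_epi32(a, odd_idx,  b); // (A1 A3 ... A15 B1 B3 ... B15)
 }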

