How to efficiently perform int8 / int64 conversion using SSE?

Question


I'm implementing conversions between SSE types, and I found that implementing the int8 -> int64 widening conversion for pre-SSE4.1 targets is cumbersome.

The straightforward implementation would be:

    inline __m128i convert_i8_i64(__m128i a)
    {
    #ifdef __SSE4_1__
        return _mm_cvtepi8_epi64(a);
    #else
        a = _mm_unpacklo_epi8(a, a);
        a = _mm_unpacklo_epi16(a, a);
        a = _mm_unpacklo_epi32(a, a);
        return _mm_srai_epi64(a, 56); // missing intrinsic!
    #endif
    }

But since _mm_srai_epi64 doesn't exist until AVX-512, there are two options at this point:

implementing _mm_srai_epi64, or
implementing convert_i8_i64 in a different way.

I'm not sure which one would be the most efficient solution. Any ideas?


c++ x86 sse simd intrinsics

plasmacel Dec 26 '16 at 19:00


2 answers

With SSSE3, you can use pshufb to avoid most of the unpacks. Using anatolyg's a / A notation:

    ;; input in xmm0                  ;; x x x x x x x x | x x x x x x a b
    pshufb  xmm0, [low_to_upper]      ;; a 0 0 0 0 0 0 0 | b 0 0 0 0 0 0 0
    psrad   xmm0, 24                  ;; A A A a 0 0 0 0 | B B B b 0 0 0 0
    pshufb  xmm0, [bcast_signextend]  ;; A A A A A A A a | B B B B B B B b

Without SSSE3, I think you might be able to do something with PSHUFLW, PSHUFD, and maybe POR instead of some of the PUNPCK steps. But nothing I've thought of is actually better than the unpacks, unless you're on Core2 or another slow-shuffle CPU, where pshuflw is faster than punpcklbw.
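The pshufb / psrad / pshufb sequence can be sketched with intrinsics as follows. The answer does not spell out the contents of low_to_upper and bcast_signextend, so the two shuffle-control constants below are one plausible choice, not the original ones. The function name and the SSSE3_TARGET macro are made up for this example.

```cpp
#include <tmmintrin.h> // SSSE3: _mm_shuffle_epi8

// Assumption: on GCC/Clang this attribute lets one function use SSSE3
// without building the whole file with -mssse3. Elsewhere it expands to nothing.
#if defined(__GNUC__)
#define SSSE3_TARGET __attribute__((target("ssse3")))
#else
#define SSSE3_TARGET
#endif

// Hypothetical name: sign-extends the two lowest int8 lanes to two int64 lanes.
SSSE3_TARGET inline __m128i convert_i8_i64_ssse3(__m128i a)
{
    // Assumed mask: move byte 0 to the top byte of the low qword and byte 1
    // to the top byte of the high qword; an index of -1 zeroes the output byte.
    const __m128i low_to_upper = _mm_setr_epi8(
        -1, -1, -1, -1, -1, -1, -1, 0,
        -1, -1, -1, -1, -1, -1, -1, 1);
    // Assumed mask: byte 4 / 12 holds the value, byte 5 / 13 holds its sign fill.
    const __m128i bcast_signextend = _mm_setr_epi8(
        4, 5, 5, 5, 5, 5, 5, 5,
        12, 13, 13, 13, 13, 13, 13, 13);

    a = _mm_shuffle_epi8(a, low_to_upper);      // a 0 0 0 0 0 0 0 | b 0 0 0 0 0 0 0
    a = _mm_srai_epi32(a, 24);                  // A A A a 0 0 0 0 | B B B b 0 0 0 0
    return _mm_shuffle_epi8(a, bcast_signextend); // A A A A A A A a | B B B B B B B b
}
```

Whether this beats the unpack sequence depends on the CPU's shuffle throughput, so it is worth measuring on the actual target.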


Peter Cordes Dec 27 '16 at 23:52


anatolyg · Accepted Answer · 2016-12-26T19:26:29+0000

The unpacking intrinsics are used here in a funny way. They "duplicate" the data instead of adding sign extension, as one would expect. For example, before the first step, you have the following in your register:

    x x x x x x x x x x x x x x a b

If you convert a and b to 16 bits, you should get the following:

    x x x x x x x x x x x x A a B b

Here A and B are the sign extensions of a and b, that is, each of them is either 0 or -1.

Instead, your code gives:

    x x x x x x x x x x x x a a b b

And then you convert it to the correct result by shifting right.

However, you are not obliged to use the same operand twice in the "unpack" intrinsics. You could get the desired result if you "unpacked" the following two registers:

    x x x x x x x x x x x x x x a b
    x x x x x x x x x x x x x x A B

That is:

    a = _mm_unpacklo_epi8(a, _mm_srai_epi8(a, 8));

(if an _mm_srai_epi8 intrinsic actually existed)
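Since there is no 8-bit arithmetic shift, one possible stand-in is a signed compare against zero, which yields 0xFF for negative bytes and 0x00 otherwise. That is the same sign byte the hypothetical shift would produce. A minimal sketch of the 8-to-16-bit step under that substitution (the compare trick is not part of this answer, and convert_i8_i16_sse2 is a made-up name):

```cpp
#include <emmintrin.h> // SSE2

// Hypothetical name: sign-extends the eight lowest int8 lanes to int16 lanes.
inline __m128i convert_i8_i16_sse2(__m128i a)
{
    // 0xFF where the byte is negative, 0x00 otherwise: the "A B" register
    // that an 8-bit arithmetic shift would have produced.
    __m128i sign = _mm_cmpgt_epi8(_mm_setzero_si128(), a);
    // Interleave value bytes (low halves) with sign bytes (high halves).
    return _mm_unpacklo_epi8(a, sign);
}
```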

You can apply the same idea to the last stage of your conversion. You want to "unpack" the following two registers:

    x x x x x x x x A A A a B B B b
    x x x x x x x x A A A A B B B B

To get them, right-shift the 32-bit data:

    _mm_srai_epi32(a, 24)
    _mm_srai_epi32(a, 32)

So the last "unpack" is:

    _mm_unpacklo_epi32(_mm_srai_epi32(a, 24), _mm_srai_epi32(a, 32));
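Putting the pieces together, the pre-SSE4.1 branch might look like the sketch below. It keeps the first two "duplicating" unpacks from the question and replaces the third unpack plus the missing 64-bit shift with the two 32-bit shifts. The function name is made up, and the sketch uses a shift count of 31 for the sign fill, which gives the same all-sign-bits result as the 32 used above.

```cpp
#include <emmintrin.h> // SSE2

// Hypothetical name: sign-extends the two lowest int8 lanes to two int64 lanes
// using only SSE2.
inline __m128i convert_i8_i64_sse2(__m128i a)
{
    a = _mm_unpacklo_epi8(a, a);   // each byte duplicated:      ... a a b b
    a = _mm_unpacklo_epi16(a, a);  // each byte repeated 4 times: a a a a b b b b

    __m128i lo = _mm_srai_epi32(a, 24); // sign-extended values: A A A a | B B B b
    __m128i hi = _mm_srai_epi32(a, 31); // pure sign fill:       A A A A | B B B B

    // Interleave the two lowest dwords: each value followed by its sign fill.
    return _mm_unpacklo_epi32(lo, hi);
}
```

The result holds the extended byte 0 in the low 64-bit lane and the extended byte 1 in the high lane, matching what _mm_cvtepi8_epi64 returns.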
