How to efficiently perform int8 / int64 conversion using SSE?

I am implementing conversions between SSE types, and I found that implementing the int8 -> int64 sign extension for targets older than SSE4.1 is cumbersome.

A straightforward implementation would be:

inline __m128i convert_i8_i64(__m128i a)
{
#ifdef __SSE4_1__
    return _mm_cvtepi8_epi64(a);
#else
    a = _mm_unpacklo_epi8(a, a);
    a = _mm_unpacklo_epi16(a, a);
    a = _mm_unpacklo_epi32(a, a);
    return _mm_srai_epi64(a, 56); // missing intrinsic!
#endif
}

But since _mm_srai_epi64 does not exist before AVX-512, there are two options at this point:

  • implement _mm_srai_epi64, or
  • implement convert_i8_i64 in a different way.

I am not sure which one would be the most efficient solution. Any ideas?

2 answers

The unpack intrinsics are used here in a funny way. They "duplicate" the data instead of adding the sign extension, as one would expect. For example, suppose that before the first step your register holds the following:

 xxxxxxxxxxxxxxab 

If you convert a and b to 16 bits, you should get the following:

 xxxxxxxxxxxx A a B b 

Here A and B are the sign extensions of a and b, that is, each of them is either 0 or -1.

Instead, your code gives

 xxxxxxxxxxxx a a b b 

And then you convert it to the correct result by shifting right.

However, you are not required to use the same operand twice in the "unpack" intrinsics. You could get the desired result if you "unpacked" the following two registers:

 xxxxxxxxxxxxxxab
 xxxxxxxxxxxxxxAB

That is:

 a = _mm_unpacklo_epi8(a, _mm_srai_epi8(a, 8)); 

(if an _mm_srai_epi8 intrinsic actually existed)
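The missing byte shift can be worked around. The following is a minimal SSE2 sketch, not part of the original answer: it assumes that comparing against zero with _mm_cmpgt_epi8 is an acceptable stand-in for the nonexistent _mm_srai_epi8, since it yields -1 for negative bytes and 0 otherwise. The function name extend_lo_i8_i16 is hypothetical.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* Sign-extend the low eight int8 lanes of a into eight int16 lanes.
 * Hypothetical helper name, SSE2 only. */
static inline __m128i extend_lo_i8_i16(__m128i a)
{
    /* (0 > a) is all-ones for negative bytes and zero otherwise,
     * which is exactly the "A B" sign byte for each input byte. */
    __m128i sign = _mm_cmpgt_epi8(_mm_setzero_si128(), a);

    /* Interleave value bytes (low half) with sign bytes (high half). */
    return _mm_unpacklo_epi8(a, sign);
}
```

This covers only the 8-bit to 16-bit step; the later stages can use the real arithmetic shifts that SSE2 does provide.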


You can apply the same idea to the last stage of your conversion. You want to "unpack" the following two registers:

 xxxxxxxx AAA a BBB b
 xxxxxxxx AAAA BBBB

To get them, shift the 32-bit data right:

 _mm_srai_epi32(a, 24)
 _mm_srai_epi32(a, 32)

So the last "unpack" is

 _mm_unpacklo_epi32(_mm_srai_epi32(a, 24), _mm_srai_epi32(a, 32)); 
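Putting the pieces together, the whole conversion might look like the sketch below. It is an illustration under assumptions rather than the answerer's exact code: the name convert_i8_i64_sse2 is hypothetical, and a shift count of 31 is used in place of 32, which produces the same all-sign-bits lanes.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* Sign-extend the two lowest int8 lanes of a into two int64 lanes,
 * using SSE2 only. Hypothetical function name. */
static inline __m128i convert_i8_i64_sse2(__m128i a)
{
    /* Duplicate each byte: ...ab -> ...aabb -> ...aaaabbbb */
    a = _mm_unpacklo_epi8(a, a);
    a = _mm_unpacklo_epi16(a, a);

    /* Low 32 bits of each result: sign-extended value (AAAa, BBBb). */
    __m128i lo = _mm_srai_epi32(a, 24);
    /* High 32 bits of each result: pure sign fill (AAAA, BBBB). */
    __m128i hi = _mm_srai_epi32(a, 31);

    /* Interleave the 32-bit halves into two 64-bit lanes. */
    return _mm_unpacklo_epi32(lo, hi);
}
```

On the pre-SSE4.1 path this would replace the third unpack and the missing 64-bit shift with two 32-bit shifts and one unpack.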

With SSSE3, you can use pshufb to avoid most of the unpacks. Using anatoly's a / A notation:

 ;; input in xmm0                   ;; xxxxxxxx | xxxxxxab
 pshufb xmm0, [low_to_upper]        ;; a 0 0 0 0 0 0 0 | b 0 0 0 0 0 0 0
 psrad  xmm0, 24                    ;; AAA a 0 0 0 0 | BBB b 0 0 0 0
 pshufb xmm0, [bcast_signextend]    ;; AAAAAAA a | BBBBBBB b
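The same sequence can be sketched with intrinsics. The answer does not give the contents of the two shuffle masks, so the constants below are an assumption derived from the register diagrams in the comments; the function name convert_i8_i64_ssse3 is hypothetical, and the target attribute is a GCC/Clang convenience so the snippet builds without -mssse3.

```c
#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 */

#if defined(__GNUC__)
__attribute__((target("ssse3")))
#endif
static __m128i convert_i8_i64_ssse3(__m128i a)
{
    /* Assumed mask: move input byte 0 to byte 7 and input byte 1 to
     * byte 15; an index with the high bit set (-1) zeroes the byte. */
    const __m128i low_to_upper = _mm_setr_epi8(
        -1, -1, -1, -1, -1, -1, -1, 0,
        -1, -1, -1, -1, -1, -1, -1, 1);

    /* Assumed mask: per 64-bit half, keep the value byte (index 4 or 12)
     * and broadcast one sign byte (index 5 or 13) into the upper 7 bytes. */
    const __m128i bcast_signextend = _mm_setr_epi8(
        4, 5, 5, 5, 5, 5, 5, 5,
        12, 13, 13, 13, 13, 13, 13, 13);

    a = _mm_shuffle_epi8(a, low_to_upper);     /* a0000000 | b0000000 */
    a = _mm_srai_epi32(a, 24);                 /* AAAa0000 | BBBb0000 */
    return _mm_shuffle_epi8(a, bcast_signextend); /* AAAAAAAa | BBBBBBBb */
}
```

Both masks are loop-invariant, so in a hot loop they would normally be loaded once outside it.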

Without SSSE3, I think you can do something with PSHUFLW, PSHUFD, and possibly POR instead of some of the PUNPCK steps. But nothing I have thought of is actually better than the unpacks, unless you are on Core2 or another CPU with slow shuffles, where pshuflw is faster than punpcklbw.


Source: https://habr.com/ru/post/1013529/

