The unpack intrinsics are used here in an unusual way. They "duplicate" the data instead of adding a sign extension, as you would expect. For example, before the first iteration, your register holds the following:
xxxxxxxxxxxxxxab
If you convert a and b to 16 bits, you should get the following:
xxxxxxxxxxxxAaBb
Here A and B are the sign extensions of a and b, that is, each of them is either 0 or -1.
Instead, your code gives
xxxxxxxxxxxxaabb
And then you convert it to the correct result by shifting it right.
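The duplicate-then-shift pattern described above can be sketched as follows. This is only an illustration under assumptions: the helper name and the array-based interface are hypothetical and not taken from the code under discussion, and only the 8-bit to 16-bit step is shown.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: sign-extend the low 8 bytes of `in`
   to eight 16-bit values using "duplicate, then shift". */
static void widen_i8_to_i16_dup_shift(const int8_t in[16], int16_t out[8])
{
    __m128i a = _mm_loadu_si128((const __m128i *)in);

    /* Interleave a with itself: every 16-bit lane now holds
       the same byte twice (the "a a b b" layout). */
    __m128i dup = _mm_unpacklo_epi8(a, a);

    /* Arithmetic right shift by 8: the upper copy moves into the
       low byte and the sign bit fills the high byte. */
    __m128i wide = _mm_srai_epi16(dup, 8);

    _mm_storeu_si128((__m128i *)out, wide);
}
```

The shift has to be arithmetic (srai, not srli); a logical shift would zero-extend instead.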
However, you are not required to use the same operand twice in the "unpack" intrinsics. You could get the desired result directly if you "unpacked" the following two registers:
xxxxxxxxxxxxxxab
xxxxxxxxxxxxxxAB
That is:
a = _mm_unpacklo_epi8(a, _mm_srai_epi8(a, 8));
(if an _mm_srai_epi8 intrinsic actually existed)
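Since SSE2 has no 8-bit arithmetic shift, one possible substitute is to build the register of sign bytes with a signed compare against zero. This is a swapped-in technique rather than something from the code under discussion, and the helper name below is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: sign-extend the low 8 bytes of `in` to
   eight 16-bit values by unpacking with a separate sign register. */
static void widen_i8_to_i16_mask(const int8_t in[16], int16_t out[8])
{
    __m128i a = _mm_loadu_si128((const __m128i *)in);

    /* 0 > a holds exactly for negative bytes, so each byte of `sign`
       is 0xFF (-1) or 0x00: the "A B" register from the text. */
    __m128i sign = _mm_cmpgt_epi8(_mm_setzero_si128(), a);

    /* Low byte of each 16-bit lane comes from a, high byte from sign. */
    __m128i wide = _mm_unpacklo_epi8(a, sign);

    _mm_storeu_si128((__m128i *)out, wide);
}
```

This variant needs no shift after the unpack, at the cost of one compare and a zeroed register.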
You can apply the same idea to the last stage of your conversion. You want to "unpack" the following two registers:
xxxxxxxxAAAaBBBb
xxxxxxxxAAAABBBB
To get them, shift the 32-bit data right:
_mm_srai_epi32(a, 24)
_mm_srai_epi32(a, 32)
So the last "unpack" is
_mm_unpacklo_epi32(_mm_srai_epi32(a, 24), _mm_srai_epi32(a, 32));
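Putting the stages together, a minimal sketch of an 8-bit to 64-bit sign extension for the two lowest bytes might look like this. It assumes the earlier stages duplicate each byte across its 32-bit lane; the helper name and the array interface are hypothetical. The sketch shifts by 31 for the sign lanes; a count of 32, as written above, gives the same result because counts above 31 fill each lane with its sign bit.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: sign-extend in[0] and in[1] to 64 bits. */
static void widen_i8_to_i64(const int8_t in[16], int64_t out[2])
{
    __m128i a = _mm_loadu_si128((const __m128i *)in);

    /* Duplicate each byte until it fills a 32-bit lane. */
    a = _mm_unpacklo_epi8(a, a);    /* a a b b ...         */
    a = _mm_unpacklo_epi16(a, a);   /* a a a a b b b b ... */

    /* Low halves: the value sign-extended to 32 bits (A A A a). */
    __m128i lo = _mm_srai_epi32(a, 24);
    /* High halves: nothing but copies of the sign bit (A A A A). */
    __m128i hi = _mm_srai_epi32(a, 31);

    /* Interleave 32-bit lanes: each 64-bit lane is hi:lo. */
    __m128i wide = _mm_unpacklo_epi32(lo, hi);

    _mm_storeu_si128((__m128i *)out, wide);
}
```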