Simulating packusdw with SSE2

I'm implementing a fast x888 → 565 pixel conversion function in pixman, following the algorithm described by Intel [pdf]. Their code converts x888 → 555, while I want to convert to 565. Unfortunately, converting to 565 means that the high bit of each 16-bit result can be set, so I can't use the signed-saturating pack instructions. The unsigned pack instruction, packusdw, wasn't added until SSE4.1. I'd like to implement its functionality using SSE2, or find another way to do this.

This function accepts two XMM registers containing 4 32-bit pixels each and outputs one XMM register containing 8 converted RGB565 pixels.

    static force_inline __m128i
    pack_565_2packedx128_128 (__m128i lo, __m128i hi)
    {
        __m128i rb0 = _mm_and_si128 (lo, mask_565_rb);
        __m128i rb1 = _mm_and_si128 (hi, mask_565_rb);

        __m128i t0 = _mm_madd_epi16 (rb0, mask_565_pack_multiplier);
        __m128i t1 = _mm_madd_epi16 (rb1, mask_565_pack_multiplier);

        __m128i g0 = _mm_and_si128 (lo, mask_green);
        __m128i g1 = _mm_and_si128 (hi, mask_green);

        t0 = _mm_or_si128 (t0, g0);
        t1 = _mm_or_si128 (t1, g1);

        t0 = _mm_srli_epi32 (t0, 5);
        t1 = _mm_srli_epi32 (t1, 5);

        /* XXX: maybe there's a way to do this relatively efficiently with SSE2? */
        return _mm_packus_epi32 (t0, t1);
    }

Ideas I thought of:

  • Subtract 0x8000, use _mm_packs_epi32, then add 0x8000 back to every 565 pixel. I tried this, but I couldn't get it to work.

        t0 = _mm_sub_epi16 (t0, mask_8000);
        t1 = _mm_sub_epi16 (t1, mask_8000);
        t0 = _mm_packs_epi32 (t0, t1);
        return _mm_add_epi16 (t0, mask_8000);
  • Shuffle the data instead of packing it. This works for MMX, but since the SSE2 16-bit shuffles operate only on the high or low 64 bits, it would get messy.

  • Save the high bits, set them to zero, do the pack, then restore them afterwards. Seems pretty messy.
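For reference, the bias idea from the first bullet can be sketched so that the subtraction happens per 32-bit lane rather than per 16-bit lane. _mm_sub_epi16 does not borrow into the upper half of each 32-bit value, so the input to _mm_packs_epi32 is not the signed number v - 0x8000, which is one plausible reason the snippet above fails. The helper name packus_epi32_bias_sse2 and the test values are illustrative assumptions, not pixman code:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: unsigned-saturating 32 -> 16 bit pack using a 0x8000 bias. */
static __m128i packus_epi32_bias_sse2(__m128i lo, __m128i hi)
{
    const __m128i bias32 = _mm_set1_epi32(0x8000);  /* one bias per 32-bit lane */
    const __m128i bias16 = _mm_set1_epi16(-32768);  /* bit pattern 0x8000 per 16-bit lane */

    /* Shift the range [0, 0xFFFF] down to [-32768, 32767]. */
    lo = _mm_sub_epi32(lo, bias32);
    hi = _mm_sub_epi32(hi, bias32);

    /* The signed-saturating pack now clamps at the shifted bounds;
     * adding the bias back (mod 2^16) undoes the shift. */
    return _mm_add_epi16(_mm_packs_epi32(lo, hi), bias16);
}

/* Compares the helper against hand-computed unsigned-saturated results. */
static void check_packus_bias(void)
{
    const uint32_t in[8] = {0x0000, 0x0001, 0x7FFF, 0x8000,
                            0xFFFF, 0x10000, 0x12345678, 0xFFFFFFFF};
    const uint16_t expected[8] = {0x0000, 0x0001, 0x7FFF, 0x8000,
                                  0xFFFF, 0xFFFF, 0xFFFF, 0x0000};
    uint16_t out[8];

    __m128i lo = _mm_loadu_si128((const __m128i *)&in[0]);
    __m128i hi = _mm_loadu_si128((const __m128i *)&in[4]);
    _mm_storeu_si128((__m128i *)out, packus_epi32_bias_sse2(lo, hi));

    for (int i = 0; i < 8; i++)
        assert(out[i] == expected[i]);
}
```

The last input, 0xFFFFFFFF, is -1 when read as a signed 32-bit value, so it clamps to 0. Inputs very close to the most negative 32-bit value would wrap in the subtraction, so this sketch should not be relied on for them.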

Are there other (hopefully more effective) ways I could do this?

1 answer

You can sign-extend the low 16 bits of each value first and then use _mm_packs_epi32:

    t0 = _mm_slli_epi32 (t0, 16);
    t0 = _mm_srai_epi32 (t0, 16);
    t1 = _mm_slli_epi32 (t1, 16);
    t1 = _mm_srai_epi32 (t1, 16);
    t0 = _mm_packs_epi32 (t0, t1);
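The reason this works is that after the shift pair every 32-bit lane holds a value in [-32768, 32767], so the signed saturation never triggers and the low 16 bits pass through unchanged. A minimal standalone sketch of that step, with a hypothetical helper name and test values chosen here for illustration:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: packs eight 32-bit values that already fit in
 * [0, 0xFFFF] into eight 16-bit lanes using only SSE2. */
static __m128i pack_u16_range_sse2(__m128i lo, __m128i hi)
{
    /* Move the low 16 bits to the top of each lane, then shift back
     * arithmetically so bit 15 is replicated into bits 16..31. */
    lo = _mm_srai_epi32(_mm_slli_epi32(lo, 16), 16);
    hi = _mm_srai_epi32(_mm_slli_epi32(hi, 16), 16);

    /* Each lane is now a valid signed 16-bit value, so saturation is a no-op. */
    return _mm_packs_epi32(lo, hi);
}

/* Checks that in-range inputs come out as their low 16 bits. */
static void check_pack_u16_range(void)
{
    const uint32_t in[8] = {0x0000, 0x0001, 0x7FFF, 0x8000,
                            0x8001, 0xF81F, 0x07E0, 0xFFFF};
    uint16_t out[8];

    __m128i lo = _mm_loadu_si128((const __m128i *)&in[0]);
    __m128i hi = _mm_loadu_si128((const __m128i *)&in[4]);
    _mm_storeu_si128((__m128i *)out, pack_u16_range_sse2(lo, hi));

    for (int i = 0; i < 8; i++)
        assert(out[i] == (uint16_t)in[i]);
}
```

Unlike packusdw, this truncates rather than saturates when an input is outside [0, 0xFFFF]. That is fine for the 565 case, where the masked and shifted values always fit in 16 bits.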

You could combine this with previous shifts to save two instructions:

    t0 = _mm_slli_epi32 (t0, 16 - 5);
    t0 = _mm_srai_epi32 (t0, 16);
    t1 = _mm_slli_epi32 (t1, 16 - 5);
    t1 = _mm_srai_epi32 (t1, 16);
    t0 = _mm_packs_epi32 (t0, t1);
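Putting the combined shifts into the function from the question gives something like the sketch below, checked against a scalar conversion. The mask and multiplier values are assumptions chosen to be consistent with an x888 layout of 0x00RRGGBB; the question does not show their definitions, and the function names are hypothetical:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Scalar reference: converts one 0x00RRGGBB pixel to RGB565. */
static uint16_t pack_565_scalar(uint32_t p)
{
    return (uint16_t)(((p >> 8) & 0xF800) | ((p >> 5) & 0x07E0) | ((p >> 3) & 0x001F));
}

/* SSE2-only sketch: 8 x888 pixels in two registers -> 8 RGB565 pixels. */
static __m128i pack_565_sse2(__m128i lo, __m128i hi)
{
    /* Assumed constants (not taken from the question). */
    const __m128i mask_565_rb = _mm_set1_epi32(0x00F800F8);
    const __m128i mask_565_pack_multiplier = _mm_set1_epi32(0x20000004);
    const __m128i mask_green = _mm_set1_epi32(0x0000FC00);

    /* Red lands in bits 16..20 and blue in bits 5..9 of each lane. */
    __m128i t0 = _mm_madd_epi16(_mm_and_si128(lo, mask_565_rb), mask_565_pack_multiplier);
    __m128i t1 = _mm_madd_epi16(_mm_and_si128(hi, mask_565_rb), mask_565_pack_multiplier);

    /* Green is already in bits 10..15. */
    t0 = _mm_or_si128(t0, _mm_and_si128(lo, mask_green));
    t1 = _mm_or_si128(t1, _mm_and_si128(hi, mask_green));

    /* Left by 11, arithmetic right by 16: a net right shift by 5
     * that also sign-extends bit 15 of the 565 result. */
    t0 = _mm_srai_epi32(_mm_slli_epi32(t0, 16 - 5), 16);
    t1 = _mm_srai_epi32(_mm_slli_epi32(t1, 16 - 5), 16);

    return _mm_packs_epi32(t0, t1);
}

/* Compares the vector path with the scalar reference on a few pixels. */
static void check_pack_565(void)
{
    const uint32_t px[8] = {0x00000000, 0x00FFFFFF, 0x00FF0000, 0x0000FF00,
                            0x000000FF, 0xFF123456, 0x00808080, 0x00F8FCF8};
    uint16_t out[8];

    __m128i lo = _mm_loadu_si128((const __m128i *)&px[0]);
    __m128i hi = _mm_loadu_si128((const __m128i *)&px[4]);
    _mm_storeu_si128((__m128i *)out, pack_565_sse2(lo, hi));

    for (int i = 0; i < 8; i++)
        assert(out[i] == pack_565_scalar(px[i]));
}
```

Before the shifts each lane holds at most 21 significant bits, so shifting left by 11 moves the top bit of the 565 result exactly into bit 31, and no data is lost.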

Source: https://habr.com/ru/post/918046/

