SSE Instructions: Byte + Short

I have very long byte arrays that need to be added to a destination array of type short (or int). Is there an SSE instruction for this? Or perhaps a set of them?

2 answers

You need to unpack each vector of 8-bit values into two vectors of 16-bit values, and then add them.

    __m128i v  = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
    __m128i vl = _mm_unpacklo_epi8(v, _mm_set1_epi8(0)); // vl = { 7, 6, 5, 4, 3, 2, 1, 0 }
    __m128i vh = _mm_unpackhi_epi8(v, _mm_set1_epi8(0)); // vh = { 15, 14, 13, 12, 11, 10, 9, 8 }

where v is a vector of 16 x 8-bit values, and vl, vh are the two unpacked vectors of 8 x 16-bit values.

Please note that I assume the 8-bit values are unsigned, so when unpacking to 16 bits the high byte is set to 0 (i.e., no sign extension).
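
For the case in the question (adding a byte array into a short array), the unpack step can be combined with a 16-bit add. The following is only a minimal sketch under some assumptions: the helper name add_u8_to_i16 is made up for illustration, the length is assumed to be a multiple of 16, unaligned loads and stores are used so no alignment is required, and 16-bit sums wrap on overflow.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst[i] += src[i] for n elements.
   Assumes n is a multiple of 16 and that the bytes are unsigned.
   The 16-bit additions wrap around on overflow. */
static void add_u8_to_i16(int16_t *dst, const uint8_t *src, size_t n)
{
    assert(n % 16 == 0);
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 16) {
        /* Load 16 bytes and zero-extend them into two vectors of 8 x 16 bits. */
        __m128i v  = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i lo = _mm_unpacklo_epi8(v, zero);  /* src[i+0] .. src[i+7]  */
        __m128i hi = _mm_unpackhi_epi8(v, zero);  /* src[i+8] .. src[i+15] */

        /* Load the matching 16 destination values, add, and store back. */
        __m128i d0 = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i d1 = _mm_loadu_si128((const __m128i *)(dst + i + 8));
        _mm_storeu_si128((__m128i *)(dst + i),     _mm_add_epi16(d0, lo));
        _mm_storeu_si128((__m128i *)(dst + i + 8), _mm_add_epi16(d1, hi));
    }
}
```

If the destination could overflow 16 bits, _mm_adds_epi16 (saturating add) could be swapped in for _mm_add_epi16, at the cost of clamping instead of wrapping.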

If you want to sum a lot of these vectors and get a 32-bit result, a useful trick is to use _mm_madd_epi16 with a multiplier of 1, for example:

    // x is assumed to be a 16-byte-aligned array of N bytes, N a multiple of 16
    __m128i vsuml = _mm_set1_epi32(0);
    __m128i vsumh = _mm_set1_epi32(0);
    __m128i vsum;
    int sum;

    for (int i = 0; i < N; i += 16)
    {
        __m128i v = _mm_load_si128((const __m128i *)&x[i]);
        __m128i vl = _mm_unpacklo_epi8(v, _mm_set1_epi8(0));
        __m128i vh = _mm_unpackhi_epi8(v, _mm_set1_epi8(0));
        // madd with 1 adds adjacent 16-bit pairs into 32-bit lanes
        vsuml = _mm_add_epi32(vsuml, _mm_madd_epi16(vl, _mm_set1_epi16(1)));
        vsumh = _mm_add_epi32(vsumh, _mm_madd_epi16(vh, _mm_set1_epi16(1)));
    }

    // do horizontal sum of 4 partial sums and store in scalar int
    vsum = _mm_add_epi32(vsuml, vsumh);
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    sum = _mm_cvtsi128_si32(vsum);
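
To see why the multiplier of 1 works: _mm_madd_epi16 multiplies corresponding 16-bit elements and adds each adjacent pair of products into one 32-bit lane, so multiplying by 1 turns it into a widening pairwise add. A small sketch of just that step (the helper name pairwise_sum_i16 is made up for illustration):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: adds adjacent pairs of eight 16-bit values,
   producing four 32-bit sums: out[k] = in[2k] + in[2k+1]. */
static void pairwise_sum_i16(const int16_t in[8], int32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    /* Each product is in[j] * 1; adjacent products are summed in 32 bits. */
    __m128i s = _mm_madd_epi16(v, _mm_set1_epi16(1));
    _mm_storeu_si128((__m128i *)out, s);
}
```

For the input { 1, 2, 3, 4, 5, 6, 7, 8 } this produces { 3, 7, 11, 15 }.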

If you need to sign-extend the byte vectors instead of zero-extending them, use pmovsxbw (_mm_cvtepi8_epi16, SSE4.1). Unlike the unpack hi/lo instructions, pmovsx can only take its input from the low half / quarter / eighth of the source register.

You can pmovsx directly from memory, though, although intrinsics make it really awkward. Since shuffle throughput is more limited than load throughput on most CPUs, two loads + pmovsx is probably preferable to one load + three shuffles.
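
A rough sketch of the two-loads approach, assuming GCC or Clang (the target attribute stands in for compiling with -msse4.1) and a made-up helper name widen_i8_to_i16. Each half is fetched with an 8-byte load and then sign-extended; whether the compiler folds the load into a memory-operand pmovsxbw is up to the compiler.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_cvtepi8_epi16 */
#include <stdint.h>

/* Hypothetical helper: sign-extends 16 signed bytes to 16 x 16-bit values.
   Uses two 8-byte loads instead of one 16-byte load plus extra shuffles. */
__attribute__((target("sse4.1")))
static void widen_i8_to_i16(const int8_t *src, int16_t *dst)
{
    /* _mm_loadl_epi64 reads 8 bytes into the low half of the register,
       which is the only part pmovsxbw looks at. */
    __m128i lo = _mm_cvtepi8_epi16(_mm_loadl_epi64((const __m128i *)src));
    __m128i hi = _mm_cvtepi8_epi16(_mm_loadl_epi64((const __m128i *)(src + 8)));
    _mm_storeu_si128((__m128i *)dst, lo);
    _mm_storeu_si128((__m128i *)(dst + 8), hi);
}
```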


Source: https://habr.com/ru/post/915991/

