SSE Instructions: Byte + Short

Question

I have very long byte arrays that need to be added to a destination array of type short (or int). Is there an SSE instruction for this, or perhaps a sequence of instructions?

+6

x86 sse instructions

dajuric May 17 '12 at 14:02

2 answers

If you need to sign-extend byte vectors instead of zero-extending them, use pmovsxbw ( _mm_cvtepi8_epi16 , SSE4.1). Unlike the unpack hi / lo instructions, pmovsx can only take its input from the low half / quarter / eighth of the source register.

You can pmovsx directly from memory, though intrinsics make that really awkward. Since shuffle throughput is more limited than load throughput on most CPUs, it is probably preferable to do two loads + pmovsx rather than one load + three shuffles.
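The two-loads approach can be sketched as follows. This is a minimal illustration, not code from the answer: the helper name `add_i8_to_i16` is hypothetical, the `target` attribute is GCC/Clang-specific (otherwise compile with `-msse4.1`), and whether the compiler folds each 8-byte load into the memory operand of `pmovsxbw` depends on the compiler and version.

```c
#include <stddef.h>
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1: _mm_cvtepi8_epi16 (pmovsxbw) */

/* Hypothetical helper: dst[i] += src[i] for signed bytes, 16 per iteration.
   Assumption: GCC or Clang; the attribute enables SSE4.1 for this function only. */
__attribute__((target("sse4.1")))
static void add_i8_to_i16(int16_t *dst, const int8_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        /* Two 8-byte loads, each feeding one pmovsxbw, instead of one
           16-byte load followed by extra shuffles. */
        __m128i lo8  = _mm_loadl_epi64((const __m128i *)(src + i));
        __m128i hi8  = _mm_loadl_epi64((const __m128i *)(src + i + 8));
        __m128i lo16 = _mm_cvtepi8_epi16(lo8);  /* sign-extend 8 bytes to 8 x 16-bit */
        __m128i hi16 = _mm_cvtepi8_epi16(hi8);

        /* Unaligned loads/stores so dst needs no particular alignment. */
        __m128i d0 = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i d1 = _mm_loadu_si128((const __m128i *)(dst + i + 8));
        _mm_storeu_si128((__m128i *)(dst + i),     _mm_add_epi16(d0, lo16));
        _mm_storeu_si128((__m128i *)(dst + i + 8), _mm_add_epi16(d1, hi16));
    }
    /* Scalar tail for the last n % 16 elements. */
    for (; i < n; i++)
        dst[i] = (int16_t)(dst[i] + src[i]);
}
```

The 16-bit additions wrap on overflow, so the caller is assumed to keep the accumulated values within the range of int16_t.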

0

Peter Cordes Jun 29 '16 at 13:17

Paul r · Accepted Answer · 2012-05-17T14:18:35+0000

You need to unpack each vector of 8-bit values into two vectors of 16-bit values, and then add them.

    __m128i v  = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
    __m128i vl = _mm_unpacklo_epi8(v, _mm_set1_epi8(0)); // vl = { 7, 6, 5, 4, 3, 2, 1, 0 }
    __m128i vh = _mm_unpackhi_epi8(v, _mm_set1_epi8(0)); // vh = { 15, 14, 13, 12, 11, 10, 9, 8 }

where v is a vector of 16 x 8-bit values, and vl , vh are the two unpacked vectors of 8 x 16-bit values.

Note that I am assuming the 8-bit values are unsigned, so when unpacking to 16 bits the high byte is set to 0 (i.e., no sign extension).

If you want to sum a lot of these vectors and get a 32-bit result, a useful trick is to use _mm_madd_epi16 with a multiplier of 1, e.g.
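To connect the unpacking step back to the question, here is a hedged sketch of adding an unsigned byte array into a 16-bit destination array using only SSE2. The helper name `add_u8_to_u16` and the use of unaligned loads and stores are assumptions made for illustration; they are not part of the answer.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: dst[i] += src[i] for unsigned bytes, 16 per iteration. */
static void add_u8_to_u16(uint16_t *dst, const uint8_t *src, size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v  = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i vl = _mm_unpacklo_epi8(v, zero);  /* bytes 0..7  -> 8 x 16-bit, zero-extended */
        __m128i vh = _mm_unpackhi_epi8(v, zero);  /* bytes 8..15 -> 8 x 16-bit, zero-extended */

        __m128i d0 = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i d1 = _mm_loadu_si128((const __m128i *)(dst + i + 8));
        _mm_storeu_si128((__m128i *)(dst + i),     _mm_add_epi16(d0, vl));
        _mm_storeu_si128((__m128i *)(dst + i + 8), _mm_add_epi16(d1, vh));
    }
    /* Scalar tail for the last n % 16 elements. */
    for (; i < n; i++)
        dst[i] = (uint16_t)(dst[i] + src[i]);
}
```

_mm_add_epi16 wraps on overflow, so a 16-bit destination can only absorb a limited number of byte additions before it needs widening to 32 bits.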

    // Assumes x is a 16-byte-aligned array of N unsigned bytes, with N a multiple of 16.
    __m128i vsuml = _mm_set1_epi32(0);
    __m128i vsumh = _mm_set1_epi32(0);
    __m128i vsum;
    int sum;

    for (int i = 0; i < N; i += 16)
    {
        __m128i v = _mm_load_si128((const __m128i *)&x[i]); // cast needed for a byte pointer
        __m128i vl = _mm_unpacklo_epi8(v, _mm_set1_epi8(0));
        __m128i vh = _mm_unpackhi_epi8(v, _mm_set1_epi8(0));
        // multiplying by 1 and adding adjacent pairs widens 16-bit values to 32-bit sums
        vsuml = _mm_add_epi32(vsuml, _mm_madd_epi16(vl, _mm_set1_epi16(1)));
        vsumh = _mm_add_epi32(vsumh, _mm_madd_epi16(vh, _mm_set1_epi16(1)));
    }

    // do horizontal sum of 4 partial sums and store in scalar int
    vsum = _mm_add_epi32(vsuml, vsumh);
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    sum = _mm_cvtsi128_si32(vsum);
