You need to unpack each vector of 8-bit values into two vectors of 16-bit values, and then add them.
    __m128i v  = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
    __m128i vl = _mm_unpacklo_epi8(v, _mm_set1_epi8(0)); // vl = { 7, 6, 5, 4, 3, 2, 1, 0 }
    __m128i vh = _mm_unpackhi_epi8(v, _mm_set1_epi8(0)); // vh = { 15, 14, 13, 12, 11, 10, 9, 8 }
where v is a vector of 16 x 8-bit values, and vl, vh are the two unpacked vectors of 8 x 16-bit values.
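To make the "unpack, then add" step concrete, here is a minimal sketch that adds two vectors of 16 unsigned bytes and keeps the full 16-bit results. The helper name add_u8_widen and its array parameters are made up for illustration; only the SSE2 intrinsics are real.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: out[i] = a[i] + b[i] for 16 unsigned bytes,
   computed in 16-bit lanes so that sums above 255 do not wrap. */
static void add_u8_widen(const uint8_t a[16], const uint8_t b[16], uint16_t out[16])
{
    const __m128i zero = _mm_setzero_si128();
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* unaligned loads */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* Interleaving with zero puts 0 in the high byte of each 16-bit lane. */
    __m128i lo = _mm_add_epi16(_mm_unpacklo_epi8(va, zero), _mm_unpacklo_epi8(vb, zero));
    __m128i hi = _mm_add_epi16(_mm_unpackhi_epi8(va, zero), _mm_unpackhi_epi8(vb, zero));

    _mm_storeu_si128((__m128i *)out, lo);        /* elements 0..7  */
    _mm_storeu_si128((__m128i *)(out + 8), hi);  /* elements 8..15 */
}
```

Because the low and high halves are not shuffled across each other, out[i] lines up with a[i] and b[i] in memory order.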
Note that I assume the 8-bit values are unsigned, so when widening to 16 bits the high byte is set to 0 (i.e. no sign extension).
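If the bytes were signed instead, interleaving with zero would give wrong results for negative values. One common SSE2 approach, shown here only as a sketch, is to interleave with a per-byte sign mask rather than with zero. The helper name widen_i8_to_i16 is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: widen 16 signed bytes into two vectors of
   8 signed 16-bit values (sign extension instead of zero extension). */
static void widen_i8_to_i16(__m128i v, __m128i *lo, __m128i *hi)
{
    /* sign is 0xFF in every byte where v is negative, 0x00 elsewhere. */
    __m128i sign = _mm_cmpgt_epi8(_mm_setzero_si128(), v);

    /* The sign byte becomes the high byte of each 16-bit lane. */
    *lo = _mm_unpacklo_epi8(v, sign);
    *hi = _mm_unpackhi_epi8(v, sign);
}
```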
If you want to sum many of these vectors and get a 32-bit result, a useful trick is to use _mm_madd_epi16 with a multiplier of 1. It multiplies each 16-bit element by 1 and adds adjacent pairs, giving four 32-bit partial sums per vector, e.g.:
    // x is assumed to point to N unsigned bytes, 16-byte aligned, with N a multiple of 16
    __m128i vsuml = _mm_set1_epi32(0);
    __m128i vsumh = _mm_set1_epi32(0);
    __m128i vsum;
    int sum;

    for (int i = 0; i < N; i += 16)
    {
        __m128i v = _mm_load_si128((const __m128i *)&x[i]);
        __m128i vl = _mm_unpacklo_epi8(v, _mm_set1_epi8(0));
        __m128i vh = _mm_unpackhi_epi8(v, _mm_set1_epi8(0));
        vsuml = _mm_add_epi32(vsuml, _mm_madd_epi16(vl, _mm_set1_epi16(1)));
        vsumh = _mm_add_epi32(vsumh, _mm_madd_epi16(vh, _mm_set1_epi16(1)));
    }

    // do horizontal sum of 4 partial sums and store in scalar int
    vsum = _mm_add_epi32(vsuml, vsumh);
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    sum = _mm_cvtsi128_si32(vsum);
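For reference, the same loop can be wrapped into a self-contained function. This is a sketch under a few assumptions that are not in the snippet above: the function name sum_u8 is made up, it uses unaligned loads so the buffer does not need 16-byte alignment, and a scalar tail loop handles lengths that are not a multiple of 16. The total must fit in 32 bits.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum n unsigned bytes into a 32-bit total. */
static uint32_t sum_u8(const uint8_t *x, size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i ones = _mm_set1_epi16(1);
    __m128i vsuml = zero;
    __m128i vsumh = zero;
    size_t i = 0;

    /* Main loop: 16 bytes per iteration. */
    for (; i + 16 <= n; i += 16) {
        __m128i v  = _mm_loadu_si128((const __m128i *)&x[i]);
        __m128i vl = _mm_unpacklo_epi8(v, zero);  /* bytes 0..7  as 16-bit */
        __m128i vh = _mm_unpackhi_epi8(v, zero);  /* bytes 8..15 as 16-bit */
        /* madd by 1 adds adjacent 16-bit pairs into 32-bit lanes. */
        vsuml = _mm_add_epi32(vsuml, _mm_madd_epi16(vl, ones));
        vsumh = _mm_add_epi32(vsumh, _mm_madd_epi16(vh, ones));
    }

    /* Horizontal sum of the four 32-bit partial sums. */
    __m128i vsum = _mm_add_epi32(vsuml, vsumh);
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 8));
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));
    uint32_t sum = (uint32_t)_mm_cvtsi128_si32(vsum);

    /* Scalar tail for the remaining 0..15 bytes. */
    for (; i < n; i++) {
        sum += x[i];
    }
    return sum;
}
```

Hoisting the zero and ones constants out of the loop is purely cosmetic here; compilers typically do this anyway.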