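You need to mask off the low nibble of each byte and shift the high nibble down into the low position. Because SSE has no per-byte shift instruction, the shift has to be done on 16-bit lanes, which lets bits from the neighboring byte leak in, so the shifted result must be masked as well.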
    __m128i b = _mm_load_si128((const __m128i*) ptr);
    __m128i mask = _mm_set1_epi8(0xf);
    __m128i lower = _mm_and_si128(b, mask);
    __m128i upper = _mm_and_si128(_mm_srli_epi16(b, 4), mask);
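For reference, here is a minimal self-contained sketch of the same idea wrapped in a function. The name `split_nibbles16` and its output buffers are hypothetical, not part of the snippet above, and it uses unaligned loads and stores (`_mm_loadu_si128` / `_mm_storeu_si128`) on the assumption that the caller's pointers may not be 16-byte aligned.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: splits 16 bytes at src into their low nibbles (lo)
   and high nibbles (hi), one nibble per output byte. */
static void split_nibbles16(const uint8_t *src, uint8_t *lo, uint8_t *hi)
{
    assert(src && lo && hi);

    /* Unaligned load, so src needs no particular alignment (assumption). */
    __m128i b = _mm_loadu_si128((const __m128i *)src);
    const __m128i mask = _mm_set1_epi8(0x0f);

    /* Low nibbles: just clear the top 4 bits of every byte. */
    __m128i lower = _mm_and_si128(b, mask);

    /* High nibbles: the shift works on 16-bit lanes, so the low bits of the
       upper byte spill into the top of the lower byte; the mask removes them. */
    __m128i upper = _mm_and_si128(_mm_srli_epi16(b, 4), mask);

    _mm_storeu_si128((__m128i *)lo, lower);
    _mm_storeu_si128((__m128i *)hi, upper);
}
```

For an input byte of 0xA7, the corresponding output bytes are 0x07 in `lo` and 0x0A in `hi`. If the data is known to be 16-byte aligned, the aligned `_mm_load_si128` from the snippet above can be used instead.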