Unfortunately, there is no instruction for this, even in AVX (none that I know of). So you will have to do it manually, as you are doing now.
However, your current method is suboptimal, and it relies on .m128i_u8, which is an MSVC extension. In my experience, MSVC implements it by going through an aligned buffer in memory to access the individual elements. That carries a very large penalty because of the partial-word accesses.
Instead of .m128i_u8, use _mm_extract_epi32(). It requires SSE4.1, but you are already relying on SSE4.1 by using _mm_cvtepu8_epi32().
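As a minimal sketch of that suggestion (not the code from the question): the helper name extract_lanes and its signature are made up for illustration. It widens four bytes with _mm_cvtepu8_epi32() and then reads each 32-bit lane with _mm_extract_epi32() instead of going through .m128i_u8. The target attribute is only there so GCC and Clang accept the SSE4.1 intrinsics without -msse4.1; MSVC does not need it.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_cvtepu8_epi32, _mm_extract_epi32 */
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

/* Hypothetical helper: zero-extend four bytes to 32-bit lanes, then pull
 * each lane out into a scalar without the MSVC-only .m128i_u8 member. */
TARGET_SSE41
static void extract_lanes(const uint8_t bytes[4], uint32_t out[4])
{
    int32_t packed;
    memcpy(&packed, bytes, sizeof packed);      /* safe unaligned 4-byte load */

    __m128i v    = _mm_cvtsi32_si128(packed);   /* bytes in the low 32 bits */
    __m128i wide = _mm_cvtepu8_epi32(v);        /* 4 x u8 -> 4 x u32 */

    /* The lane index must be a compile-time constant. */
    out[0] = (uint32_t)_mm_extract_epi32(wide, 0);
    out[1] = (uint32_t)_mm_extract_epi32(wide, 1);
    out[2] = (uint32_t)_mm_extract_epi32(wide, 2);
    out[3] = (uint32_t)_mm_extract_epi32(wide, 3);
}
```

Each extracted value can then be used as an ordinary scalar, for example as an array index.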
This situation is especially bad because you are working with 1-byte granularity. If you were working with 2-byte (16-bit integer) granularity instead, there would be an efficient solution using the shuffle intrinsics.
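The answer does not spell out which shuffle it has in mind, so the following only illustrates what shuffling at 16-bit granularity looks like; it is not the solution to the original problem. The helper name reverse_u16x8 is hypothetical. It uses the SSE2 intrinsics _mm_shufflelo_epi16(), _mm_shufflehi_epi16() and _mm_shuffle_epi32() to reverse eight 16-bit elements entirely in registers.

```c
#include <emmintrin.h>  /* SSE2: _mm_shufflelo_epi16, _mm_shufflehi_epi16 */
#include <stdint.h>

/* Hypothetical helper: reverse the eight 16-bit elements of src into dst. */
static void reverse_u16x8(const uint16_t src[8], uint16_t dst[8])
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);

    /* Reverse the four 16-bit elements inside each 64-bit half. */
    v = _mm_shufflelo_epi16(v, _MM_SHUFFLE(0, 1, 2, 3));
    v = _mm_shufflehi_epi16(v, _MM_SHUFFLE(0, 1, 2, 3));

    /* Swap the two 64-bit halves to finish the full reversal. */
    v = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));

    _mm_storeu_si128((__m128i *)dst, v);
}
```

These shuffles take their control as an immediate, so the permutation has to be known at compile time.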