I have an array, called A, which contains 32 unsigned char values.
I want to unpack these values into 4 __m256 variables according to the following rule: with an index from 0 to 31 into A, the 4 unpacked variables hold the following values:
B_0 = A[0], A[4], A[8], A[12], A[16], A[20], A[24], A[28]
B_1 = A[1], A[5], A[9], A[13], A[17], A[21], A[25], A[29]
B_2 = A[2], A[6], A[10], A[14], A[18], A[22], A[26], A[30]
B_3 = A[3], A[7], A[11], A[15], A[19], A[23], A[27], A[31]
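To make the rule concrete, here is a minimal scalar sketch of the same mapping, B_k[i] = A[4*i + k] converted to float. The name `unpack_reference` and the use of `std::vector` for A are assumptions for illustration only. It is meant as a reference to check a vectorized version against, not as a fast path.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference helper (not part of the code in this post):
// B[k][i] = A[4 * i + k], converted to float, for k in 0..3 and i in 0..7.
std::array<std::array<float, 8>, 4> unpack_reference(const std::vector<std::uint8_t>& A)
{
    assert(A.size() >= 32);  // the rule reads exactly the first 32 values
    std::array<std::array<float, 8>, 4> B{};
    for (std::size_t i = 0; i < 8; ++i) {
        for (std::size_t k = 0; k < 4; ++k) {
            B[k][i] = static_cast<float>(A[4 * i + k]);
        }
    }
    return B;
}
```

Here `B[0]` corresponds to B_0, `B[1]` to B_1, and so on.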
For this, I have this code:
const auto mask = _mm256_set1_epi32( 0x000000FF );
...
const auto A_values = _mm256_i32gather_epi32(reinterpret_cast<const int*>(A.data()), A_positions.values_, 4);
const auto B_0 = _mm256_cvtepi32_ps(_mm256_and_si256(_mm256_srai_epi32(A_values, 24), mask));
const auto B_1 = _mm256_cvtepi32_ps(_mm256_and_si256(_mm256_srai_epi32(A_values, 16), mask));
const auto B_2 = _mm256_cvtepi32_ps(_mm256_and_si256(_mm256_srai_epi32(A_values, 8), mask));
const auto B_3 = _mm256_cvtepi32_ps(_mm256_and_si256(_mm256_srai_epi32(A_values, 0), mask));
This works fine, but I wonder if there is a faster way to do this, especially regarding the right shift and the AND operation that I use to extract the values.
Note: array A holds 32 values here (each gathered element is 4 uint8_t), read from the array with _mm256_i32gather_epi32.
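One possible direction, offered as a sketch under assumptions rather than a measured improvement: the lowest byte needs no shift, and a logical shift (`_mm256_srli_epi32`) by 24 already leaves only the top byte, so that lane needs no mask. On little-endian x86, byte A[4*i] is the low byte of 32-bit lane i, so in this sketch B_0 uses no shift and B_3 uses the shift by 24. The helper name `unpack_bytes_to_floats`, its pointer-based signature, and the plain `_mm256_loadu_si256` load (which assumes the 32 bytes are contiguous) are all illustrative; with a gather, only the load line would change.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>

// GCC/Clang: enable AVX2 for this one function. Other compilers are assumed
// to enable AVX2 through their own build flags.
#if defined(__GNUC__)
#define UNPACK_TARGET_AVX2 __attribute__((target("avx2")))
#else
#define UNPACK_TARGET_AVX2
#endif

// Hypothetical helper: reads 32 contiguous bytes from A and writes
// B_k[i] = float(A[4 * i + k]) into the four 8-float output buffers.
UNPACK_TARGET_AVX2
void unpack_bytes_to_floats(const std::uint8_t* A,
                            float* B0, float* B1, float* B2, float* B3)
{
    assert(A && B0 && B1 && B2 && B3);

    const __m256i mask = _mm256_set1_epi32(0x000000FF);

    // Assumption: contiguous input, so an unaligned load replaces the gather.
    const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(A));

    // Low byte of each lane: mask only, no shift.
    const __m256i b0 = _mm256_and_si256(v, mask);
    // Middle bytes: logical shift, then mask.
    const __m256i b1 = _mm256_and_si256(_mm256_srli_epi32(v, 8), mask);
    const __m256i b2 = _mm256_and_si256(_mm256_srli_epi32(v, 16), mask);
    // Top byte: a logical shift by 24 shifts in zeros, so no mask is needed.
    const __m256i b3 = _mm256_srli_epi32(v, 24);

    _mm256_storeu_ps(B0, _mm256_cvtepi32_ps(b0));
    _mm256_storeu_ps(B1, _mm256_cvtepi32_ps(b1));
    _mm256_storeu_ps(B2, _mm256_cvtepi32_ps(b2));
    _mm256_storeu_ps(B3, _mm256_cvtepi32_ps(b3));
}
```

This drops one shift and one AND compared with four shift-and-mask pairs. Whether that is noticeable next to the cost of the gather and the conversions would need to be measured on the target CPU.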