The fastest way to unpack 8-bit values from 32-bit values (__m256i) to __m256 with AVX2

I have an array, called A, which contains 32 unsigned char values.

I want to unpack these values into 4 __m256 variables according to the following rule. With the values of A indexed from 0 to 31, the four unpacked variables hold the following values:

B_0 = A[0], A[4],  A[8], A[12], A[16], A[20], A[24], A[28]
B_1 = A[1], A[5],  A[9], A[13], A[17], A[21], A[25], A[29]
B_2 = A[2], A[6], A[10], A[14], A[18], A[22], A[26], A[30]
B_3 = A[3], A[7], A[11], A[15], A[19], A[23], A[27], A[31]
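
For reference, the rule above can be written as a plain scalar loop. This is only a sketch: the function name unpack_scalar and the float[4][8] output layout are assumptions made for illustration and do not come from the original code.

```cpp
#include <cassert>
#include <cstdint>

// Scalar reference for the rule B_k[j] = A[4*j + k], with k in 0..3 and j in 0..7.
// `unpack_scalar` and the B[4][8] layout are hypothetical, for illustration only.
void unpack_scalar(const std::uint8_t* A, float B[4][8]) {
    assert(A != nullptr && B != nullptr);  // A must point to 32 bytes
    for (int k = 0; k < 4; ++k) {
        for (int j = 0; j < 8; ++j) {
            // Each byte is zero-extended and converted to float.
            B[k][j] = static_cast<float>(A[4 * j + k]);
        }
    }
}
```

Any vectorized version can be checked against this loop element by element.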

For this, I have this code:

const auto mask = _mm256_set1_epi32( 0x000000FF );
...
const auto A_values = _mm256_i32gather_epi32(reinterpret_cast<const int*>(A.data()), A_positions.values_, 4);

// The code below is equivalent to B_0 = static_cast<float>((A_values >> 24) & 0x000000FF)
const auto B_0 = _mm256_cvtepi32_ps(_mm256_and_si256(_mm256_srai_epi32(A_values, 24), mask));
const auto B_1 = _mm256_cvtepi32_ps(_mm256_and_si256(_mm256_srai_epi32(A_values, 16), mask));
const auto B_2 = _mm256_cvtepi32_ps(_mm256_and_si256(_mm256_srai_epi32(A_values, 8), mask));
const auto B_3 = _mm256_cvtepi32_ps(_mm256_and_si256(_mm256_srai_epi32(A_values, 0), mask));

This works fine, but I wonder if there is a faster way to do this, especially regarding the right shift and the AND operation that I use to extract the values.

The 32 values of array A are read as 32-bit elements (4 uint8_t each) with _mm256_i32gather_epi32.

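The shift/AND pair can be merged into a single vpshufb. That needs a shuffle-control vector, which costs little as long as it stays in a register.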
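
A minimal sketch of the vpshufb idea, under several assumptions: the 32 bytes are contiguous and read with a plain unaligned load rather than the gather from the question, byte k of each 32-bit element goes to B_k as in the table in the question, and the names byte_select_control, unpack_bytes_shuffle and AVX2_FN are hypothetical. It is not a benchmarked implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <immintrin.h>

// On GCC/Clang, enable AVX2 for these functions even without -mavx2.
#if defined(__GNUC__) || defined(__clang__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

// Hypothetical helper: vpshufb control that moves byte k (0..3) of every
// 32-bit element into the low byte of that element and zeroes the other
// three bytes (a control byte with its high bit set produces 0).
AVX2_FN static inline __m256i byte_select_control(int k) {
    const int zero3 = static_cast<int>(0x80808000u);
    // vpshufb indexes within each 128-bit lane, so both lanes use 0..15.
    return _mm256_setr_epi32(zero3 | (0 + k), zero3 | (4 + k),
                             zero3 | (8 + k), zero3 | (12 + k),
                             zero3 | (0 + k), zero3 | (4 + k),
                             zero3 | (8 + k), zero3 | (12 + k));
}

// Unpacks 32 bytes into four groups of 8 floats: b[k][j] = a[4*j + k].
AVX2_FN void unpack_bytes_shuffle(const std::uint8_t* a, float b[4][8]) {
    assert(a != nullptr && b != nullptr);  // a must point to 32 bytes
    const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
    for (int k = 0; k < 4; ++k) {
        // One shuffle replaces the shift and the AND of the original code.
        const __m256i dwords = _mm256_shuffle_epi8(v, byte_select_control(k));
        _mm256_storeu_ps(b[k], _mm256_cvtepi32_ps(dwords));
    }
}
```

In a real loop the four control vectors would be built once outside the loop so that they stay in registers.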

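On Intel the gain is less clear: the shift has a reciprocal throughput of 0.5 and the AND 0.33, both better than the 1 of a shuffle (on Intel AVX2 processors, shuffles go to P5 only). The shuffle is still fewer μops, so whether it pays off depends on the surrounding code. If that code mostly uses P01 (typical for FP SIMD), moving μops to P5 is likely to help.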

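On Ryzen the port assignments differ. A 256b vpsrad is 2 μops that both go to port 2 (plus the μops for the vpand, which can go to any ALU port), while a 256b vpshufb is 2 μops that can go to ports 1 and 2. The shuffle is also fewer μops overall, spread over P12.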

Source: https://habr.com/ru/post/1683453/

