Why do two consecutive gather instructions perform worse than equivalent elementary operations?

I am upgrading code from SSE to AVX2. In general, I find the gather instructions quite useful and beneficial to performance. However, I came across a case where two gather instructions in sequence are less efficient than decomposing the gather operations into simpler ones.

In the code below, I have an int32 vector b, a vector of double xi, and 4 int32 indices packed in the 128-bit register bidx. I first need to gather from the vector b and then from the vector xi. That is, in pseudo code, I need to do:

__m128i i = b[bidx];
__m256d x = xi[i];

In the function below, I implement this in two ways, selected with an #if: via gather instructions, which achieves a throughput of 290 M iterations/s, and via elementary operations, which achieves 325 M iterations/s.

Can someone explain what is happening? Thanks

#include <immintrin.h>  // AVX2 intrinsics
#include <cstdint>

inline void resolve( const __m256d& z, const __m128i& bidx, int32_t j
                    , const int32_t *b, const double *xi, int32_t* ri )
{

    __m256d x;
    __m128i i;

#if 0  // this code uses two gather instructions in sequence

    i = _mm_i32gather_epi32(b, bidx, 4);   // i = b[bidx]
    x = _mm256_i32gather_pd(xi, i, 8);     // x = xi[i]

#else  // this code does not use gather instructions

    union {
            __m128i vec;
            int32_t i32[4];
    } u;
    x = _mm256_set_pd
            ( xi[(u.i32[3] = b[_mm_extract_epi32(bidx,3)])]
            , xi[(u.i32[2] = b[_mm_extract_epi32(bidx,2)])]
            , xi[(u.i32[1] = b[_mm_extract_epi32(bidx,1)])]
            , xi[(u.i32[0] = b[_mm_cvtsi128_si32(bidx)  ])]
            );
    i = u.vec;

#endif

    // here we use x and i
    // compare z < x per lane: each 64-bit lane becomes all-ones or all-zeros
    __m256  ps256 = _mm256_castpd_ps(_mm256_cmp_pd(z, x, _CMP_LT_OS));
    // narrow the four 64-bit masks to four 32-bit lanes in one 128-bit register
    __m128  lo128 = _mm256_castps256_ps128(ps256);
    __m128  hi128 = _mm256_extractf128_ps(ps256, 1);
    __m128  blend = _mm_shuffle_ps(lo128, hi128, 0 + (2<<2) + (0<<4) + (2<<6));
    __m128i lt    = _mm_castps_si128(blend);  // each lane is 0 or -1
    i = _mm_add_epi32(i, lt);                 // i -= 1 where z < x
    _mm_storeu_si128(reinterpret_cast<__m128i*>(ri)+j, i);
}
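
For reference, throughput figures like the 290 and 325 M iterations/s above can be obtained with a harness along the following lines. This is a minimal sketch: the function name mega_iters_per_sec, the packet arrays, and the timing method are assumptions, not taken from the original post.

#include <chrono>

// Minimal timing harness (hypothetical; the data setup and iteration
// count are assumed, not from the original post).
inline double mega_iters_per_sec( const __m256d* z, const __m128i* bidx
                                , int32_t n, const int32_t* b
                                , const double* xi, int32_t* ri )
{
    auto t0 = std::chrono::steady_clock::now();
    for (int32_t j = 0; j < n; ++j)
        resolve(z[j], bidx[j], j, b, xi, ri);
    auto t1 = std::chrono::steady_clock::now();
    // millions of resolve() calls per second
    return n / std::chrono::duration<double>(t1 - t0).count() / 1e6;
}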
1 answer

Since your "Allow" function is marked as built-in, I assume that it called in a high-frequency cycle. Then you can also look at the dependencies of the input parameters from each other outside the "enable" function. The compiler can optimize inline code better across loop boundaries by using the scalar code option.


Source: https://habr.com/ru/post/1658465/

