Is it possible to influence how GCC / Clang / Intel ICPC allocate XMM / YMM registers?

I have a highly optimized function, called many times in an inner loop, written with SSE2 / AVX2 intrinsics. After some refinement, I am now close to the theoretical best performance (based on instruction latency and throughput). However, the performance is not entirely portable. The problem is that there are more than 16 __m128i / __m256i variables. Of course, only 16 of them can be allocated in registers, and the rest go on the stack. The function looks more or less like the following,

void eval(size_t n, __m128i *rk /* other inputs */)
{
    __m128i xmmk0 = rk[0];
    // ...
    __m128i xmmk6 = rk[6];
    __m128i xmmk[Rounds - 6];
    // copy rk[7] through rk[Rounds] into xmmk

    while (n >= 8) {
        n -= 8;

        __m128i xmm0 = /* initialize state xmm0 */;
        // do the same for xmm1 - xmm7

        // round 0
        xmm0 = /* a few instructions involving xmm0 and xmmk0 */;
        // do the same for xmm1 - xmm7

        // do the same for round 1 to 6, using xmmk1, ..., xmmk6

        // round 7, copy xmmk[0] to a temporary __m128i variable
        xmm0 = /* a few instructions involving xmm0 and xmmk[0] */;
        // do the same for xmm1 - xmm7

        // do the same for round 8 to Rounds, using xmmk[1], ..., xmmk[Rounds - 7]
    }
}

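The loop is written so that 16 __m128i values are live at once: the eight state variables xmm0 to xmm7, the seven round keys xmmk0 to xmmk6, and one temporary that receives the round keys from round 7 onward, which are loaded from memory. GCC/clang and Intel ICPC do not treat this the same way, in particular for xmm0 to xmm7. Alternatively, all round keys can be declared as a single array,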

__m128i xmmk[Rounds + 1]; // copy from input rk
// let the compiler figure out which of them go on the stack and which in registers

GCC/ICPC , clang , ICPC .
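To compare the two layouts outside the real code, a reduced, self-contained version can help. The sketch below is only an illustration under assumptions that are not from the original function: `toy_round` is a made-up stand-in for the real round, `Rounds = 10`, only three keys are held in named locals, and two blocks are processed per iteration instead of eight. `eval_split`, `eval_array` and `layouts_agree` are hypothetical names.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cassert>
#include <cstddef>

constexpr std::size_t Rounds = 10; // assumed value; rk holds Rounds + 1 keys
constexpr std::size_t Named = 3;   // assumed number of keys held in named locals

// Hypothetical stand-in for the real round function.
static inline __m128i toy_round(__m128i x, __m128i k)
{
    return _mm_xor_si128(_mm_add_epi32(x, k), k);
}

// Layout A: first keys in named locals, remaining keys in a local array.
void eval_split(std::size_t n, const __m128i *rk, __m128i *state)
{
    assert(n % 2 == 0); // two blocks per iteration
    const __m128i k0 = rk[0], k1 = rk[1], k2 = rk[2];
    __m128i kr[Rounds + 1 - Named];
    for (std::size_t i = Named; i <= Rounds; ++i)
        kr[i - Named] = rk[i];

    for (std::size_t b = 0; b < n; b += 2) {
        __m128i x0 = state[b], x1 = state[b + 1];
        x0 = toy_round(x0, k0); x1 = toy_round(x1, k0);
        x0 = toy_round(x0, k1); x1 = toy_round(x1, k1);
        x0 = toy_round(x0, k2); x1 = toy_round(x1, k2);
        for (std::size_t i = 0; i < Rounds + 1 - Named; ++i) {
            const __m128i t = kr[i]; // one temporary per later round
            x0 = toy_round(x0, t); x1 = toy_round(x1, t);
        }
        state[b] = x0; state[b + 1] = x1;
    }
}

// Layout B: every key in one array; the allocation is left to the compiler.
void eval_array(std::size_t n, const __m128i *rk, __m128i *state)
{
    assert(n % 2 == 0);
    __m128i k[Rounds + 1];
    for (std::size_t i = 0; i <= Rounds; ++i)
        k[i] = rk[i];

    for (std::size_t b = 0; b < n; b += 2) {
        __m128i x0 = state[b], x1 = state[b + 1];
        for (std::size_t i = 0; i <= Rounds; ++i) {
            x0 = toy_round(x0, k[i]); x1 = toy_round(x1, k[i]);
        }
        state[b] = x0; state[b + 1] = x1;
    }
}

// Returns true when both layouts produce identical output on a fixed input.
bool layouts_agree()
{
    __m128i rk[Rounds + 1], a[4], b[4];
    for (std::size_t i = 0; i <= Rounds; ++i)
        rk[i] = _mm_set1_epi32(static_cast<int>(i) * 7919 + 1);
    for (int j = 0; j < 4; ++j)
        a[j] = b[j] = _mm_set_epi32(j, j + 1, j + 2, j + 3);

    eval_split(4, rk, a);
    eval_array(4, rk, b);
    for (int j = 0; j < 4; ++j)
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(a[j], b[j])) != 0xFFFF)
            return false;
    return true;
}
```

Both functions compute the same result, so any difference between them is purely in code generation. Compiling with `-O2 -S` and looking for vector loads and stores that address the stack shows which values each compiler chose to spill.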

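So, is there a way to tell the compiler which __m128i variables should be kept in registers and which may be placed on the stack?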
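One kind of hint that exists in GCC and Clang is an empty extended-asm statement with an `x` (SSE register) operand. The sketch below is only an illustration, not a confirmed fix for the function above: `keep_in_xmm`, `mix_hinted`, `mix_plain` and `self_check` are hypothetical names, and the round is a toy one. The statement only requires the value to be in an XMM register at that point; the allocator may still spill it elsewhere, so it is a nudge rather than a guarantee.

```cpp
#include <emmintrin.h> // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: an empty asm statement whose "+x" operand requires v
// to sit in an SSE register here. It emits no instructions of its own.
static inline void keep_in_xmm(__m128i &v)
{
#if defined(__GNUC__) || defined(__clang__)
    __asm__ volatile("" : "+x"(v));
#else
    (void)v; // no hint available; behaviour is unchanged
#endif
}

// Toy loop with the hint applied to both keys on every iteration.
std::uint32_t mix_hinted(std::uint32_t seed, std::size_t n)
{
    __m128i k0 = _mm_set1_epi32(static_cast<int>(seed));
    __m128i k1 = _mm_set1_epi32(static_cast<int>(seed + 1));
    __m128i x = _mm_setzero_si128();
    for (std::size_t i = 0; i < n; ++i) {
        keep_in_xmm(k0);
        keep_in_xmm(k1);
        x = _mm_add_epi32(_mm_xor_si128(x, k0), k1);
    }
    return static_cast<std::uint32_t>(_mm_cvtsi128_si32(x));
}

// Same loop without the hint, used as a reference.
std::uint32_t mix_plain(std::uint32_t seed, std::size_t n)
{
    const __m128i k0 = _mm_set1_epi32(static_cast<int>(seed));
    const __m128i k1 = _mm_set1_epi32(static_cast<int>(seed + 1));
    __m128i x = _mm_setzero_si128();
    for (std::size_t i = 0; i < n; ++i)
        x = _mm_add_epi32(_mm_xor_si128(x, k0), k1);
    return static_cast<std::uint32_t>(_mm_cvtsi128_si32(x));
}

// The hint must not change results; only register allocation may differ.
void self_check()
{
    for (std::size_t n = 0; n < 16; ++n)
        assert(mix_hinted(12345u, n) == mix_plain(12345u, n));
}
```

Whether this actually changes which variables get spilled has to be checked in the generated assembly for each compiler and version.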

ASM, , . , , . ++ .

, , . - - L1. , , , - 20%. , - , . , . , , . , , xmm0 xmm7.


Source: https://habr.com/ru/post/1660598/

