I have a highly optimized function, called many times in an inner loop, written with SSE2/AVX2 intrinsics. After some tuning, I am now approaching the theoretical best performance (based on the latency and throughput of the instructions). However, the performance is not entirely portable. The problem is that there are more than 16 __m128i/__m256i variables. Of course, only 16 of them can be allocated in registers; the rest go on the stack. The function looks more or less like the following:
void eval(size_t n, __m128i *rk)
{
    __m128i xmmk0 = rk[0];
    // ...
    __m128i xmmk6 = rk[6];
    __m128i xmmk[Rounds - 6]; // remaining round keys
    while (n >= 8) {
        n -= 8;
        __m128i xmm0 = /* ... */;
        // ...
        xmm0 = /* ... */;
        xmm0 = /* ... */;
    }
}
16 __m128i . , , - xmm0 xmm7, , xmmk0 xmmk6, 7 , , . , , GCC/clang , Intel ICPC xmm0 to xmm7 . ,
__m128i xmmk[Rounds + 1]; // copy from input rk
// let the compiler figure out which of them are allocated on the stack and which in registers
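For concreteness, a minimal compilable sketch of this array-based variant might look like the following. The XOR in `round_step`, the value `Rounds = 10`, the name `eval_sketch`, and the `data` parameter are placeholders assumed purely for illustration; they are not the real computation, which is left out above.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2: __m128i, _mm_loadu_si128, _mm_xor_si128, _mm_storeu_si128

constexpr std::size_t Rounds = 10;  // hypothetical value, assumed for this sketch
constexpr std::size_t Lanes = 8;    // blocks handled per loop iteration

// Hypothetical stand-in for one round: XOR the state with a round key.
static inline __m128i round_step(__m128i state, __m128i key)
{
    return _mm_xor_si128(state, key);
}

void eval_sketch(std::size_t n, const __m128i *rk, __m128i *data)
{
    assert(rk != nullptr && data != nullptr);

    // Copy all Rounds + 1 keys into a local array; the compiler decides
    // which of them stay in registers and which are spilled to the stack.
    __m128i xmmk[Rounds + 1];
    for (std::size_t r = 0; r <= Rounds; ++r)
        xmmk[r] = _mm_loadu_si128(rk + r);

    while (n >= Lanes) {
        n -= Lanes;

        // Eight independent states, the role played by xmm0 to xmm7.
        __m128i xmm[Lanes];
        for (std::size_t i = 0; i < Lanes; ++i)
            xmm[i] = _mm_loadu_si128(data + i);

        // Apply every round key to every state.
        for (std::size_t r = 0; r <= Rounds; ++r)
            for (std::size_t i = 0; i < Lanes; ++i)
                xmm[i] = round_step(xmm[i], xmmk[r]);

        for (std::size_t i = 0; i < Lanes; ++i)
            _mm_storeu_si128(data + i, xmm[i]);
        data += Lanes;
    }
}
```

Compiling this with each compiler and reading the generated assembly (for example with `-O2 -S`) is one way to see which of the keys and states were kept in registers. Like the loop above, the sketch ignores any trailing blocks when `n` is not a multiple of 8.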
GCC/ICPC , clang , ICPC .
, __m128i , .
ASM, , . , , . ++ .
, , . - - L1. , , , - 20%. , - , . , . , , . , , xmm0 xmm7.