Beating the compiler

I am trying to use Intel intrinsics to beat the compiler's optimized code. Sometimes I can do it, sometimes I can't.

I think the question is: why do I sometimes beat the compiler, but in other cases not? I got a time of 0.006 s for operator+= using Intel intrinsics (vs 0.009 s in plain C++), but a time of 0.07 s for operator+ using intrinsics, while plain C++ took only 0.03 s.

```cpp
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>   // rand, RAND_MAX
#include <intrin.h>

class Timer {
    LARGE_INTEGER startTime;
    double fFreq;
public:
    Timer() {
        LARGE_INTEGER freq;
        QueryPerformanceFrequency(&freq);
        fFreq = (double)freq.QuadPart;
        reset();
    }
    void reset() { QueryPerformanceCounter(&startTime); }
    double getTime() {
        LARGE_INTEGER endTime;
        QueryPerformanceCounter(&endTime);
        return (endTime.QuadPart - startTime.QuadPart) / fFreq; // as double
    }
};

inline float randFloat() { return (float)rand() / RAND_MAX; }

// Use my optimized code
#define OPTIMIZED_PLUS_EQUALS
#define OPTIMIZED_PLUS

union Vector {
    struct { float x, y, z, w; };
    __m128 reg;

    Vector() : x(0.f), y(0.f), z(0.f), w(0.f) {}
    Vector(float ix, float iy, float iz, float iw) : x(ix), y(iy), z(iz), w(iw) {}
    //Vector( __m128 val ):x(val.m128_f32[0]),y(val.m128_f32[1]),z(val.m128_f32[2]),w(val.m128_f32[3]) {}
    Vector(__m128 val) : reg(val) {} // 2x the speed of the line above

    inline Vector& operator+=(const Vector& o) {
#ifdef OPTIMIZED_PLUS_EQUALS
        // YES! I beat it! Using this intrinsic is faster than plain C++.
        reg = _mm_add_ps(reg, o.reg);
#else
        x += o.x, y += o.y, z += o.z, w += o.w;
#endif
        return *this;
    }

    inline Vector operator+(const Vector& o) const {
#ifdef OPTIMIZED_PLUS
        // This is slower
        return Vector(_mm_add_ps(reg, o.reg));
#else
        return Vector(x + o.x, y + o.y, z + o.z, w + o.w);
#endif
    }

    static Vector random() { return Vector(randFloat(), randFloat(), randFloat(), randFloat()); }
    void print() { printf("%.2f %.2f %.2f %.2f\n", x, y, z, w); }
};

int runs = 8000000;
Vector sum;

// OPTIMIZED_PLUS_EQUALS (intrinsics) runs FASTER: 0.006 intrinsics vs 0.009 (plain C++)
void test1() {
    for (int i = 0; i < runs; i++)
        sum += Vector(1.f, 0.25f, 0.5f, 0.5f); //Vector::random();
}

// OPTIMIZED_PLUS runs SLOWER: 0.03 for plain C++ vs 0.07 for intrinsics
void test2() {
    float j = 27.f;
    for (int i = 0; i < runs; i++) {
        sum += Vector(j * i, i, i / j, i) + Vector(i, 2 * i * j, 3 * i * j * j, 4 * i);
    }
}

int main() {
    Timer timer;
    //test1();
    test2();
    printf("Time: %f\n", timer.getTime());
    sum.print();
}
```

Edit

Why am I doing this? The VS 2012 profiler tells me that my vector arithmetic operations could use some tuning.


1 answer

As Mysticial noted, the union hack is the most likely culprit in test2. It forces the data to pass through the L1 cache, which, while fast, has a latency much larger than the 2-cycle gain that the vector code offers (see below).

But also keep in mind that the CPU can run several instructions out of order and in parallel (it is superscalar). For example, Sandy Bridge has 6 execution ports, p0 through p5: floating-point multiplication/division runs on p0, while floating-point addition and integer multiplication run on p1. Moreover, division takes 3 to 4 times more cycles than multiplication/addition and is not pipelined (the execution unit cannot start another division while one is in flight). So in test2, while the vector code is waiting for the expensive division and some multiplications to complete on p0, the scalar code can execute 2 extra add instructions on p1, which most likely erases any advantage of the vector instructions.

test1 is different: the constant vector can be kept in an xmm register, so the loop contains nothing but add instructions. But the code is not 3x faster, as you might expect. The reason is instruction pipelining: each add instruction has a latency of 3 cycles, but the CPU can start a new one every cycle when they are independent of each other, which is the case for the component-wise scalar additions. So the vector code executes one add per loop iteration with a 3-cycle latency, while the scalar code performs its 3 add instructions in only 5 cycles (one issued per cycle, and the third completes 3 cycles after issue: 2 + 3 = 5).

A very good resource on CPU architecture and optimization is http://www.agner.org/optimize/


Source: https://habr.com/ru/post/1439670/

