GCC generated 64-bit code 3 times slower than 32 bit

Question

I noticed that my code runs much slower on 64-bit Linux than on 32-bit Linux, 64-bit Windows, or 64-bit Mac. Here is a minimal test case:

    #include <stdlib.h>

    typedef unsigned char UINT8;

    void stretch(UINT8 * lineOut, UINT8 * lineIn, int xsize, float *kk)
    {
        int xx, x;

        for (xx = 0; xx < xsize; xx++) {
            float ss = 0.0;
            for (x = 0; x < xsize; x++) {
                ss += lineIn[x] * kk[x];
            }
            lineOut[xx] = (UINT8) ss;
        }
    }

    int main( int argc, char** argv )
    {
        int i;
        int xsize = 2048;

        UINT8 *lineIn = calloc(xsize, sizeof(UINT8));
        UINT8 *lineOut = calloc(xsize, sizeof(UINT8));
        float *kk = calloc(xsize, sizeof(float));

        for (i = 0; i < 1024; i++) {
            stretch(lineOut, lineIn, xsize, kk);
        }

        return 0;
    }

And here is how it performs:

    $ cc --version
    cc (Ubuntu 4.8.2-19ubuntu1) 4.8.2
    $ cc -O2 -Wall -m64 ./tt.c -o ./tt && time ./tt
    user 14.166s
    $ cc -O2 -Wall -m32 ./tt.c -o ./tt && time ./tt
    user 5.018s

As you can see, the 32-bit version is almost 3 times faster (I tested on both 32-bit and 64-bit Ubuntu, with the same result). Even stranger, performance depends on the C standard selected:

    $ cc -O2 -Wall -std=c99 -m32 ./tt.c -o ./tt && time ./tt
    user 15.825s
    $ cc -O2 -Wall -std=gnu99 -m32 ./tt.c -o ./tt && time ./tt
    user 5.090s

How can this be? How can I work around it to speed up the 64-bit code generated by GCC?

Update 1

I compared the assembly generated for the fast 32-bit builds (default and gnu99) with the slow one (c99), and found the following in the slow version:

    .L5:
        movzbl  (%ebx,%eax), %edx    # MEM[base: lineIn_10(D), index: _72, offset: 0B], D.1543
        movl    %edx, (%esp)         # D.1543,
        fildl   (%esp)               #
        fmuls   (%esi,%eax,4)        # MEM[base: kk_18(D), index: _72, step: 4, offset: 0B]
        addl    $1, %eax             #, x
        cmpl    %ecx, %eax           # xsize, x
        faddp   %st, %st(1)          #,
        fstps   12(%esp)             #
        flds    12(%esp)             #
        jne     .L5                  #,

The fast versions have no fstps and flds. In other words, in the slow version GCC stores the value to memory and reloads it on every iteration. I tried declaring ss as register float, but that does not help.

Update 2

I tested gcc-4.9 and it seems to generate optimal code for 64-bit. Also, -ffast-math (suggested by @jch) fixes -m32 -std=c99 for both versions of GCC. I'm still looking for a solution for 64-bit on gcc-4.8, because it is currently more common than 4.9.

+6

performance c gcc 64bit

homm Oct 27 '14 at 10:55

4 answers

Here is what I tried: I declared ss as volatile, which prevents the compiler from optimizing accesses to it. With that change I got similar times for the 32-bit and 64-bit versions.

The 64-bit build was slightly slower, but that is expected: 64-bit code is larger and the uCode cache has a finite size. So in general, 64-bit code should be slightly slower than 32-bit code (by less than 3-4%).

Returning to the problem, I think that in 32-bit mode the compiler optimizes ss more aggressively.
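The volatile experiment described at the start of this answer can be sketched roughly as follows. This is an illustration only: the name stretch_volatile is hypothetical, and the answer does not show the exact code that was timed. The only intended difference from the question's stretch() is the volatile qualifier on ss, which forces every update of the accumulator through memory.

```c
typedef unsigned char UINT8;

/* Hypothetical variant of stretch() from the question: ss is volatile,
 * so the compiler must store and reload it on every inner iteration. */
void stretch_volatile(UINT8 *lineOut, const UINT8 *lineIn, int xsize,
                      const float *kk)
{
    for (int xx = 0; xx < xsize; xx++) {
        volatile float ss = 0.0f;
        for (int x = 0; x < xsize; x++) {
            ss += lineIn[x] * kk[x];
        }
        lineOut[xx] = (UINT8) ss;
    }
}
```

This is a diagnostic rather than a fix: it makes both builds pay the memory round trip, which is why the timings converge.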

Update 1:

Looking at the 64-bit code, it uses the CVTTSS2SI instruction together with the CVTSI2SS instruction to convert between floating point and integer. These conversions have a higher latency. The 32-bit code simply uses the FMULS instruction, which operates directly on floats. You should look for a compiler option that prevents these conversions.

+2

VAndrei Oct 27 '14 at 11:36

In 32-bit mode, the compiler makes an extra effort to maintain strict IEEE 754 floating point semantics. You can avoid this by compiling with -ffast-math :

    $ gcc -m32 -O2 -std=c99 test.c && time ./a.out
    real 0m13.869s
    user 0m13.884s
    sys 0m0.000s
    $ gcc -m32 -O2 -std=c99 -ffast-math test.c && time ./a.out
    real 0m4.477s
    user 0m4.480s
    sys 0m0.000s

I cannot reproduce your results in 64-bit mode, but I am sure that -ffast-math will solve your problem. More generally, if you don't need the reproducible rounding behavior of IEEE 754, -ffast-math is what you want.
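To see which evaluation mode a given build uses, one option is to inspect the standard C99 macro FLT_EVAL_METHOD from <float.h>. The sketch below is an illustration, not something from this thread; the helper names float_eval_method and describe_eval_method are hypothetical. On x87 targets (such as -m32 without SSE math) GCC typically reports 2, and on x86-64 with SSE math typically 0, but check your own toolchain.

```c
#include <float.h>
#include <stdio.h>

/* Hypothetical helper: returns the value of FLT_EVAL_METHOD for this build. */
int float_eval_method(void)
{
    return (int) FLT_EVAL_METHOD;
}

/* Hypothetical helper: maps the C99-defined values to a description. */
const char *describe_eval_method(int method)
{
    switch (method) {
    case 0:
        return "operations evaluated in the range and precision of their type";
    case 1:
        return "float and double evaluated as double";
    case 2:
        return "all operations evaluated as long double";
    default:
        return "indeterminable or implementation-defined";
    }
}

/* Prints the evaluation method, e.g. from main(). */
void print_eval_method(void)
{
    int method = float_eval_method();
    printf("FLT_EVAL_METHOD = %d (%s)\n", method, describe_eval_method(method));
}
```

When excess precision is in play, -std=c99 makes GCC default to -fexcess-precision=standard, which rounds intermediate results back to float, while -std=gnu99 defaults to -fexcess-precision=fast. Passing -fexcess-precision=fast explicitly may therefore be a narrower alternative to -ffast-math, though nobody in this thread reports timing it.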

+2

jch Oct 27 '14 at 12:26

This looks like a case for restrict. The three arrays cannot overlap, right?
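A minimal sketch of that suggestion, assuming the three buffers never alias (true for the separate calloc allocations in the question). The name stretch_restrict is hypothetical, and this thread does not report whether the qualifier changes the timings.

```c
typedef unsigned char UINT8;

/* Hypothetical variant of stretch() from the question. The C99 restrict
 * qualifiers promise the compiler that lineOut, lineIn and kk do not
 * overlap, so stores to lineOut cannot change lineIn or kk. */
void stretch_restrict(UINT8 *restrict lineOut, const UINT8 *restrict lineIn,
                      int xsize, const float *restrict kk)
{
    for (int xx = 0; xx < xsize; xx++) {
        float ss = 0.0f;
        for (int x = 0; x < xsize; x++) {
            ss += lineIn[x] * kk[x];
        }
        lineOut[xx] = (UINT8) ss;
    }
}
```

Calling this with overlapping buffers is undefined behavior, so the qualifier is only safe when the caller can guarantee separation.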

+1

Msalters Oct 27 '14 at 11:18

Vyacheslav Egorov · Accepted Answer · 2014-10-27T12:56:01+0000

The code generated by older versions of GCC has a partial dependency stall.

    movzbl   (%rsi,%rax), %r8d
    cvtsi2ss %r8d, %xmm0    ;; all upper bits in %xmm0 are false dependency

The dependency can be broken with xorps.

    #ifdef __SSE__
    /* Zero the destination register first so that cvtsi2ss does not
       depend on whatever the register held before. */
    static inline float __attribute__((always_inline)) i2f(int v) {
        float x;
        /* "=x" requests an SSE register for the output operand. */
        __asm__("xorps %0, %0; cvtsi2ss %1, %0" : "=x"(x) : "r"(v) );
        return x;
    }
    #else
    static inline float __attribute__((always_inline)) i2f(int v) {
        return (float) v;
    }
    #endif

    void stretch(UINT8* lineOut, UINT8* lineIn, int xsize, float *kk)
    {
        int xx, x;

        for (xx = 0; xx < xsize; xx++) {
            float ss = 0.0;
            for (x = 0; x < xsize; x++) {
                ss += i2f(lineIn[x]) * kk[x];
            }
            lineOut[xx] = (UINT8) ss;
        }
    }

Results:

    $ cc -O2 -Wall -m64 ./test.c -o ./test64 && time ./test64
    ./test64 4.07s user 0.00s system 99% cpu 4.070 total
    $ cc -O2 -Wall -m32 ./test.c -o ./test32 && time ./test32
    ./test32 3.94s user 0.00s system 99% cpu 3.938 total
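For readers who prefer to avoid inline assembly, the same idea can be sketched with SSE intrinsics from <xmmintrin.h>. This is an assumption-laden alternative, not part of the answer above: the name i2f_intrin is hypothetical, and whether the compiler actually emits a zeroing xorps before cvtsi2ss depends on the GCC version and flags, so inspect the generated assembly before relying on it.

```c
#ifdef __SSE__
#include <xmmintrin.h>
#endif

/* Hypothetical intrinsic-based counterpart of i2f(): converts an int to
 * float by writing into an explicitly zeroed SSE register. */
static inline float i2f_intrin(int v)
{
#ifdef __SSE__
    __m128 zero = _mm_setzero_ps();            /* all four lanes set to 0.0f */
    __m128 converted = _mm_cvtsi32_ss(zero, v); /* low lane = (float) v */
    return _mm_cvtss_f32(converted);            /* extract the low lane */
#else
    return (float) v;                           /* portable fallback */
#endif
}
```

It is a drop-in replacement for i2f() in the inner loop, for example ss += i2f_intrin(lineIn[x]) * kk[x];.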
