I noticed that my code runs much slower on 64-bit Linux than on 32-bit Linux, 64-bit Windows, or 64-bit Mac. Here is a minimal test case.

    #include <stdlib.h>

    typedef unsigned char UINT8;

    void stretch(UINT8 * lineOut, UINT8 * lineIn, int xsize, float *kk)
    {
        int xx, x;

        for (xx = 0; xx < xsize; xx++) {
            float ss = 0.0;
            for (x = 0; x < xsize; x++) {
                ss += lineIn[x] * kk[x];
            }
            lineOut[xx] = (UINT8) ss;
        }
    }

    int main( int argc, char** argv )
    {
        int i;
        int xsize = 2048;

        UINT8 *lineIn = calloc(xsize, sizeof(UINT8));
        UINT8 *lineOut = calloc(xsize, sizeof(UINT8));
        float *kk = calloc(xsize, sizeof(float));

        for (i = 0; i < 1024; i++) {
            stretch(lineOut, lineIn, xsize, kk);
        }

        return 0;
    }

And here is how it performs:

    $ cc --version
    cc (Ubuntu 4.8.2-19ubuntu1) 4.8.2
    $ cc -O2 -Wall -m64 ./tt.c -o ./tt && time ./tt
    user 14.166s
    $ cc -O2 -Wall -m32 ./tt.c -o ./tt && time ./tt
    user 5.018s

As you can see, the 32-bit version is almost 3 times faster (I tested on both 32-bit and 64-bit Ubuntu, with the same result). Even stranger, performance depends on the C standard selected:

    $ cc -O2 -Wall -std=c99 -m32 ./tt.c -o ./tt && time ./tt
    user 15.825s
    $ cc -O2 -Wall -std=gnu99 -m32 ./tt.c -o ./tt && time ./tt
    user 5.090s

How can this be? And how can I work around it to speed up the 64-bit version generated by GCC?
Update 1
I compared the assembly generated for the fast 32-bit builds (default and gnu99) with the slow one (c99), and found the following in the slow build:

    .L5:
        movzbl (%ebx,%eax), %edx

In the fast builds there are no fstps and flds instructions. So in the slow build GCC stores the value to memory and loads it back at each step. I tried declaring the variable as register float, but that does not help.
Update 2
I tested gcc-4.9 and it seems to generate optimal code for 64 bits. Also, -ffast-math (suggested by @jch) fixes -m32 -std=c99 for both versions of GCC. I'm still looking for a solution for 64 bits on gcc-4.8, because it is currently more widely used than 4.9.