Gcc simple performance loop arithmetic

Question

Gcc simple performance loop arithmetic

Problem: one, obviously, an additional line of code speeds up the program almost twice.

It is rather difficult to formulate the initial problem; it proceeds from the control boundary exclusion algorithm. So, just some simple test that I cannot understand.

Obviously, an extra line of code almost doubles the program speed.

The following source exists:

#include <stdlib.h> #include <stdio.h> int main(void) { long i = 0, a = 0, x = 0; int up = 200000000; int *values = malloc(sizeof(int)*up); for (i = 0; i < up ; ++i) { values[i]=i % 2; } for (i = 0; i < up ; ++i) { x = (a & i); #ifdef FAST x = 0; #endif a += values[x]; } printf ("a=%ld\n", a); return 0; }/*main*/

In this example, the value of 'a' is always 0. Line x = 0; Additionally.

But, (see - NOT ANY OPTIMIZATION!)
$ gcc -O0 -o short short.c && time. / short
a = 0
real 0m2.808s
user 0m2.196s
sys 0m0.596s

$ gcc -O0 -DFAST - short short.c && & time. / short
a = 0
real 0m1.869s
user 0m1.260s
sys 0m0.608s

And this is reproduced in many compiler / optimization options and program options.

In addition, they really lead to the same assembler, except that it puts that stupid extra 0 in some register! For instance:.

gcc -S -O0 -DFAST short.c && mv short.s shortFAST.s
gcc -S-O0 short.c && & mv short.s shortSLOW.s
diff shortFAST.s shortSLOW.s
55d54
<movq $ 0, -24 (% rbp)

And a little later - the same effect on some (all that I could check) other compilers / languages (including Java JIT). The only thing that was shared was the x86-64 architecture. It has been tested on Intel and AMD processors ...

+5

c gcc

Mikhail Tentyukov Oct 20 '14 at 20:25

source share

1 answer

Gabriel southern · Accepted Answer · 2014-10-20T21:31:02+0000

The short answer . Saving 0 eliminates the dependency on reading after writing in one of the loops.

More details

I thought this was an interesting question, and although you focused on the O0 optimization level, the same acceleration is observed on O3. But, looking at O0, it’s easier to focus on what the processor does to optimize the code, not the compiler, because, as you noticed, the resulting assembly code differs only in 1 instruction.

The assembly code for the cycle of interest is shown below.

  movq $0, -32(%rbp) jmp .L4 .L5: movq -32(%rbp), %rax movq -24(%rbp), %rdx andq %rdx, %rax movq %rax, -16(%rbp) movq $0, -16(%rbp) ;; This instruction in FAST but not SLOW movq -16(%rbp), %rax leaq 0(,%rax,4), %rdx movq -8(%rbp), %rax addq %rdx, %rax movl (%rax), %eax cltq addq %rax, -24(%rbp) addq $1, -32(%rbp) .L4: movl -36(%rbp), %eax cltq cmpq -32(%rbp), %rax jg .L5

Running with perf stat on my system. I get the following results:

Results for Slow Code

 Performance counter stats for './slow_o0': 1827.438670 task-clock # 0.999 CPUs utilized 155 context-switches # 0.085 K/sec 1 CPU-migrations # 0.001 K/sec 195,448 page-faults # 0.107 M/sec 6,675,246,466 cycles # 3.653 GHz 4,391,690,661 stalled-cycles-frontend # 65.79% frontend cycles idle 1,609,321,845 stalled-cycles-backend # 24.11% backend cycles idle 7,157,837,211 instructions # 1.07 insns per cycle # 0.61 stalled cycles per insn 490,110,757 branches # 268.195 M/sec 178,287 branch-misses # 0.04% of all branches 1.829712061 seconds time elapsed

Results for quick code

  Performance counter stats for './fast_o0': 1109.451910 task-clock # 0.998 CPUs utilized 95 context-switches # 0.086 K/sec 1 CPU-migrations # 0.001 K/sec 195,448 page-faults # 0.176 M/sec 4,067,613,078 cycles # 3.666 GHz 1,784,131,209 stalled-cycles-frontend # 43.86% frontend cycles idle 438,447,105 stalled-cycles-backend # 10.78% backend cycles idle 7,356,892,998 instructions # 1.81 insns per cycle # 0.24 stalled cycles per insn 489,945,197 branches # 441.610 M/sec 176,136 branch-misses # 0.04% of all branches 1.111398442 seconds time elapsed

So you can see that although the “fast” code executes more instructions, it has fewer stalls. When a processor runs out of order (like most x64 architectures) executes code, it tracks dependencies between instructions. The waiting command can be bypassed by another command if the operands are ready.

In this example, the critical point is probably this sequence of commands:

  andq %rdx, %rax movq %rax, -16(%rbp) movq $0, -16(%rbp) ;; This instruction in FAST but not SLOW movq -16(%rbp), %rax leaq 0(,%rax,4), %rdx movq -8(%rbp), %rax

In fast code, the command movq -8(%rbp), %rax will receive the result from movq $0, -16(%rbp) , redirected to it, and it will be able to execute earlier. While the slower version will have to wait for movq %rax, -16(%rbp) , which has more dependencies between iterations of the loop.

Note that without knowing more about the specific microarchitecture, this analysis is probably too simplistic. But I suspect that the main reason is this dependency and that executing the store 0 ( movq $0, -16(%rbp) ) allows the CPU to perform more aggressive speculation when executing the code sequence.

Gcc simple performance loop arithmetic

More articles: