Short answer: storing 0 eliminates a read-after-write dependency in one of the loops.
More details
I thought this was an interesting question. Although you focused on the -O0 optimization level, the same speedup is observed at -O3. Looking at -O0, though, makes it easier to focus on what the processor, rather than the compiler, does to optimize the code, because, as you noticed, the resulting assembly differs by only one instruction.
The assembly code for the loop of interest is shown below.
        movq    $0, -32(%rbp)
        jmp     .L4
    .L5:
        movq    -32(%rbp), %rax
        movq    -24(%rbp), %rdx
        andq    %rdx, %rax
        movq    %rax, -16(%rbp)
        movq    $0, -16(%rbp)      ;; This instruction in FAST but not SLOW
        movq    -16(%rbp), %rax
        leaq    0(,%rax,4), %rdx
        movq    -8(%rbp), %rax
        addq    %rdx, %rax
        movl    (%rax), %eax
        cltq
        addq    %rax, -24(%rbp)
        addq    $1, -32(%rbp)
    .L4:
        movl    -36(%rbp), %eax
        cltq
        cmpq    -32(%rbp), %rax
        jg      .L5
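For readers who find C easier to follow than AT&T assembly, here is a rough reconstruction of the loop, read back from the listing above. It is a sketch, not the original source: the function name `run_loop`, the variable names, the `FAST` macro, and the initial value of `sum` are assumptions (the listing only shows the loop counter being zeroed).

```c
/* Hypothetical reconstruction from the -O0 assembly; names are invented.
   Stack slots: -8 array, -16 j, -24 sum, -32 i, -36 n. */
long run_loop(const int *array, int n) {
    long sum = 0;   /* -24(%rbp); starting at 0 is an assumption */
    long i, j;      /* -32(%rbp) and -16(%rbp) */

    for (i = 0; i < n; i++) {   /* cmpq -32(%rbp), %rax ; jg .L5 */
        j = i & sum;            /* andq ; movq %rax, -16(%rbp) */
#ifdef FAST
        j = 0;                  /* movq $0, -16(%rbp): the extra store */
#endif
        sum += array[j];        /* movl (%rax), %eax ; cltq ; addq %rax, -24(%rbp) */
    }
    return sum;
}
```

Compiling this once with `gcc -O0 -S` and once with `gcc -O0 -DFAST -S` should give two listings that differ by the single store, although the exact stack offsets may not match the ones shown here.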
Running with perf stat on my system, I get the following results:
Results for slow code
    Performance counter stats for './slow_o0':
        1827.438670 task-clock
Results for fast code
    Performance counter stats for './fast_o0':
        1109.451910 task-clock
So you can see that although the "fast" code executes more instructions, it has fewer stalls. When an out-of-order processor (like most x64 implementations) executes code, it tracks dependencies between instructions. A waiting instruction can be bypassed by a later instruction whose operands are ready.
In this example, the critical point is probably this sequence of instructions:
        andq    %rdx, %rax
        movq    %rax, -16(%rbp)
        movq    $0, -16(%rbp)      ;; This instruction in FAST but not SLOW
        movq    -16(%rbp), %rax
        leaq    0(,%rax,4), %rdx
        movq    -8(%rbp), %rax
In the fast code, the load movq -16(%rbp), %rax gets its value forwarded from the immediately preceding store movq $0, -16(%rbp), so it can execute earlier. In the slow version, the same load has to wait for movq %rax, -16(%rbp), whose data comes from the andq and therefore depends on values carried over from the previous loop iteration.
Note that without knowing more about the specific microarchitecture, this analysis is probably too simplistic. But I suspect that this dependency is the main cause, and that the store of 0 ( movq $0, -16(%rbp) ) lets the CPU speculate more aggressively when executing the code sequence.
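To check whether the effect shows up on another machine, both variants can be timed from a single program. The helper below is a minimal sketch under stated assumptions: `loop_slow`, `loop_fast`, `loop_fn` and `time_loop` are hypothetical names, the loop bodies are read back from the assembly rather than taken from the original source, and `clock()` is much coarser than perf stat.

```c
#include <time.h>

/* Signature shared by both loop variants (all names are hypothetical). */
typedef long (*loop_fn)(const int *array, int n);

/* SLOW variant: the index is whatever i & sum produced. */
long loop_slow(const int *array, int n) {
    long sum = 0, i, j;
    for (i = 0; i < n; i++) {
        j = i & sum;
        sum += array[j];
    }
    return sum;
}

/* FAST variant: same loop plus the redundant store of 0 to j. */
long loop_fast(const int *array, int n) {
    long sum = 0, i, j;
    for (i = 0; i < n; i++) {
        j = i & sum;
        j = 0;
        sum += array[j];
    }
    return sum;
}

/* Runs one variant, stores its result, and returns the CPU time in seconds. */
double time_loop(loop_fn fn, const int *array, int n, long *result) {
    clock_t start = clock();
    *result = fn(array, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

With an all-zero array both variants read element 0 on every iteration and return the same sum, so any timing gap comes from the extra store rather than from the memory access pattern. Build with -O0 to keep the variables on the stack as in the listing; with optimization enabled the compiler may remove the difference entirely.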