How can gcc-o3 do so fast?

Question

How can gcc-o3 do so fast?

Let test_speed.c be the following C code:

#include <stdio.h> int main(){ int i; for(i=0; i < 1000000000; i++) {} printf("%d", i); }

I run in terminal:

 gcc -o test_speed test_speed.c

and then:

 time ./test_speed

I get:

Now I run the following:

 gcc -O3 -o test_speed test_speed.c

and then:

 time ./test_speed

I get:

How can a second launch be so fast? Is this already computed at compile time?

+5

c gcc

gus87 Apr 3 '18 at 15:17

source share

4 answers

because -O3 aggressive optimization suggests that

 for(i=0; i < 1000000000; i++) {}

It has no side effect (except for the value of i ) and completely deletes the cycle (directly setting i to 1000000000 ).

Disassembly (x86):

 00000000 <_main>: 0: 55 push %ebp 1: 89 e5 mov %esp,%ebp 3: 83 e4 f0 and $0xfffffff0,%esp 6: 83 ec 10 sub $0x10,%esp 9: e8 00 00 00 00 call e <_main+0xe> e: c7 44 24 04 00 ca 9a movl $0x3b9aca00,0x4(%esp) <== 1000000000 in hex, no loop 15: 3b 16: c7 04 24 00 00 00 00 movl $0x0,(%esp) 1d: e8 00 00 00 00 call 22 <_main+0x22> 22: 31 c0 xor %eax,%eax 24: c9 leave 25: c3 ret

that the optimization level is not suitable for calibrated cycles of active CPUs, as you can see (the result is the same as -O2 , but the cycle remains unoptimized only with -O )

+7

Jean-François Fabre Apr 3 '18 at 15:20

source share

The compiler recognizes that the loop does nothing and that deleting it will not change the output of the program, therefore the loop is fully optimized.

Here's the assembly with -O0 :

 .L3: .loc 1 4 0 is_stmt 0 discriminator 3 addl $1, -4(%rbp) .L2: .loc 1 4 0 discriminator 1 cmpl $999999999, -4(%rbp) # loop jle .L3 .loc 1 5 0 is_stmt 1 movl -4(%rbp), %eax movl %eax, %esi movl $.LC0, %edi movl $0, %eax call printf movl $0, %eax .loc 1 6 0 leave .cfi_def_cfa 7, 8 ret

And with -O3 :

 main: .LFB23: .file 1 "x1.c" .loc 1 2 0 .cfi_startproc .LVL0: subq $8, %rsp .cfi_def_cfa_offset 16 .LBB4: .LBB5: .file 2 "/usr/include/x86_64-linux-gnu/bits/stdio2.h" .loc 2 104 0 movl $1000000000, %edx # stored value, no loop movl $.LC0, %esi movl $1, %edi xorl %eax, %eax call __printf_chk .LVL1: .LBE5: .LBE4: .loc 1 6 0 xorl %eax, %eax addq $8, %rsp .cfi_def_cfa_offset 8 ret

You can see that in the case of -O3 cycle is completely deleted, and the final value of i , 1,000,000,000 is saved directly.

+2

dbush Apr 3 '18 at 15:20

source share

The compiler should only support observable program behavior. Counting a variable without I / O, interaction, or simply using its value is not observed, since your loop does nothing, the optimizer simply discards it completely and directly assigns the final value.

+2

Felix palmen Apr 3 '18 at 15:20

source share

Dirk eddelbuettel · Accepted Answer · 2018-04-03T15:20:15+0000

gcc "knows" that there is no body in the cycle and there is no dependence on any result, temporary or real - therefore it deletes the cycle.

A good tool for such analysis is godbolt.org , which shows the generated assembly. The difference between the lack of optimization and the optimization of -O3 is small:

How can gcc-o3 do so fast?

No optimization

C-O3

More articles: