How can gcc-o3 do so fast?

Let test_speed.c be the following C code:

#include <stdio.h> int main(){ int i; for(i=0; i < 1000000000; i++) {} printf("%d", i); } 

I run in terminal:

 gcc -o test_speed test_speed.c 

and then:

 time ./test_speed 

I get:

enter image description here

Now I run the following:

 gcc -O3 -o test_speed test_speed.c 

and then:

 time ./test_speed 

I get:

enter image description here

How can a second launch be so fast? Is this already computed at compile time?

+5
source share
4 answers

gcc "knows" that there is no body in the cycle and there is no dependence on any result, temporary or real - therefore it deletes the cycle.

A good tool for such analysis is godbolt.org , which shows the generated assembly. The difference between the lack of optimization and the optimization of -O3 is small:

No optimization

enter image description here

C-O3

enter image description here

+3
source

because -O3 aggressive optimization suggests that

 for(i=0; i < 1000000000; i++) {} 

It has no side effect (except for the value of i ) and completely deletes the cycle (directly setting i to 1000000000 ).

Disassembly (x86):

 00000000 <_main>: 0: 55 push %ebp 1: 89 e5 mov %esp,%ebp 3: 83 e4 f0 and $0xfffffff0,%esp 6: 83 ec 10 sub $0x10,%esp 9: e8 00 00 00 00 call e <_main+0xe> e: c7 44 24 04 00 ca 9a movl $0x3b9aca00,0x4(%esp) <== 1000000000 in hex, no loop 15: 3b 16: c7 04 24 00 00 00 00 movl $0x0,(%esp) 1d: e8 00 00 00 00 call 22 <_main+0x22> 22: 31 c0 xor %eax,%eax 24: c9 leave 25: c3 ret 

that the optimization level is not suitable for calibrated cycles of active CPUs, as you can see (the result is the same as -O2 , but the cycle remains unoptimized only with -O )

+7
source

The compiler recognizes that the loop does nothing and that deleting it will not change the output of the program, therefore the loop is fully optimized.

Here's the assembly with -O0 :

 .L3: .loc 1 4 0 is_stmt 0 discriminator 3 addl $1, -4(%rbp) .L2: .loc 1 4 0 discriminator 1 cmpl $999999999, -4(%rbp) # loop jle .L3 .loc 1 5 0 is_stmt 1 movl -4(%rbp), %eax movl %eax, %esi movl $.LC0, %edi movl $0, %eax call printf movl $0, %eax .loc 1 6 0 leave .cfi_def_cfa 7, 8 ret 

And with -O3 :

 main: .LFB23: .file 1 "x1.c" .loc 1 2 0 .cfi_startproc .LVL0: subq $8, %rsp .cfi_def_cfa_offset 16 .LBB4: .LBB5: .file 2 "/usr/include/x86_64-linux-gnu/bits/stdio2.h" .loc 2 104 0 movl $1000000000, %edx # stored value, no loop movl $.LC0, %esi movl $1, %edi xorl %eax, %eax call __printf_chk .LVL1: .LBE5: .LBE4: .loc 1 6 0 xorl %eax, %eax addq $8, %rsp .cfi_def_cfa_offset 8 ret 

You can see that in the case of -O3 cycle is completely deleted, and the final value of i , 1,000,000,000 is saved directly.

+2
source

The compiler should only support observable program behavior. Counting a variable without I / O, interaction, or simply using its value is not observed, since your loop does nothing, the optimizer simply discards it completely and directly assigns the final value.

+2
source

Source: https://habr.com/ru/post/1276233/


All Articles