C++: runtime difference between two virtual function calls

Consider this code, built with gcc 4.5.1 (Ubuntu 10.04, Intel Core 2 Duo 3.0 GHz). There are just two tests: in the first I call a virtual function directly, and in the second I call it through the PrintWrapper class:

test.cpp

    #define ITER 100000000

    class Print {
    public:
        typedef Print* Ptr;

        virtual void print(int p1, float p2, float p3, float p4) { /* DOES NOTHING */ }
    };

    class PrintWrapper {
    public:
        typedef PrintWrapper* Ptr;

        PrintWrapper(Print::Ptr print, int p1, float p2, float p3, float p4)
            : m_print(print), _p1(p1), _p2(p2), _p3(p3), _p4(p4) {}

        ~PrintWrapper() {}

        void execute() { m_print->print(_p1, _p2, _p3, _p4); }

    private:
        Print::Ptr m_print;
        int _p1;
        float _p2, _p3, _p4;
    };

    Print::Ptr p = new Print();
    PrintWrapper::Ptr pw = new PrintWrapper(p, 1, 2.f, 3.0f, 4.0f);

    void test1() {
        //-------------test 1-------------------------
        for (auto var = 0; var < ITER; ++var) {
            p->print(1, 2.f, 3.0f, 4.0f);
        }
    }

    void test2() {
        //-------------test 2-------------------------
        for (auto var = 0; var < ITER; ++var) {
            pw->execute();
        }
    }

    int main() {
        test1();
        test2();
    }

I profiled it with gprof and objdump:

    g++ -c -std=c++0x -pg -g -O2 test.cpp
    objdump -d -M intel -S test.o > objdump.txt
    g++ -pg test.o -o test
    ./test
    gprof test > gprof.output

In gprof.output I noticed that test2() takes longer than test1(), but I cannot explain why:

    Each sample counts as 0.01 seconds.
      %   cumulative   self               self     total
     time   seconds   seconds     calls  ms/call  ms/call  name
     49.40      0.41     0.41         1   410.00   540.00  test2()
     31.33      0.67     0.26 200000000     0.00     0.00  Print::print(int, float, float, float)
     19.28      0.83     0.16         1   160.00   290.00  test1()
      0.00      0.83     0.00         1     0.00     0.00  global constructors keyed to p

The disassembly in objdump.txt doesn't help me either:

    //-------------test 1-------------------------
    for (auto var = 0; var < ITER; ++var)
      15:   83 c3 01                add    ebx,0x1
    {
        p->print(1, 2.f,3.0f,4.0f);
      18:   8b 10                   mov    edx,DWORD PTR [eax]
      1a:   c7 44 24 10 00 00 80    mov    DWORD PTR [esp+0x10],0x40800000
      21:   40
      22:   c7 44 24 0c 00 00 40    mov    DWORD PTR [esp+0xc],0x40400000
      29:   40
      2a:   c7 44 24 08 00 00 00    mov    DWORD PTR [esp+0x8],0x40000000
      31:   40
      32:   c7 44 24 04 01 00 00    mov    DWORD PTR [esp+0x4],0x1
      39:   00
      3a:   89 04 24                mov    DWORD PTR [esp],eax
      3d:   ff 12                   call   DWORD PTR [edx]

    //-------------test 2-------------------------
    for (auto var = 0; var < ITER; ++var)
      65:   83 c3 01                add    ebx,0x1
    ~PrintWrapper(){}
    void execute() { m_print->print(_p1,_p2,_p3,_p4);
      68:   8b 10                   mov    edx,DWORD PTR [eax]
      6a:   8b 70 10                mov    esi,DWORD PTR [eax+0x10]
      6d:   8b 0a                   mov    ecx,DWORD PTR [edx]
      6f:   89 74 24 10             mov    DWORD PTR [esp+0x10],esi
      73:   8b 70 0c                mov    esi,DWORD PTR [eax+0xc]
      76:   89 74 24 0c             mov    DWORD PTR [esp+0xc],esi
      7a:   8b 70 08                mov    esi,DWORD PTR [eax+0x8]
      7d:   89 74 24 08             mov    DWORD PTR [esp+0x8],esi
      81:   8b 40 04                mov    eax,DWORD PTR [eax+0x4]
      84:   89 14 24                mov    DWORD PTR [esp],edx
      87:   89 44 24 04             mov    DWORD PTR [esp+0x4],eax
      8b:   ff 11                   call   DWORD PTR [ecx]

How can we explain this difference?

+6
4 answers

In test2() the program must first load the global pointer pw, then run pw->execute() (inlined in the listing above, so the call itself adds no overhead there), then load pw->m_print from the heap along with the arguments _p1 through _p4, then load the vtable pointer of the object that m_print points to, then load the vtable slot for Print::print, and finally call it. Since the compiler cannot see through the virtual call, it must assume that all of these values may have changed by the next iteration, so it reloads them every time.

In test1() the arguments are embedded in the instructions as immediate values, so we only need to load p, the vtable pointer, and the vtable slot. That saves five loads per iteration, which can easily explain the time difference.

In short: the extra cost is the loads of pw->m_print and pw->_p1 through pw->_p4.
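The following is a minimal sketch of why those reloads cannot be hoisted out of the loop. All names here (Base, Wrapper, Mutating, run_wrapped) are hypothetical and not part of the original test.cpp; the wrapper is passed to print() by reference to make the aliasing explicit, whereas in the question the same effect could happen through the global pw. An override that the optimizer cannot rule out changes the wrapper's field during the call, so each iteration has to read the field again.

```cpp
#include <cassert>

// Hypothetical types for illustration only; not the classes from test.cpp.
struct Wrapper;

struct Base {
    virtual ~Base() {}
    // Default implementation does nothing, like Print::print in the question.
    virtual void print(Wrapper&, int) {}
};

struct Wrapper {
    Base* target;
    int p1;
    // Forwards the stored argument through a virtual call.
    void execute() { target->print(*this, p1); }
};

// An override that modifies the wrapper it was called through.
struct Mutating : Base {
    int sum;
    Mutating() : sum(0) {}
    virtual void print(Wrapper& w, int p1) {
        sum += p1;  // record the argument value actually passed
        ++w.p1;     // change the field the caller will pass next time
    }
};

// Calls execute() repeatedly and returns the sum of the arguments seen.
int run_wrapped(int iterations) {
    assert(iterations >= 0);
    Mutating m;
    Wrapper w = { &m, 1 };
    for (int i = 0; i < iterations; ++i) {
        w.execute();
    }
    return m.sum;
}
```

With three iterations the arguments seen are 1, 2 and 3. If the compiler had cached p1 in a register across the virtual call, the callee would have seen 1 three times, which would be a miscompilation.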

+3

One difference is that the values you pass to print directly are stored in the instructions themselves as immediate operands, while the data held in PrintWrapper must be loaded from the heap. You can see this in the disassembly. The memory accesses alone can account for different timings.

+2

In the direct call, the compiler can optimize away the virtual dispatch because the dynamic type of p is known at compile time (only one assignment to p is visible). In PrintWrapper the type is erased, so a real virtual function call must be made.
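As a source-level illustration of what skipping virtual dispatch means, here is a small sketch using a qualified call. The classes are simplified stand-ins (print returns an int here so the effect is observable, and LoudPrint, call_virtual and call_qualified are hypothetical names). Note that the test1 listing in the question still ends in an indirect call (call DWORD PTR [edx]), so gcc 4.5.1 does not appear to have devirtualized that loop on its own.

```cpp
#include <cassert>

// Simplified stand-in for the question's Print class (assumption: returns
// a value so the chosen implementation can be observed).
struct Print {
    virtual ~Print() {}
    virtual int print(int p1) { return p1; }
};

// Hypothetical derived class that overrides print().
struct LoudPrint : Print {
    virtual int print(int p1) { return p1 * 10; }
};

// Ordinary virtual call: the implementation is chosen at run time
// through the object's vtable.
int call_virtual(Print* p, int v) {
    assert(p != 0);
    return p->print(v);
}

// Qualified call: always resolves to Print::print at compile time,
// regardless of the object's dynamic type, so no vtable lookup is needed.
int call_qualified(Print* p, int v) {
    assert(p != 0);
    return p->Print::print(v);
}
```

Called on a LoudPrint object, call_virtual picks the override while call_qualified does not. A qualified call is only appropriate when the base implementation is really the one wanted.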

+1

Are you really printing, or just calling a function named Print that does nothing? If you are actually printing, you are weighing the hairs on the pig: the I/O would dwarf any difference in call overhead.

Regardless, gprof is blind to I/O, so it only looks at your CPU usage.

Note: test2 executes 11 instructions before the call, while test1 executes only 6. So it is not surprising if more program-counter samples land in test2.
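Since gprof attributes time by sampling the program counter, one possible cross-check is to time each loop directly. The helper below is a hypothetical sketch, not something from the thread; it assumes a C++11 compiler that provides std::chrono::steady_clock, which is newer than the gcc 4.5.1 used in the question.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs body() `iterations` times and returns the
// elapsed wall-clock time in nanoseconds, measured with a monotonic clock.
template <typename F>
long long time_loop_ns(F body, long iterations) {
    assert(iterations >= 0);
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        body();
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

For example, time_loop_ns([]{ pw->execute(); }, ITER) and time_loop_ns([]{ p->print(1, 2.f, 3.0f, 4.0f); }, ITER) could be compared, keeping in mind that an optimizer may treat the lambda bodies differently from the original loops.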

+1

Source: https://habr.com/ru/post/903228/

