How much overhead is there when calling a function in C++?

Numerous publications recommend inline functions to "avoid the overhead of a function call." However, I have not seen quantitative data. What is the actual overhead of a function call, i.e. what performance gain do we achieve by inlining functions?

+49
c++ function overhead
Sep 28 '08 at 2:07
source share
16 answers

On most architectures, the cost consists of saving all (or some, or none) of the registers to the stack, pushing the function arguments onto the stack (or placing them in registers), adjusting the stack pointer, and jumping to the start of the new code. Then, when the function is done, the registers must be restored from the stack. This web page describes what is involved in the various calling conventions.

Most C++ compilers are smart enough nowadays to inline functions for you; the inline keyword is just a hint to the compiler. Some will even inline across translation units where they decide it is worthwhile.

+41
Sep 28 '08 at 2:14
source share

There is a technical answer and a practical one. The practical answer is that it will never matter, and in the very rare case where it does, the only way you will know is through actual, profiled tests.

The technical answer, which your literature refers to, is generally not relevant thanks to compiler optimizations. But if you are still interested, it is well described by Josh.

As for a "percentage", you would have to know how expensive the function itself is; outside the cost of the called function there is no percentage, because you are comparing against a zero-cost operation. Inlined code has no call overhead; the processor simply moves on to the next instruction. The downside of inlining is larger code size, which manifests its cost differently than the stack construction/destruction costs.

+11
Sep 28 '08 at 2:53
source share

The amount of overhead will depend on the compiler, the CPU, and so on. The percentage overhead will depend on the code you are inlining. The only way to know is to take your code and profile it both ways; that is why there is no definitive answer.

+8
Sep 28 '08 at 2:30
source share

Your question is one of those questions that has no answer one could call "the absolute truth." The overhead of a normal function call depends on three factors:

  • The CPU. The overhead of x86, PPC, and ARM CPUs varies a lot, and even if you stay with just one architecture, the overhead also varies quite a bit between an Intel Pentium 4, an Intel Core 2 Duo, and an Intel Core i7. It might even vary noticeably between an Intel and an AMD CPU running at the same clock speed, since factors such as cache sizes, caching algorithms, memory access patterns, and the actual hardware implementation of the call opcode itself can have a huge influence on the overhead.

  • The ABI (Application Binary Interface). Even with the same CPU, there often exist different ABIs that specify how function calls pass parameters (via registers, via the stack, or via a mix of both) and where and how stack frame initialization and cleanup take place. All of this influences the overhead. Different operating systems may use different ABIs for the same CPU; e.g. Linux, Windows, and Solaris may all use a different ABI for the same CPU.

  • The compiler. Strictly following the ABI is only important if functions are called between independent code units, e.g. if an application calls a function of a system library, or a user library calls a function of another user library. As long as functions are "private", not visible outside a certain library or binary, the compiler may "cheat": it need not strictly follow the ABI and can instead use shortcuts that lead to faster function calls. E.g. it may pass parameters in registers instead of using the stack, or it may skip stack frame setup and cleanup entirely if not really necessary.

If you want to know the overhead for a specific combination of the three factors above, e.g. for an Intel Core i5 on Linux using GCC, your only way to get this information is benchmarking the difference between two implementations: one using function calls and one where you copy the code directly into the caller. This way you force inlining for sure, since the inline statement is only a hint and does not always lead to inlining.

However, the real question here is: does the overhead really matter? One thing is for sure: a function call always has an overhead. It may be small, it may be big, but it certainly exists. And no matter how small it is, if a function is called often enough in a performance-critical section, the overhead will matter to some degree. Inlining rarely makes your code slower, unless you terribly overdo it; it will make the code bigger, though. Compilers today are pretty good at deciding themselves when to inline and when not, so you hardly ever have to rack your brain over it.

Personally, I completely ignore inlining during development, until I have a more or less usable product that I can profile, and only if the profile tells me that a certain function is called very often, as well as in the critical section of the application, then I will consider "forced attachment" of this function.

While my answer so far is very generic, it applies to C as much as to C++ and Objective-C. As a closing word, let me say something about C++ in particular: virtual methods are doubly indirect function calls, which means they have a higher call overhead than normal function calls, and additionally they cannot be inlined. Non-virtual methods may be inlined by the compiler or not, but even if they are not inlined, they are still significantly faster than virtual ones, so you should not make methods virtual unless you really plan to override or overwrite them.

+7
Mar 03 '18
source share

I did a simple test against a simple increment function:

inc.c:

 typedef unsigned long ulong;

 ulong inc(ulong x){
     return x+1;
 }

main.c

 #include <stdio.h>
 #include <stdlib.h>

 typedef unsigned long ulong;

 #ifdef EXTERN
 ulong inc(ulong);
 #else
 static inline ulong inc(ulong x){ return x+1; }
 #endif

 int main(int argc, char** argv){
     if (argc < 1+1)
         return 1;
     ulong i, sum = 0, cnt;
     cnt = atoi(argv[1]);
     for(i=0;i<cnt;i++){
         sum+=inc(i);
     }
     printf("%lu\n", sum);
     return 0;
 }

Running it with a billion iterations on my Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz gave me:

  • 1.4 seconds for the inlined version
  • 4.4 seconds for the regularly linked version

(It appears to fluctuate by up to 0.2, but I'm too lazy to calculate proper standard deviations, nor do I care for them.)

This suggests that the overhead of function calls on this machine is about 3 nanoseconds.

The fastest I measured anything at was about 0.3 ns, so that would suggest a function call costs about 9 primitive ops, to put it very simplistically.

This overhead increases by roughly another 2 ns per call (total call time about 6 ns) for functions called through a PLT (functions in a shared library).

+6
Aug 07 '16 at 2:31
source share

For very small functions, inlining makes sense, because the (small) cost of the function call is significant relative to the (very small) cost of the function body. For most functions over a few lines, it is not a big win.

+4
Sep 28 '08 at 2:10
source share

It is worth pointing out that an inlined function increases the size of the calling function, and anything that increases the size of a function may well have a negative effect on caching. If you are right at a boundary, "just one more wafer-thin mint" of inlined code can have a dramatically negative effect on performance.




If you are reading literature that warns about "the cost of a function call," I would suggest it may be older material that does not reflect modern processors. Unless you are in the embedded world, the era in which C was a "portable assembly language" has essentially passed. A great deal of the chip designers' ingenuity over the past decade (say) has gone into all sorts of low-level complexities that can differ radically from the way things worked "back in the day."

+3
Oct 30 '08 at 17:44
source share

There is a great concept called "register shadowing", which allows values (up to 6?) to be passed via registers (on the CPU) instead of the stack (memory). Also, depending on the function and the variables used inside it, the compiler may simply decide that frame-management code is not required!

Also, even a C++ compiler can do tail-call optimization, i.e. if A() calls B(), and after calling B() A just returns, the compiler will reuse the stack frame!!

Of course, all this can be done only if the program sticks to the semantics of the standard (see pointer aliasing and its effect on optimizations).

+1
Sep 28 '08 at 2:43
source share

Modern CPUs are very fast (obviously!). Almost every operation involved with calls and argument passing is a full-speed instruction (indirect calls might be slightly more expensive, mostly the first time through a loop).

Function call overhead is so small that only loops that call functions can make the call overhead relevant.

Therefore, when we talk about (and measure) function call overhead today, we are usually really talking about the overhead of not being able to hoist common subexpressions out of loops. If a function has to do a bunch of (identical) work every time it is called, the compiler would be able to "hoist" it out of the loop and do it once if it were inlined. When not inlined, the code will probably just go ahead and repeat the work; you told it to!

Inlined functions seem impossibly faster, not because of call and argument overhead, but because of the common subexpressions that can be hoisted out of the function.

Example:

 Foo::result_type MakeMeFaster()
 {
   Foo t = 0;
   for (auto i = 0; i < 1000; ++i)
     t += CheckOverhead(SomethingUnpredictible());
   return t.result();
 }

 Foo CheckOverhead(int i)
 {
   auto n = CalculatePi_1000_digits();
   return i * n;
 }

The optimizer can see this nonsense and do:

 Foo::result_type MakeMeFaster()
 {
   Foo t;
   auto _hidden_optimizer_tmp = CalculatePi_1000_digits();
   for (auto i = 0; i < 1000; ++i)
     t += SomethingUnpredictible() * _hidden_optimizer_tmp;
   return t.result();
 }

The call overhead seems impossibly reduced, because in fact the optimizer hoisted most of the function out of the loop (the call to CalculatePi_1000_digits). The compiler would need to be able to prove that CalculatePi_1000_digits always returns the same result, but good optimizers can do that.

+1
Jan 15 '13 at 0:15
source share

There is not much overhead at all, especially with small (inlinable) functions, or even with classes.

The following example runs three different tests, each many, many times, and times them. The results are always within a couple of 1000ths of a unit of time of one another.

 #include <boost/timer/timer.hpp>
 #include <iostream>
 #include <cmath>

 double sum;
 double a = 42, b = 53;

 //#define ITERATIONS 1000000     // 1 million - for testing
 //#define ITERATIONS 10000000000 // 10 billion ~ 10s per run
 //#define WORK_UNIT sum += a + b
 /* output
 8.609619s wall, 8.611255s user + 0.000000s system = 8.611255s CPU(100.0%)
 8.604478s wall, 8.611255s user + 0.000000s system = 8.611255s CPU(100.1%)
 8.610679s wall, 8.595655s user + 0.000000s system = 8.595655s CPU(99.8%)
 9.5e+011 9.5e+011 9.5e+011
 */

 #define ITERATIONS 100000000 // 100 million ~ 10s per run
 #define WORK_UNIT sum += std::sqrt(a*a + b*b + sum) + std::sin(sum) + std::cos(sum)
 /* output
 8.485689s wall, 8.486454s user + 0.000000s system = 8.486454s CPU (100.0%)
 8.494153s wall, 8.486454s user + 0.000000s system = 8.486454s CPU (99.9%)
 8.467291s wall, 8.470854s user + 0.000000s system = 8.470854s CPU (100.0%)
 2.50001e+015 2.50001e+015 2.50001e+015
 */

 // ------------------------------
 double simple()
 {
    sum = 0;
    boost::timer::auto_cpu_timer t;
    for (unsigned long long i = 0; i < ITERATIONS; i++) {
       WORK_UNIT;
    }
    return sum;
 }

 // ------------------------------
 void call6() { WORK_UNIT; }
 void call5() { call6(); }
 void call4() { call5(); }
 void call3() { call4(); }
 void call2() { call3(); }
 void call1() { call2(); }

 double calls()
 {
    sum = 0;
    boost::timer::auto_cpu_timer t;
    for (unsigned long long i = 0; i < ITERATIONS; i++) {
       call1();
    }
    return sum;
 }

 // ------------------------------
 class Obj3 {
 public:
    void runIt() { WORK_UNIT; }
 };

 class Obj2 {
 public:
    Obj2() { it = new Obj3(); }
    ~Obj2() { delete it; }
    void runIt() { it->runIt(); }
    Obj3* it;
 };

 class Obj1 {
 public:
    void runIt() { it.runIt(); }
    Obj2 it;
 };

 double objects()
 {
    sum = 0;
    Obj1 obj;
    boost::timer::auto_cpu_timer t;
    for (unsigned long long i = 0; i < ITERATIONS; i++) {
       obj.runIt();
    }
    return sum;
 }

 // ------------------------------
 int main(int argc, char** argv)
 {
    double ssum = 0;
    double csum = 0;
    double osum = 0;
    ssum = simple();
    csum = calls();
    osum = objects();
    std::cout << ssum << " " << csum << " " << osum << std::endl;
 }

The output of running 10,000,000 iterations of each type (simple, six function calls, three object calls) with this semi-convoluted workload:

 sum += std::sqrt(a*a + b*b + sum) + std::sin(sum) + std::cos(sum) 

was as follows:

 8.485689s wall, 8.486454s user + 0.000000s system = 8.486454s CPU (100.0%)
 8.494153s wall, 8.486454s user + 0.000000s system = 8.486454s CPU (99.9%)
 8.467291s wall, 8.470854s user + 0.000000s system = 8.470854s CPU (100.0%)
 2.50001e+015 2.50001e+015 2.50001e+015

Using a simple workload

 sum += a + b 

gave the same results, except a couple of orders of magnitude faster in each case.

+1
Feb 09 '15 at 20:59
source share

Each function call requires a new local stack frame to be set up. But the overhead of this would only be noticeable if you are calling a function on every iteration of a loop over a very large number of iterations.

0
Sep 28 '08 at 2:14
source share

For most functions, there is no additional overhead to calling them in C++ vs C (unless you count the "this" pointer as an unnecessary argument to every function; you have to pass state to a function somehow, though)...

For virtual functions, there is an additional level of indirection (the equivalent of calling a function through a pointer in C)... But really, on today's hardware this is trivial.

0
Sep 28 '08 at 2:14
source share

I don't have numbers either, but I'm glad you're asking. Too often I see people try to optimize their code starting from vague ideas of overhead, but without really knowing.

0
Sep 28 '08 at 2:38
source share

There are several issues here.

  • If you have a smart enough compiler, it will do some automatic inlining for you even where you did not specify inline. On the other hand, there are many things that cannot be inlined.

  • If the function is virtual, then of course you pay the price that it cannot be inlined, because the target is determined at runtime. Conversely, in Java, you might be paying this price unless you indicate that the method is final.

  • Depending on how your code is organized in memory, you may be paying a cost in cache misses and even page misses as the code is located elsewhere. That can have a huge impact in some applications.

0
Sep 28 '08 at 2:39
source share

Depending on how you structure your code, dividing it into units such as modules and libraries may matter a great deal in some cases.

  • Using a dynamic library function with external linkage imposes full stack-frame processing most of the time.
    That is why using qsort from the C standard library is one order of magnitude (10 times) slower than STL code when the comparison operation is as simple as an integer comparison.
  • Passing function pointers between modules is also affected.
  • The same penalty will most likely affect the use of C++ virtual functions, as well as other functions whose code is defined in separate modules.

  • The good news is that whole-program optimization may resolve the issue for dependencies between static libraries and modules.

0
Feb 09 '16 at 11:09
source share

As others have said, you really don't need to worry much about overhead unless you are going for ultimate performance or something of that kind. When you call a function, the compiler must emit code to:

  • Save the function parameters (on the stack or in registers)
  • Save the return address (on the stack)
  • Jump to the function's start address
  • Allocate space for the function's local variables (on the stack)
  • Run the function body
  • Save the return value (in a register or on the stack)
  • Free the space used by the local variables
  • Return to the saved return address
  • Clean up the saved parameters, etc.

However, you also need to consider that inlining reduces the readability of your code, and think about how it will affect your testing strategies, maintenance plans, and the overall size of your source files.

0
Sep 23 '17 at 5:29
source share
