Why is the instrumented C code faster?

I am working on a (fairly large) existing single-threaded C application. I modified the application to do a small amount of extra work: incrementing a counter each time a particular function is called (this function is called ~80,000 times). The application is compiled with -O3 on Ubuntu 12.04 with the 64-bit Linux kernel 3.2.0-31-generic.

Surprisingly, the instrumented version of the code is faster, and I am investigating why. I measure the runtime using clock_gettime(CLOCK_PROCESS_CPUTIME_ID) and get representative results, reporting the average runtime over more than 100 runs. To avoid interference from the outside world, I also ran the application with as few other applications running as possible. (As a side note, since CLOCK_PROCESS_CPUTIME_ID returns process time rather than wall-clock time, other applications should in theory only affect the caches, not the process's execution time directly.)

I suspected instruction-cache effects: perhaps the instrumented code, being slightly larger (by a few bytes), fits into the cache differently and better. Is this hypothesis plausible? I tried to investigate the caches with valgrind --tool=cachegrind, but unfortunately the instrumented version has (as seems logical) more cache misses than the original version.

Any hints on this topic, and any ideas that could help explain why the instrumented code is faster, are welcome (e.g., some GCC optimization being applied in one case but not the other — and if so, why?).

2 answers

Since the question does not give many details, I can only suggest some factors to consider when investigating the problem.

A little extra work (such as incrementing a counter) can change the compiler's decision about whether to apply certain optimizations. The compiler does not always have enough information to make the best choice. It may optimize for speed when the bottleneck is code size. It may try to auto-vectorize a computation when not much data is processed. The compiler may not know what kind of data will be processed, or the exact processor model that will execute the code.

  • Incrementing the counter can increase the size of some loop and prevent loop unrolling. This can reduce code size (and improve code locality, which is good for the instruction or micro-op caches and for the loop buffer, allowing the processor to fetch/decode instructions quickly).
  • Incrementing the counter can increase the size of some function and prevent inlining. This can also reduce code size.
  • Incrementing the counter can prevent auto-vectorization, which again can reduce code size.

Even if this change does not affect compiler optimizations, it may change how the processor executes the code.

  • If the counter-increment code is inserted into a region full of branch targets, it can make the branch targets less dense and improve branch prediction.
  • If the counter-increment code is inserted before some particular branch target, it can improve the alignment of the branch target's address and speed up instruction fetch.
  • If the counter-increment code is placed after some data is stored but before the same data is reloaded (and store-to-load forwarding fails for some reason), the load may be able to start earlier.
  • Inserting the counter-increment code can prevent two conflicting loads from hitting the same bank of the L1 data cache.
  • Inserting the counter-increment code can change the CPU scheduler's decisions and make an execution port available just in time for some performance-critical instruction.

To explore the effects on compiler optimization, compare the generated assembly code before and after adding the counter-increment code.

To explore the CPU effects, use a profiler that lets you inspect the processor's performance counters.


Just guessing from my experience with embedded compilers: compiler optimizers look for repetitive patterns. Perhaps the extra code made the compiler see something more repetitive and structure the machine code differently. Compilers do some strange things in the name of optimization. In some languages (Perl, I think?), a "not true" condition is faster than a "true" one. Does your debugger let you single-step through a side-by-side source/assembly view? That could give some insight into what the compiler decided to do with the extra work.


Source: https://habr.com/ru/post/1437138/
