I am working on a (fairly large) existing single-user C application. As part of this work, I modified the application to do a very small amount of extra work: incrementing a counter each time a particular function is called (this function is called ~80,000 times). The application is compiled with -O3 on Ubuntu 12.04 with the 64-bit Linux kernel 3.2.0-31-generic.
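To make the change concrete, here is a minimal sketch of the kind of instrumentation I mean (the names `special_function` and `run_workload` are placeholders; the real application's code is not shown):

```c
#include <stdint.h>

static uint64_t call_count = 0;   /* the only state added by the instrumentation */

static long special_function(long x)
{
    call_count++;                 /* the entire added work: one increment */
    return x * x;                 /* stand-in for the function's real work */
}

/* Drive ~80,000 calls, matching the call count in the application. */
long run_workload(void)
{
    long acc = 0;
    for (long i = 0; i < 80000; i++)
        acc += special_function(i);
    return acc;
}
```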
Surprisingly, the instrumented version of the code is faster, and I am trying to understand why. I measure the runtime using clock_gettime(CLOCK_PROCESS_CPUTIME_ID) and get reproducible results, reporting the average runtime over more than 100 runs. Moreover, to avoid interference from the outside world, I tried to run the application on a system with as few other applications running as possible (as a side note, since CLOCK_PROCESS_CPUTIME_ID returns process time rather than wall-clock time, other applications should, in theory, affect only the caches, not the process's execution time directly).
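For reference, this is roughly how I take the measurements; `workload` here is a placeholder for the real application code:

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Placeholder for the code being timed. */
static void workload(void)
{
    volatile long sink = 0;
    for (long i = 0; i < 1000000; i++)
        sink += i;
}

/* Average CPU time of `runs` executions, in seconds.
 * CLOCK_PROCESS_CPUTIME_ID counts CPU time consumed by this process
 * only, so other processes can affect the result only indirectly
 * (e.g. cache/TLB pollution), not by stealing wall-clock time. */
double average_cpu_seconds(int runs)
{
    struct timespec t0, t1;
    double total = 0.0;
    for (int r = 0; r < runs; r++) {
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0);
        workload();
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);
        total += (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }
    return total / runs;
}
```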
I suspected instruction-cache effects: perhaps the instrumented code, being slightly larger (a few bytes), fits into the cache differently and happens to lay out better. Is this hypothesis plausible? I tried to investigate the caches with valgrind --tool=cachegrind, but unfortunately the instrumented version has (as seems logical) more cache misses than the original version.
Any hints on this topic and any ideas that could help explain why the instrumented code is faster are welcome (e.g., are some GCC optimizations applied in one case but not the other, and if so, why?).