I use Cachegrind to get the number of cache misses of a statically linked program compiled without libc (just a _start that calls my main function, plus an exit syscall in asm). The program is completely deterministic; the instructions and memory references do not change from one run to another. The cache is fully associative with LRU as the replacement policy.
However, I noticed that the number of misses sometimes varies. More specifically, the number of misses is always the same until I switch to another directory:
    % cache=8 && valgrind --tool=cachegrind --I1=$((cache * 64)),$cache,64 --D1=$((cache * 64)),$cache,64 --L2=262144,4096,64 ./adpcm
    ...
    ==31352== I refs: 216,145,010
    ...
    ==31352== D refs: 130,481,003 (95,186,001 rd + 35,295,002 wr)
    ==31352== D1 misses: 240,004 ( 150,000 rd + 90,004 wr)
    ==31352== LLd misses: 31 ( 11 rd + 20 wr)
If I repeat the same command again and again, I keep getting the same results. But if I run the program from another directory:
    % cd ..
    % cache=8 && valgrind --tool=cachegrind --I1=$((cache * 64)),$cache,64 --D1=$((cache * 64)),$cache,64 --L2=262144,4096,64 ./malardalen2/adpcm
    ...
    ==31531== I refs: 216,145,010
    ...
    ==31531== D refs: 130,481,003 (95,186,001 rd + 35,295,002 wr)
    ==31531== D1 misses: 250,004 ( 160,000 rd + 90,004 wr)
    ==31531== LLd misses: 31 ( 11 rd + 20 wr)
And I get yet another result from yet another directory.
I also did some experiments with a Pin tool, and with it I do not even need to change the directory to get different values. But the set of possible values seems to be very limited, and it is exactly the same as with Cachegrind.
My question is: what might be the sources of such differences?
My first guess is that my program is not aligned identically in memory from one run to another, and as a result some variables that shared a cache line in a previous run no longer do. That would also explain the limited number of combinations. But since Cachegrind (and Pin) work on virtual addresses, I would assume that the OS (Linux) always hands out the same virtual addresses. Any other ideas?
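To make the alignment guess concrete, here is a minimal Python sketch (not taken from my program; the function name and the addresses are hypothetical) that counts how many 64-byte cache lines an object occupies depending on where it starts:

```python
LINE_SIZE = 64  # cache line size in bytes, matching the Cachegrind options above


def lines_touched(base, size, line_size=LINE_SIZE):
    """Return the number of distinct cache lines covered by [base, base + size)."""
    first_line = base // line_size
    last_line = (base + size - 1) // line_size
    return last_line - first_line + 1


if __name__ == "__main__":
    # Hypothetical addresses: the same 64-byte object, aligned and shifted by 16 bytes.
    for base in (0x1000, 0x1010):
        print(hex(base), lines_touched(base, 64))
```

If the same data starts at a different offset within a line, it can straddle one more line than before, which would change how many lines compete for the 8 entries of the cache.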
Edit: As you can guess from the LLd misses, the program uses only 31 different cache lines. Also, the cache can hold only 8 lines. So even on real hardware, the difference could not be explained by the cache already being populated the second time (at most 8 lines could remain in L1).
Edit 2: The Cachegrind report is not based on actual cache misses (as given by hardware performance counters) but is the result of a simulation. Basically, it simulates the behavior of a cache in order to count the number of misses. Since a cache only affects timing and not the program's results, this is perfectly fine, and it lets you change the cache properties (size, associativity).
Edit 3: The hardware I am using is an Intel Core i7 running Linux 3.2 x86_64. The compilation flags are -static and, for some programs, -nostdlib (IIRC, I'm not at home right now).
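For illustration, the kind of simulation I mean can be sketched in a few lines of Python. This is only a toy model of a fully associative LRU cache with 8 lines of 64 bytes, not Cachegrind's actual implementation; the function names and the access pattern are made up for the example:

```python
from collections import OrderedDict


def count_misses(addresses, num_lines=8, line_size=64):
    """Simulate a fully associative LRU cache and return the number of misses."""
    cache = OrderedDict()  # keys are line numbers, ordered from least to most recently used
    misses = 0
    for addr in addresses:
        line = addr // line_size
        if line in cache:
            cache.move_to_end(line)  # hit: mark the line as most recently used
        else:
            misses += 1
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
    return misses


def sweep(base, rounds=100, buf_size=512, word=8):
    """Hypothetical trace: read a 512-byte buffer word by word, `rounds` times."""
    for _ in range(rounds):
        for offset in range(0, buf_size, word):
            yield base + offset


if __name__ == "__main__":
    print(count_misses(sweep(0x1000)))  # buffer aligned on a line boundary
    print(count_misses(sweep(0x1010)))  # same buffer shifted by 16 bytes
```

In this toy trace, the aligned buffer fits in exactly 8 lines and only takes the initial cold misses, while the shifted buffer spans 9 lines and misses on every line change under LRU. The simulation is deterministic for a given address trace, so the miss count can only change if the addresses themselves change.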