Different cache miss counts for the same program between multiple runs

I use Cachegrind to get the number of cache misses of a statically linked program built without libc (just _start, which calls my main function, with output done via a raw syscall in asm). The program is completely deterministic; its instructions and memory references do not change from one run to another. The cache is fully associative, with LRU as the replacement policy.
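
For reference, here is a minimal sketch of the kind of program I am describing (simplified, not the exact benchmark code; x86_64 Linux), built with something like gcc -static -nostdlib:

 /* nolibc_sketch.c -- simplified sketch of the setup described above */
 /* build (assumption): gcc -static -nostdlib -o prog nolibc_sketch.c */

 static long sys3(long n, long a, long b, long c) {
     long ret;
     __asm__ volatile ("syscall"
                       : "=a"(ret)
                       : "a"(n), "D"(a), "S"(b), "d"(c)
                       : "rcx", "r11", "memory");
     return ret;
 }

 static void my_main(void) {
     static const char msg[] = "done\n";
     sys3(1, 1, (long)msg, sizeof msg - 1);   /* write(1, msg, 5) */
 }

 void _start(void) {
     my_main();
     sys3(60, 0, 0, 0);                       /* exit(0) */
 }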

However, I noticed that the number of misses sometimes varies. More specifically, the miss count is always the same until I switch to another directory:

 % cache=8 && valgrind --tool=cachegrind --I1=$((cache * 64)),$cache,64 --D1=$((cache * 64)),$cache,64 --L2=262144,4096,64 ./adpcm
 ...
 ==31352== I   refs:      216,145,010
 ...
 ==31352== D   refs:      130,481,003  (95,186,001 rd + 35,295,002 wr)
 ==31352== D1  misses:        240,004  (   150,000 rd +     90,004 wr)
 ==31352== LLd misses:             31  (        11 rd +         20 wr)

If I repeat the same command again and again, I keep getting the same results. But if I run the program from another directory:

 % cd ..
 % cache=8 && valgrind --tool=cachegrind --I1=$((cache * 64)),$cache,64 --D1=$((cache * 64)),$cache,64 --L2=262144,4096,64 ./malardalen2/adpcm
 ...
 ==31531== I   refs:      216,145,010
 ...
 ==31531== D   refs:      130,481,003  (95,186,001 rd + 35,295,002 wr)
 ==31531== D1  misses:        250,004  (   160,000 rd +     90,004 wr)
 ==31531== LLd misses:             31  (        11 rd +         20 wr)

And running from yet another directory gives yet another result.

I also did some experiments with the Pin tool; with Pin I do not even need to change the directory to get different values. However, the set of possible values seems to be very limited and is exactly the same as with Cachegrind.

My question is: what might be the sources of such differences?

My first guess is that my program is not laid out identically in memory from run to run, so that some variables stored on the same cache line in one run no longer share a line in the next. That would also explain the limited number of combinations. But, even though Cachegrind (and Pin) work with virtual addresses, I would have assumed that the OS (Linux) always hands out the same virtual addresses. Any other ideas?

Edit: As you can guess from the LLd misses, the program uses only 31 distinct cache lines. Moreover, the simulated cache holds only 8 lines (512 bytes with 64-byte lines, fully associative), so even on real hardware the difference could not be explained by the cache still being populated from a previous run: at most 8 lines could remain in L1.

Edit 2: The numbers reported by Cachegrind are not actual cache misses (read from hardware performance counters) but the result of a simulation: it simulates the cache behaviour in order to count misses. Since it is only a simulation, this is perfectly fine for my purposes, and it lets me change the cache properties (size, associativity).

Edit 3: The hardware I use is an Intel Core i7, running Linux 3.2 x86_64. The compilation flags are -static and, for some programs, -nostdlib (IIRC, I'm not at home right now).

2 answers

Linux implements address space layout randomization ( http://en.wikipedia.org/wiki/Address_space_layout_randomization ) for security reasons. You can deactivate this behavior as follows:

 echo -n "0" > /proc/sys/kernel/randomize_va_space 

You can verify this with the following example:

 #include <stdio.h>
 #include <stdint.h>

 int main() {
     char a;
     /* cast the pointer so the value matches the %u format */
     printf("%u\n", (unsigned)(uintptr_t)&a);
     return 0;
 }

You should always have the same value.

Before:

 % ./a.out
 4006500239
 % ./a.out
 819175583
 % ./a.out
 2443759599
 % ./a.out
 2432498159

After:

 % ./a.out
 4294960207
 % ./a.out
 4294960207
 % ./a.out
 4294960207
 % ./a.out
 4294960207

This also explains the varying number of cache misses: two variables that share a cache line in one run may end up on two different lines in another.
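
To make that concrete, here is a small sketch of mine (assuming the 64-byte line size from the question's Cachegrind parameters) that computes which cache line two adjacent stack bytes fall on; a shifted stack base can move them onto different lines:

 #include <stdio.h>
 #include <stdint.h>

 #define LINE 64  /* line size assumed from the question's Cachegrind flags */

 int main(void) {
     struct { char a; char b; } v;            /* two adjacent bytes on the stack */
     uintptr_t pa = (uintptr_t)&v.a;
     uintptr_t pb = (uintptr_t)&v.b;
     /* An address maps to line addr / LINE; when the stack base shifts,
        the same pair of bytes can share a line or straddle two lines. */
     printf("a -> line %llu, b -> line %llu, same line: %s\n",
            (unsigned long long)(pa / LINE), (unsigned long long)(pb / LINE),
            (pa / LINE == pb / LINE) ? "yes" : "no");
     return 0;
 }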

Edit: This does not completely solve the problem, but I think it was one of the causes. I will award the bounty to anyone who can help me solve it completely.


This seems to be known behavior in Valgrind.

I used an example that displays the base address of the stack, and I also disabled address space layout randomization.

I ran the executable twice and got the same results in both runs:

 ==15016== D   refs:       40,649  (28,565 rd + 12,084 wr)
 ==15016== D1  misses:     11,465  ( 8,412 rd +  3,053 wr)
 ==15016== LLd misses:      1,516  ( 1,052 rd +    464 wr)
 ==15016== D1  miss rate:     28.2% (  29.4%   +   25.2%  )
 ==15016== LLd miss rate:      3.7% (   3.6%   +    3.8%  )

 villar@localhost ~ $ cache=8 && valgrind --tool=cachegrind --I1=$((cache * 64)),$cache,64 --D1=$((cache * 64)),$cache,64 --L2=262144,4096,64 ./a.out

 ==15019== D   refs:       40,649  (28,565 rd + 12,084 wr)
 ==15019== D1  misses:     11,465  ( 8,412 rd +  3,053 wr)
 ==15019== LLd misses:      1,516  ( 1,052 rd +    464 wr)
 ==15019== D1  miss rate:     28.2% (  29.4%   +   25.2%  )
 ==15019== LLd miss rate:      3.7% (   3.6%   +    3.8%  )

According to the Cachegrind documentation ( http://www.cs.washington.edu/education/courses/cse326/05wi/valgrind-doc/cg_main.html ):

Another thing worth noting is that results are very sensitive. Changing the size of the valgrind.so file, the size of the program being profiled, or even the length of its name can perturb the results. Variations will be small, but don't expect perfectly repeatable results if your program changes at all. While these factors mean you shouldn't trust the results to be super-accurate, hopefully they should be close enough to be useful.

After reading this, I changed the file name and got the following:

 villar@localhost ~ $ mv a.out a.out2345345345
 villar@localhost ~ $ cache=8 && valgrind --tool=cachegrind --I1=$((cache * 64)),$cache,64 --D1=$((cache * 64)),$cache,64 --L2=262144,4096,64 ./a.out2345345345

 ==15022== D   refs:       40,652  (28,567 rd + 12,085 wr)
 ==15022== D1  misses:     10,737  ( 8,201 rd +  2,536 wr)
 ==15022== LLd misses:      1,517  ( 1,054 rd +    463 wr)
 ==15022== D1  miss rate:     26.4% (  28.7%   +   20.9%  )
 ==15022== LLd miss rate:      3.7% (   3.6%   +    3.8%  )

Renaming it back to "a.out" gave me the same results as before.

Note that changing the file name, or the path used to launch it, changes the stack base address, and this may well be the cause, in line with what Mr. Eugene said in an earlier comment:

When you change the current working directory, you also change the corresponding environment variable (and its length). Since a copy of all environment variables is usually stored just above the stack, you get a different layout of the stack variables and hence a different number of cache misses. (And the shell may change other variables besides "PWD".)
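
A quick way to observe that shift directly (my own sketch, not from the Cachegrind documentation) is to print a stack address together with the address of the environment block, then run the binary from directories whose paths have different lengths:

 #include <stdio.h>

 extern char **environ;   /* POSIX: environment strings live just above the stack */

 int main(int argc, char **argv) {
     char local;
     (void)argc;
     printf("stack local : %p\n", (void *)&local);
     printf("argv[0]     : %p\n", (void *)argv[0]);
     if (environ && environ[0])
         printf("environ[0]  : %p\n", (void *)environ[0]);
     return 0;
 }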

EDIT: the documentation also says:

Program start-up/shut-down calls a lot of functions that aren't interesting and just complicate the output. It would be nice to exclude these.

Since the simulated cache also sees this program start-up and shut-down code, it is another potential source of variation.

