So I got a bit curious about this and did some research, mostly to see how much more usable the perf framework is now, since the last time I used it, it crashed the kernel on a shared dev machine and 25 other developers were quite unhappy with my experiments.
First, let's verify that I see what you see:
$ cc -O -o xx xx.c && perf stat -e L1-dcache-loads,L1-dcache-stores ./xx

 Performance counter stats for './xx':

        58,764,160 L1-dcache-loads
        81,640,635 L1-dcache-stores
Yup. Even bigger numbers. So what's going on? Let's record and analyze this a little better:
$ cc -O -o xx xx.c && perf record -e L1-dcache-loads,L1-dcache-stores ./xx
[... blah blah ...]
$ perf report --stdio
[... blah blah ...]
Aha, so the kernel is responsible for most of those loads and stores. The counters we get count the cache accesses made by both the kernel and userland.
What's happening is that the physical pages of the program (including the data segment and bss) are not mapped or even allocated when the program starts. The kernel faults them in when you first touch them (or later, if they were paged out). We can see this as follows:
$ cc -O -o xx xx.c && perf stat -e faults ./xx

 Performance counter stats for './xx':

            58,696 faults
We actually take 58.7k page faults during just that one run. Since the page size is 4096 bytes, we get 58696*4096=240418816, which is roughly the 240,000,000 bytes for your arrays; the rest is the program, the stack, and all kinds of junk in libc and ld.so needed to run.
So now we can figure out the numbers. Let's look at the stores first, because they should be the easiest to figure out. 80623804*0.3828=30862792.1712, so that makes sense. We expected 30 million stores, and we got 30.9. Since the performance counters sample and aren't perfectly accurate, this is expected. Some of the accesses the kernel did spilled over and got accounted to the program. In other runs, I got less than 30M counted for userland.
In the same way, userland gets 2.4M loads. I suspect those are not actually loads in userland, but accesses the kernel does when returning from traps that, for some reason, get accounted to your program. Or something like that. I'm not sure about those, and I don't like them, but let's see if we can remove that noise and check the theory that it has something to do with garbage data caused by page faults.
Here is the updated version of your test:
#include <stdio.h>
I make sure to take all the page faults during setup, and then any counters that spill during bench should have very little kernel noise in them.
$ cc -O -o xx xx.c && perf record -e faults,L1-dcache-loads,L1-dcache-stores ./xx
[...]
$ perf report --stdio
[...]
And there you have it. The page faults happened during the call to memset and some during dynamic linking; the noise that was previously in main now happens during memset, and bench itself has no loads and around 30 million stores, just as we expected. An amusing note here is that memset knows how to be efficient on this machine and did only half the stores your test did to fill the same amount of memory. The "sse2" in __memset_sse2 is a good hint as to how.
I just realized that one thing might be unclear, and I don't know where else to put it, so I'll drop it here. The performance counters do count events accurately, but as far as I know, if you want to know where those events happen, the CPU can only generate a trap once every X events. So the tools don't know exactly where each event happens (it would be too slow to run that way); instead they wait until the trap arrives and account all X events to that instruction/function. I think, but I'm not sure, that X is at least 10,000. So if the bench function touches the stack just once and that happens to trigger the L1-dcache-load spill trap, you'll account 10,000 loads to that one read from the stack. Also, as far as I know, TLB misses in the bench function (of which you'll get around 58593) are also resolved through the L1 cache and will be accounted to it. So no matter what you do, you'll never get exactly the numbers you expect here.