Understanding the number of retired loads and stores in an x86 microbenchmark

I have been using perf recently, and I got some results that I cannot understand. In particular, the number of retired loads and stores does not match my expectations.

I wrote a very simple microbenchmark to check whether the results make sense in a trivial case:

    #include <stdio.h>
    #include <sys/types.h>   /* for ssize_t */

    #define STREAM_ARRAY_SIZE 10000000

    static double a[STREAM_ARRAY_SIZE], b[STREAM_ARRAY_SIZE], c[STREAM_ARRAY_SIZE];

    int main(){
        ssize_t j;
        for (j = 0; j < STREAM_ARRAY_SIZE; j++) {
            a[j] = 1.0;
            b[j] = 2.0;
            c[j] = 0.0;
        }
        return 0;
    }

Compiled with gcc 4.6.3:

 gcc -Wall -O benchmark.c -o benchmark 

and main compiles into a very simple piece of assembly (obtained with objdump -d):

    00000000004004b4 <main>:
      4004b4: b8 00 00 00 00          mov    $0x0,%eax
      4004b9: 48 be 00 00 00 00 00    movabs $0x3ff0000000000000,%rsi
      4004c0: 00 f0 3f
      4004c3: 48 b9 00 00 00 00 00    movabs $0x4000000000000000,%rcx
      4004ca: 00 00 40
      4004cd: ba 00 00 00 00          mov    $0x0,%edx
      4004d2: 48 89 34 c5 40 10 60    mov    %rsi,0x601040(,%rax,8)
      4004d9: 00
      4004da: 48 89 0c c5 40 c4 24    mov    %rcx,0x524c440(,%rax,8)
      4004e1: 05
      4004e2: 48 89 14 c5 40 78 e9    mov    %rdx,0x9e97840(,%rax,8)
      4004e9: 09
      4004ea: 48 83 c0 01             add    $0x1,%rax
      4004ee: 48 3d 80 96 98 00       cmp    $0x989680,%rax
      4004f4: 75 dc                   jne    4004d2 <main+0x1e>
      4004f6: b8 00 00 00 00          mov    $0x0,%eax
      4004fb: c3                      retq
      4004fc: 90                      nop
      4004fd: 90                      nop
      4004fe: 90                      nop
      4004ff: 90                      nop

The three movs into memory correspond to the stores into the three different arrays. I would expect the store count to be very close to 30M (10,000,000 iterations with 3 stores each), and almost no loads, since I am just initializing the three arrays. However, these are the results I get on a Sandy Bridge machine:

    $ perf stat -e L1-dcache-loads,L1-dcache-stores ./benchmark

     Performance counter stats for './benchmark':

        46,017,360 L1-dcache-loads
        75,985,205 L1-dcache-stores

And these are the results on a Nehalem machine:

    $ perf stat -e L1-dcache-loads,L1-dcache-stores ./benchmark

     Performance counter stats for './benchmark':

        45,255,731 L1-dcache-loads
        60,164,676 L1-dcache-stores

How are retired loads and stores accounted for each mov operation that targets memory? Why are there so many loads, even though no data is actually read from memory?

1 answer

So I got a little curious about this and did some research. Mostly to find out how much more usable perf has become since the last time I used it (back then it crashed the kernel on a shared dev machine, and 25 other developers were quite unhappy about my experiments).

First, let me make sure that I see what you see:

    $ cc -O -o xx xx.c && perf stat -e L1-dcache-loads,L1-dcache-stores ./xx

     Performance counter stats for './xx':

        58,764,160 L1-dcache-loads
        81,640,635 L1-dcache-stores

Yep. Big numbers indeed. So what's going on? Let's record and analyze this a bit more thoroughly:

    $ cc -O -o xx xx.c && perf record -e L1-dcache-loads,L1-dcache-stores ./xx
    [... blah blah ...]
    $ perf report --stdio
    [... blah blah ...]
    # Samples: 688  of event 'L1-dcache-loads'
    # Event count (approx.): 56960661
    #
    # Overhead  Command      Shared Object  Symbol
    # ........  .......  .................  ........
    #
        95.80%       xx  [kernel.kallsyms]  [k] 0xffffffff811176ee
         4.20%       xx  xx                 [.] main

    # Samples: 656  of event 'L1-dcache-stores'
    # Event count (approx.): 80623804
    #
    # Overhead  Command      Shared Object  Symbol
    # ........  .......  .................  ........
    #
        61.72%       xx  [kernel.kallsyms]  [k] 0xffffffff811176ee
        38.28%       xx  xx                 [.] main

Aha, so the kernel is responsible for most of those loads and stores. The counters we get include cache accesses made by both the kernel and user space.
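As a quick cross-check (depending on your perf version and PMU), perf's event modifiers let you restrict counting to one privilege level; :u counts only user-space events and :k only kernel ones, which filters out most of that kernel contribution:

    $ perf stat -e L1-dcache-loads:u,L1-dcache-stores:u ./xx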

What happens is that the physical pages of the program (including the data segment and bss) are not mapped, or even allocated, when the program starts. The kernel faults them in the first time you touch them (and again later if they get paged out). We can see this as follows:

    $ cc -O -o xx xx.c && perf stat -e faults ./xx

     Performance counter stats for './xx':

            58,696 faults

We actually take 58.7k page faults just during this run. Since the page size is 4096 bytes, that gives 58696 * 4096 = 240,418,816 bytes: roughly the 240,000,000 bytes of your three arrays (3 * 10,000,000 doubles * 8 bytes), with the remainder covering the program itself, the stack, and all the miscellany in libc and ld.so needed to get things running.
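If you want to see this first-touch behavior in isolation, a minimal sketch along these lines should do it (assuming 4096-byte pages and a malloc large enough to be served by fresh anonymous memory):

    /* demand_paging.c: touch one byte per page of a fresh allocation.
     * Each first write should take a minor page fault, so
     * `perf stat -e faults ./demand_paging` should report close to
     * NPAGES faults, plus a small constant for the runtime itself. */
    #include <stdlib.h>

    #define NPAGES   10000
    #define PAGESIZE 4096

    int main(void)
    {
        size_t i;
        char *p = malloc((size_t)NPAGES * PAGESIZE);  /* ~40 MB, untouched */
        if (p == NULL)
            return 1;
        for (i = 0; i < NPAGES; i++)
            p[i * PAGESIZE] = 1;  /* first write faults the page in */
        free(p);
        return 0;
    }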

So now we can make sense of the numbers. Look at the stores first, since they should be the easiest to pin down: 80623804 * 0.3828 = 30,862,792, so that checks out. We expected 30 million stores and got 30.9 million. Since the performance counters here are sampled, they are not perfectly accurate, so that much error is expected; some events that the kernel actually caused got attributed to the program. In other runs I got counts slightly below 30 million for the userland share.

By the same arithmetic, userland gets 2.4M loads. I suspect those are not actually loads executed in userland, but rather kernel accesses on the return path from traps that somehow get accounted to your program. Or something like that. I'm not sure about this, and I don't like it, but let's remove this noise and test the theory that it has something to do with the kernel touching pages on page faults.

Here is an updated version of your test:

    #include <stdio.h>
    #include <string.h>      /* for memset */
    #include <sys/types.h>   /* for ssize_t */

    #define STREAM_ARRAY_SIZE 10000000

    static double a[STREAM_ARRAY_SIZE], b[STREAM_ARRAY_SIZE], c[STREAM_ARRAY_SIZE];

    void setup(void)
    {
        memset(a, 0, sizeof a);
        memset(b, 0, sizeof b);
        memset(c, 0, sizeof c);
    }

    void bench(void)
    {
        ssize_t j;
        for (j = 0; j < STREAM_ARRAY_SIZE; j++) {
            a[j] = 1.0;
            b[j] = 2.0;
            c[j] = 0.0;
        }
    }

    int main(int argc, char **argv)
    {
        setup();
        bench();
        return 0;
    }

The idea is that setup takes all the page faults up front, so the counters that fire during bench should contain very little kernel noise.

    $ cc -O -o xx xx.c && perf record -e faults,L1-dcache-loads,L1-dcache-stores ./xx
    [...]
    $ perf report --stdio
    [...]
    # Samples: 468  of event 'faults'
    # Event count (approx.): 58768
    #
    # Overhead  Command      Shared Object  Symbol
    # ........  .......  .................  .................
    #
        99.20%       xx  libc-2.12.so       [.] __memset_sse2
         0.69%       xx  ld-2.12.so         [.] do_lookup_x
         0.08%       xx  ld-2.12.so         [.] dl_main
         0.02%       xx  ld-2.12.so         [.] _dl_start
         0.01%       xx  ld-2.12.so         [.] _start
         0.01%       xx  [kernel.kallsyms]  [k] 0xffffffff8128f75f

    # Samples: 770  of event 'L1-dcache-loads'
    # Event count (approx.): 61518838
    #
    # Overhead  Command      Shared Object  Symbol
    # ........  .......  .................  .................
    #
        96.14%       xx  [kernel.kallsyms]  [k] 0xffffffff811176ee
         3.86%       xx  libc-2.12.so       [.] __memset_sse2

    # Samples: 866  of event 'L1-dcache-stores'
    # Event count (approx.): 98243116
    #
    # Overhead  Command      Shared Object  Symbol
    # ........  .......  .................  .................
    #
        53.69%       xx  [kernel.kallsyms]  [k] 0xffffffff811176ee
        30.62%       xx  xx                 [.] bench
        15.69%       xx  libc-2.12.so       [.] __memset_sse2

And there you have it. The page faults now happen during the memset call (plus a few during dynamic linking), the load noise that previously landed in main now lands in memset, and bench itself has no loads and about 30 million stores. Just as we expected. It is interesting to note that memset knows how to be efficient on this machine: it issues only half as many stores as your loop to fill the same amount of memory. The "sse2" in __memset_sse2 is a good hint as to how.
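To illustrate the mechanism (a rough sketch of my own, not glibc's actual implementation): a single SSE2 store writes 16 bytes, i.e. two doubles, per instruction, so a vectorized fill retires roughly half as many stores as the scalar loop:

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stddef.h>

    /* Fill n doubles with v using 16-byte stores:
     * about n/2 store instructions instead of n. */
    static void fill_sse2(double *p, double v, size_t n)
    {
        __m128d vv = _mm_set1_pd(v);   /* broadcast v into both lanes */
        size_t i = 0;
        for (; i + 2 <= n; i += 2)
            _mm_storeu_pd(p + i, vv);  /* one store covers two doubles */
        for (; i < n; i++)             /* scalar tail for odd n */
            p[i] = v;
    }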

I just realized one thing that might be unclear, and I don't know where else to put it, so I'll drop it here. The performance counters count events accurately, but as far as I know, if you want to know where those events happen, the CPU can only generate a trap after every X events counted. So the tools don't know exactly where the events happen (it would be too slow to run that way); instead, we wait for the trap and attribute all X events to that instruction/function. I think, but I'm not sure, that X is at least 10,000. So if the bench function just touches the stack once, and that touch happens to generate the overflow trap for L1-dcache-loads, you'll account 10,000 loads to that one read of the stack. Also, as far as I know, TLB misses (of which you'll get around 58,593 here) inside the bench function are resolved through the L1 cache and will be accounted to it. So no matter what you do, you'll never get exactly the numbers you expect here.
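As an aside, perf record lets you choose that period yourself with -c (the sample period, i.e. trap every N events instead of letting perf auto-tune the sampling frequency), trading attribution granularity against interrupt overhead:

    $ perf record -c 10000 -e L1-dcache-loads,L1-dcache-stores ./xx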

