I am creating a synthetic C benchmark aimed at triggering a large number of instruction cache misses, using the following Python script:
#!/usr/bin/env python
import tempfile
import random
import sys

if __name__ == '__main__':
    functions = list()

    for i in range(10000):
        func_name = "f_{}".format(next(tempfile._get_candidate_names()))
        sys.stdout.write("void {}() {{\n".format(func_name))
        sys.stdout.write(" double pi = 3.14, r = 50, h = 100, e = 2.7, res;\n")
        sys.stdout.write(" res = pi*r*r*h;\n")
        sys.stdout.write(" res = res/(e*e);\n")
        sys.stdout.write("}\n")
        functions.append(func_name)

    sys.stdout.write("int main() {\n")
    sys.stdout.write("unsigned int i;\n")
    sys.stdout.write("for(i =0 ; i < 100000 ;i ++ ){\n")

    for i in range(10000):
        r = random.randint(0, len(functions)-1)
        sys.stdout.write("{}();\n".format(functions[r]))

    sys.stdout.write("}\n")
    sys.stdout.write("}\n")
The script simply generates a C program consisting of many randomly named dummy functions, which are in turn called in random order from main(). I am compiling the resulting code with gcc 4.8.5 on CentOS 7 with -O0. The code runs on a dual-socket machine equipped with 2x Intel Xeon E5-2630 v3 (Haswell architecture).
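For reference, an excerpt of what the script emits looks roughly like this (the function names are random, so f_3f2k1x below is just a placeholder):

/* Illustrative excerpt of the generated code; the real file contains 10,000
 * such functions and 10,000 randomly chosen calls inside the loop. */
void f_3f2k1x() {
    double pi = 3.14, r = 50, h = 100, e = 2.7, res;
    res = pi*r*r*h;
    res = res/(e*e);
}

/* ... 9,999 more functions like the one above ... */

int main() {
    unsigned int i;
    for (i = 0; i < 100000; i++) {
        f_3f2k1x();
        /* ... 9,999 more randomly chosen calls ... */
    }
}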
I'm interested in understanding the instruction cache counters reported by perf when profiling the binary built from the generated C code (the Python script above is only used to generate that code automatically). In particular, I am observing the following counters with perf stat:
- instructions
- L1-icache-load-misses (instruction fetches that miss in L1i, aka r0280 on Haswell)
- r2424, L2_RQSTS.CODE_RD_MISS (code reads that miss in L2)
- rf824, L2_RQSTS.ALL_PF (all L2 hardware prefetcher requests, both code and data)
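As an aside (this is just my understanding of perf's raw event syntax, not part of the setup above): the rUUEE codes are simply the Haswell umask in the high byte and the event select in the low byte of the raw config. A small sketch that reproduces the codes used here:

/* Sketch: compose perf raw event codes (rUUEE) from (umask, event) pairs.
 * The event/umask values below are taken from Intel's Haswell event tables;
 * double-check them against the SDM or libpfm4 if in doubt. */
#include <stdio.h>

static unsigned raw_code(unsigned umask, unsigned event)
{
    return (umask << 8) | event;   /* bits 0-7: event select, bits 8-15: umask */
}

int main(void)
{
    printf("ICACHE.MISSES         -> r%04x\n", raw_code(0x02, 0x80)); /* r0280 */
    printf("L2_RQSTS.CODE_RD_MISS -> r%04x\n", raw_code(0x24, 0x24)); /* r2424 */
    printf("L2_RQSTS.ALL_PF       -> r%04x\n", raw_code(0xf8, 0x24)); /* rf824 */
    return 0;
}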
First, I profiled the code with all hardware prefetchers disabled in the BIOS, i.e.
- MLC Streamer disabled
- MLC Spatial Prefetcher disabled
- DCU Data Prefetcher disabled
- DCU Instruction Prefetcher disabled
and the results are as follows (the process is bound to the first core of the second socket and to the corresponding NUMA node, but I don't think that matters much):
perf stat -e instructions,L1-icache-load-misses,r2424,rf824 numactl --physcpubind=8 --membind=1 /tmp/code

 Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':

    25,108,610,204      instructions
     2,613,075,664      L1-icache-load-misses
     5,065,167,059      r2424
                17      rf824

      33.696954142 seconds time elapsed
Given the numbers above, I cannot explain such a high number of code read misses in L2. I have disabled all the prefetchers, and L2_RQSTS.ALL_PF confirms that. But why do I see almost twice as many code misses in L2 as in L1i (5,065,167,059 vs 2,613,075,664, roughly a 1.9x ratio)? In my (simple) mental model of the processor, an instruction fetch has to miss in L1i before it is ever looked up in L2, so I would expect L1i misses to be greater than or equal to L2 code misses. Clearly I'm wrong, but what am I missing?
Then I ran the same code with all hardware prefetchers enabled, i.e.
- MLC Streamer enabled
- MLC Spatial Prefetcher enabled
- DCU Data Prefetcher enabled
- DCU Instruction Prefetcher enabled
and the results are as follows:
perf stat -e instructions,L1-icache-load-misses,r2424,rf824 numactl --physcpubind=8 --membind=1 /tmp/code

 Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':

    25,109,877,626      instructions
     2,599,883,072      L1-icache-load-misses
     5,054,883,231      r2424
           908,494      rf824
Now L2_RQSTS.ALL_PF indicates that the prefetchers are indeed doing something, and although I expected the prefetching to be a bit more aggressive, I suppose the instruction prefetcher is heavily stressed by this fetch-intensive workload while the data prefetchers have little to do here. But then again, L2_RQSTS.CODE_RD_MISS is still far too high, even with prefetching turned on.
So to summarize my question:
With hardware prefetchers disabled, L2_RQSTS.CODE_RD_MISS appears to be much higher than L1-icache-load-misses. The same holds with hardware prefetchers enabled, and I cannot explain it in either case. What is the reason for such a high L2_RQSTS.CODE_RD_MISS value compared to L1-icache-load-misses?