I am creating a synthetic C benchmark aimed at triggering a large number of instruction cache misses, using the following Python script:
#!/usr/bin/env python
import tempfile
import random
import sys

if __name__ == '__main__':
    functions = list()

    for i in range(10000):
        func_name = "f_{}".format(next(tempfile._get_candidate_names()))
        sys.stdout.write("void {}() {{\n".format(func_name))
        sys.stdout.write(" double pi = 3.14, r = 50, h = 100, e = 2.7, res;\n")
        sys.stdout.write(" res = pi*r*r*h;\n")
        sys.stdout.write(" res = res/(e*e);\n")
        sys.stdout.write("}\n")
        functions.append(func_name)

    sys.stdout.write("int main() {\n")
    sys.stdout.write("unsigned int i;\n")
    sys.stdout.write("for(i =0 ; i < 100000 ;i ++ ){\n")

    for i in range(10000):
        r = random.randint(0, len(functions)-1)
        sys.stdout.write("{}();\n".format(functions[r]))

    sys.stdout.write("}\n")
    sys.stdout.write("}\n")
The script simply generates a C program consisting of many randomly named dummy functions, which are in turn called in random order from main(). I am compiling the resulting code with gcc 4.8.5 on CentOS 7 with -O0. The code runs on a dual-socket machine equipped with 2x Intel Xeon E5-2630 v3 (Haswell architecture).
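For reference, an excerpt of what the script emits looks roughly like this (the function names are random, so f_3f2k1x below is just a placeholder):

/* Illustrative excerpt of the generated code; the real file contains 10,000
 * such functions and 10,000 randomly chosen calls inside the loop. */
void f_3f2k1x() {
    double pi = 3.14, r = 50, h = 100, e = 2.7, res;
    res = pi*r*r*h;
    res = res/(e*e);
}

/* ... 9,999 more functions like the one above ... */

int main() {
    unsigned int i;
    for (i = 0; i < 100000; i++) {
        f_3f2k1x();
        /* ... 9,999 more randomly chosen calls ... */
    }
}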
I'm interested in understanding the instruction cache counters reported by perf when profiling the binary built from the generated C code (the Python script above is only used to generate that code automatically). In particular, I am observing the following counters with perf stat:
- instructions
- L1-icache-load-misses (instruction fetches that miss in L1i, aka r0280 on Haswell)
- r2424, L2_RQSTS.CODE_RD_MISS (code reads that miss in L2)
- rf824, L2_RQSTS.ALL_PF (all L2 hardware prefetcher requests, both code and data)
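As an aside (this is just my understanding of perf's raw event syntax, not part of the setup above): the rUUEE codes are simply the Haswell umask in the high byte and the event select in the low byte of the raw config. A small sketch that reproduces the codes used here:

/* Sketch: compose perf raw event codes (rUUEE) from (umask, event) pairs.
 * The event/umask values below are taken from Intel's Haswell event tables;
 * double-check them against the SDM or libpfm4 if in doubt. */
#include <stdio.h>

static unsigned raw_code(unsigned umask, unsigned event)
{
    return (umask << 8) | event;   /* bits 0-7: event select, bits 8-15: umask */
}

int main(void)
{
    printf("ICACHE.MISSES         -> r%04x\n", raw_code(0x02, 0x80)); /* r0280 */
    printf("L2_RQSTS.CODE_RD_MISS -> r%04x\n", raw_code(0x24, 0x24)); /* r2424 */
    printf("L2_RQSTS.ALL_PF       -> r%04x\n", raw_code(0xf8, 0x24)); /* rf824 */
    return 0;
}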
First, I profiled the code with all hardware prefetchers disabled in the BIOS, i.e.
- MLC Streamer disabled
- MLC Spatial Prefetcher disabled
- DCU Data Prefetcher disabled
- DCU Instruction Prefetcher disabled
and the results are as follows (the process is bound to the first core of the second socket and to the corresponding NUMA node, but I don't think that matters much):
perf stat -e instructions,L1-icache-load-misses,r2424,rf824 numactl --physcpubind=8 --membind=1 /tmp/code

 Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':

    25,108,610,204      instructions
     2,613,075,664      L1-icache-load-misses
     5,065,167,059      r2424
                17      rf824

      33.696954142 seconds time elapsed
Given the numbers above, I cannot explain such a high number of code read misses in L2. I have disabled all the prefetchers, and L2_RQSTS.ALL_PF confirms that. But why do I see almost twice as many code misses in L2 as in L1i (5,065,167,059 vs 2,613,075,664, roughly a 1.9x ratio)? In my (simple) mental model of the processor, an instruction fetch has to miss in L1i before it is ever looked up in L2, so I would expect L1i misses to be greater than or equal to L2 code misses. Clearly I'm wrong, but what am I missing?
Then I ran the same code with all hardware prefetchers enabled, i.e.
- MLC Streamer enabled
- MLC Spatial Prefetcher enabled
- DCU Data Prefetcher enabled
- DCU Instruction Prefetcher enabled
and the results are as follows:
perf stat -e instructions,L1-icache-load-misses,r2424,rf824 numactl --physcpubind=8 --membind=1 /tmp/code

 Performance counter stats for 'numactl --physcpubind=8 --membind=1 /tmp/code':

    25,109,877,626      instructions
     2,599,883,072      L1-icache-load-misses
     5,054,883,231      r2424
           908,494      rf824
Now L2_RQSTS.ALL_PF indicates that the prefetchers are indeed doing something, and although I expected the prefetching to be a bit more aggressive, I suppose the instruction prefetcher is heavily stressed by this fetch-intensive workload while the data prefetchers have little to do here. But then again, L2_RQSTS.CODE_RD_MISS is still far too high, even with prefetching turned on.
So to summarize my question:
With hardware prefetchers disabled, L2_RQSTS.CODE_RD_MISS appears to be much higher than L1-icache-load-misses. The same holds with hardware prefetchers enabled, and I cannot explain it in either case. What is the reason for such a high L2_RQSTS.CODE_RD_MISS value compared to L1-icache-load-misses?