Effect of compiler optimizations on FLOPs and L2/L3 cache miss rates, measured with PAPI

As an assignment, we have to compile some code (which we are told to treat as a black box) with various compiler optimization flags (-O1 and -O3) as well as vectorization flags (-xhost and -no-vec), and observe the changes in:

  • execution time
  • floating-point operations (FLOPs)
  • L2 and L3 cache miss rates
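
For reference, the counters were collected with PAPI roughly along the following lines. This is only a minimal sketch, not our actual harness: the real code is a black box, so compute_kernel() below is a hypothetical stand-in streaming loop, and whether all four presets can be counted at once depends on the CPU.

    /* Minimal PAPI sketch (build e.g.: icc -O3 -xhost papi_demo.c -lpapi). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    #define N (1 << 24)
    static double a[N];

    /* Hypothetical stand-in for the black-box workload: stream over an array. */
    static double compute_kernel(void)
    {
        double s = 0.0;
        for (long i = 0; i < N; i++)
            s += a[i];
        return s;
    }

    int main(void)
    {
        int evset = PAPI_NULL;
        int events[4] = { PAPI_L2_TCA, PAPI_L2_TCM, PAPI_L3_TCA, PAPI_L3_TCM };
        long long c[4];

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
            fprintf(stderr, "PAPI_library_init failed\n");
            return EXIT_FAILURE;
        }
        if (PAPI_create_eventset(&evset) != PAPI_OK ||
            PAPI_add_events(evset, events, 4) != PAPI_OK) {
            fprintf(stderr, "could not add events (preset availability varies by CPU)\n");
            return EXIT_FAILURE;
        }

        long long t0 = PAPI_get_real_usec();
        PAPI_start(evset);
        volatile double result = compute_kernel();  /* keep the work from being optimized away */
        PAPI_stop(evset, c);
        long long t1 = PAPI_get_real_usec();

        printf("Usecs wall clock time: %lld (result %f)\n", t1 - t0, result);
        printf("L2 accesses %lld, misses %lld, miss rate %f\n",
               c[0], c[1], (double)c[1] / c[0]);
        printf("L3 accesses %lld, misses %lld, miss rate %f\n",
               c[2], c[3], (double)c[3] / c[2]);
        return EXIT_SUCCESS;
    }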

After enabling these optimizations we saw a drop in execution time, which was expected given everything the compiler does to the code for the sake of efficiency. However, we also saw a drop in the number of FLOPs; we understand that this is a good thing, but we do not know why it happened. In addition, we saw an increase in the L2 cache miss rate (growing as the optimization level increases) that we cannot explain, with no significant increase in the number of cache accesses and almost no change at the L3 level.

Running without vectorization or any optimization at all gave a better result in terms of L2 cache miss rate, and we were wondering whether you could give us some insight, as well as pointers to documentation, literature and other resources we can use to deepen our knowledge of this topic.

Thanks.

Edit: compiler options used:

  • -O0 -no-vec (highest run time, lowest L2 cache misses)
  • -O1 (lower run time, higher L2 cache misses)
  • -O3 (even lower run time, even higher L2 cache misses)
  • -xhost (run time on the same order as -O3, highest L2 cache misses)

Update:

Despite a slight decrease in total L2 cache accesses, there was a significant increase in actual misses.

C -O0 -no-vec

Usecs wall clock time: 13,957,075

  • L2 cache misses: 207,460,564
  • L2 cache accesses: 1,476,540,355
  • L2 cache miss rate: 0.140504
  • L3 cache misses: 24,841,999
  • L3 cache accesses: 207,460,564
  • L3 cache miss rate: 0.119743

C -xhost

Usecs wall clock time: 4,465,243

  • L2 cache misses: 547,305,377
  • L2 cache accesses: 1,051,949,467
  • L2 cache miss rate: 0.520277
  • L3 cache misses: 86,919,153
  • L3 cache accesses: 547,305,377
  • L3 cache miss rate: 0.158813
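
(For clarity, the miss-rate figures are simply misses divided by accesses: 207,460,564 / 1,476,540,355 ≈ 0.1405 and 547,305,377 / 1,051,949,467 ≈ 0.5203. The L3 access counts equal the L2 miss counts, since only L2 misses go on to L3.)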
2 answers

Regarding the reduced number of floating-point operations:
With optimization enabled, the compiler can hoist common computations out of loops, fold constants, pre-evaluate expressions, and so on.
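
As a generic illustration (not the asker's code), this is the kind of transformation involved:

    /* As written: three FP multiplies per iteration. */
    void scale_naive(double *out, const double *in, int n, double scale)
    {
        for (int i = 0; i < n; i++)
            out[i] = in[i] * (scale * 2.0 * 3.141592653589793);
    }

    /* Roughly what the optimizer produces: the loop-invariant subexpression is
     * hoisted and evaluated once, leaving a single multiply per iteration --
     * so the hardware counts fewer FP operations for the same result. */
    void scale_hoisted(double *out, const double *in, int n, double scale)
    {
        const double k = scale * 2.0 * 3.141592653589793;
        for (int i = 0; i < n; i++)
            out[i] = in[i] * k;
    }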

Regarding the increased cache miss rate:
If the compiler vectorizes and each load brings in a full vector width of data, it issues far fewer memory loads. But whenever it touches a cache line in a way the prefetcher did not anticipate, it still causes a cache miss.
Taken together, you have fewer loads but roughly the same number of cache lines touched, so the miss rate can be higher.
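
To put rough numbers on that (purely illustrative, not derived from the figures in the question): with 64-byte cache lines and 8-byte doubles, a scalar streaming loop issues 8 loads per line, so one miss per line is at most 1/8 = 12.5% of loads; a 256-bit vectorized loop issues only 2 loads per line, so the very same one miss per line now shows up as 1/2 = 50%. The absolute miss count is unchanged; only the denominator shrank.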


EOF's answer has a good explanation for the smaller number of floating-point operations (-ffast-math-style combining of operations), so I will just answer the other part.


There is no information in the question about which CPU microarchitecture was used, but judging from the tags it is at least an Intel CPU.

Intel CPUs have some prefetch logic that fetches into L1, and more complex logic that prefetches into L2 (from L3 or main memory). Each core has its own L2, while the levels below it in the cache hierarchy are shared, so L2 is the obvious place to put the main prefetch logic.

If you are reading more slowly than memory bandwidth limits allow, your loads will hit in L2, because the hardware prefetcher will already have pulled those lines into L2. If the prefetcher cannot keep up, you get L2 cache misses.
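
To illustrate "cannot keep up" (assumed numbers, not measurements): if scalar code consumes one 64-byte line every ~16 cycles, the streamer has plenty of time to stay ahead; if the vectorized version consumes a line every 2-4 cycles, the prefetcher has far less lead time, and any line that has not reached L2 by the time the demand load arrives is counted as an L2 miss, even if it is already in flight from L3 or DRAM.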

Fewer, wider loads instead of many scalar loads also mean that the miss percentage will look worse with vectors (EOF's answer already made this point). This effect does not explain the increase in the absolute number of L2 misses, though, only (part of) the change in the miss percentage. It is still important to keep in mind when looking at the data, however.


From the Intel optimization manual (links in the tag wiki), section 2.3.5.4, Data Prefetching:

Data Prefetch to the L2 and Last Level Cache

Streamer: this prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses ... When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page.

  • The streamer may issue two prefetch requests on every L2 lookup. The streamer can run up to 20 lines ahead of the load request.
  • Adjusts dynamically to the number of outstanding requests per core. If there are not many outstanding requests, the streamer prefetches further ahead. If there are many outstanding requests, it prefetches into the LLC only, and less far ahead.
  • When cache lines are far ahead, it prefetches into the last-level cache only and not into L2. This avoids replacing useful cache lines in the L2 cache.
  • Detects and maintains up to 32 streams of data accesses. For each 4K-byte page, one forward and one backward stream can be maintained.

This is from the Sandybridge section, but the Haswell and Skylake sections do not go into much detail about changes to prefetching. They say "improved prefetching", but apparently the basic design is the same, just with better heuristics and/or better tuning of the existing heuristics, and the like.
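
For example (my own sketch, not from the manual), interleaving reads across more concurrent streams than the streamer tracks leaves some of them without hardware prefetch, so their loads turn into demand misses:

    /* Illustrative sketch: round-robin over more streams than the quoted
     * limit of 32 tracked streams, so some get no hardware prefetch. */
    #define NSTREAMS 40            /* more than 32 tracked streams */
    #define LEN      (1 << 18)     /* 2 MiB per stream */
    static double data[NSTREAMS][LEN];

    double sum_many_streams(void)
    {
        double s = 0.0;
        for (long i = 0; i < LEN; i++)
            for (int j = 0; j < NSTREAMS; j++)   /* touch all 40 streams each step */
                s += data[j][i];
        return s;
    }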


Thanks to @HansPassant: his comment on this question got me thinking about prefetching not keeping up.

