Is it possible to find out the cache miss address?

Whenever a cache miss occurs, is it possible to find out the address of the cache line that missed? Are there any hardware performance counters in modern processors that can provide this information?

+6
2 answers

Yes, on modern Intel hardware there are precise memory sampling events that track not only the address of the instruction, but also the data address. These events also include a lot of other information, such as which level of the cache hierarchy satisfied the access, the total latency, and so on.

You can use perf mem to sample this information and produce a report.

For example, the following program:

    #include <stddef.h>

    #define SIZE (100 * 1024 * 1024)

    int p[SIZE] = {1};

    void do_writes(volatile int *p) {
        for (size_t i = 0; i < SIZE; i += 5) {
            p[i] = 42;
        }
    }

    void do_reads(volatile int *p) {
        volatile int sink;
        for (size_t i = 0; i < SIZE; i += 5) {
            sink = p[i];
        }
    }

    int main(int argc, char **argv) {
        do_writes(p);
        do_reads(p);
    }

compiled with:

 g++ -g -O1 -march=native perf-mem-test.cpp -o perf-mem-test 

and run with:

 sudo perf mem record -U ./perf-mem-test && sudo perf mem report 

produces a report of memory accesses sorted by latency, like the following:

[Image: perf mem report output]

The Data Symbol column shows the address the load was targeting. Most of them show up as something like p+0xa0658b4, which means an offset of 0xa0658b4 from the start of p, which makes sense since the code is reading and writing p. The list is sorted by "local weight", which is the access latency in reference cycles 1.

Note that the recorded information is only a sample of the memory accesses: recording every miss would usually be far too much information. Also, by default it only records loads with a latency of 30 cycles or more, but you can apparently tune this with command-line arguments.

If you are only interested in accesses that miss in all levels of the cache, you are looking for the "Local RAM hit" lines 2. You can probably restrict the sampling to cache misses only: I'm fairly sure the Intel memory sampling facility supports that, and I think you can tell perf mem to look only at misses.

Finally, note that here I use the -U argument after record, which tells perf mem to record only user-space events. By default it also includes kernel events, which may or may not be useful to you. For the example program, there are many kernel events associated with copying the p array from the binary into the process's writable memory.

Keep in mind that I deliberately arranged my program so that the global array p ends up in the initialized .data section (the binary is ~400 MB!), so that it shows up with the right symbol in the listing. In most cases your process will be accessing dynamically allocated or stack memory, which will just give you a raw address. Whether you can map that back to a meaningful object depends on whether you track enough information to make that possible.


1 I think it is in reference cycles, but I could be wrong, and the kernel may have already converted it to nanoseconds?

2 The "Local" and "hit" parts here refer to the fact that we hit the RAM attached to the current core, that is, we did not go to RAM attached to another socket in a multi-socket NUMA configuration.

+4

If you want to know the exact virtual or physical address of every cache miss on a particular processor, that will be very difficult, and sometimes impossible. But you are more likely to be interested in expensive memory access patterns: those that incur long latencies because they miss in one or more levels of the cache subsystem. Keep in mind that an access that misses in the cache on one processor may hit in the cache on another, depending on the design details of each processor and on the operating system.

There are several ways to find such patterns; two are commonly used. One is to use a simulator such as gem5 or Sniper. The other is to use hardware performance events. Events that count cache misses are available, but they provide no details about why or where a miss occurred. However, using a profiler, you can approximately associate cache misses, as reported by the corresponding hardware performance events, with the instructions that caused them, which in turn can be mapped back to locations in the source code using debug information. Examples of such profilers include Intel VTune Amplifier and AMD CodeXL. The results produced by simulators and profilers may not be accurate, so be careful when interpreting them.

+1

Source: https://habr.com/ru/post/969520/
