Yes, on modern Intel hardware there are precise memory sampling events that record not only the instruction address but also the data address of each sampled access. These events also capture a lot of other information, such as which level of the cache hierarchy satisfied the access, the access latency, and so on.
You can use perf mem to collect this information and produce a report.
For example, the following program:
#include <stddef.h>

#define SIZE (100 * 1024 * 1024)

int p[SIZE] = {1};

void do_writes(volatile int *p) {
  for (size_t i = 0; i < SIZE; i += 5) {
    p[i] = 42;
  }
}

void do_reads(volatile int *p) {
  volatile int sink;
  for (size_t i = 0; i < SIZE; i += 5) {
    sink = p[i];
  }
}

int main(int argc, char **argv) {
  do_writes(p);
  do_reads(p);
}
compiled with:
g++ -g -O1 -march=native perf-mem-test.cpp -o perf-mem-test
and run with:
sudo perf mem record -U ./perf-mem-test && sudo perf mem report
generates a memory access report, sorted by latency, like the following:

The Data Symbol column shows the data address that each load targeted. Most entries appear as something like p+0xa0658b4, i.e., an offset of 0xa0658b4 bytes from the start of p, which makes sense since the code reads and writes p. The list is sorted by "local weight", which is the access latency in reference cycles 1.
Note that the recorded information is only a sample of the memory accesses: recording every access would usually be far too much data. Also, by default it only records loads with a latency of 30 cycles or more, but this threshold can apparently be configured via command-line arguments.
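For example, reasonably recent versions of perf mem expose the threshold as --ldlat, which maps onto the underlying cpu/mem-loads,ldlat=N/P event; the value 50 below is just an arbitrary example (these commands need PMU hardware and root, so treat them as a sketch):

```shell
# Only sample loads that took 50+ cycles (the default threshold is 30).
sudo perf mem record --ldlat 50 -U ./perf-mem-test

# Roughly equivalent lower-level form, naming the PEBS event directly
# (-d records the data address with each sample).
sudo perf record -e cpu/mem-loads,ldlat=50/P -d -U ./perf-mem-test
```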
If you are interested only in accesses that miss in all levels of the cache, you are looking for the "Local RAM hit" lines 2. You may also be able to restrict the sampling to cache misses only: I'm fairly sure the Intel memory sampling hardware supports this, and I believe you can tell perf mem to record only misses.
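As a sketch of what that narrowing-down can look like: perf mem accepts -t to restrict recording to loads or stores, and the report can be grouped by memory level, which makes the RAM-hit lines easy to pick out (exact sort keys may vary by perf version):

```shell
# Record only load samples from user space.
sudo perf mem record -t load -U ./perf-mem-test

# Group the report by memory level (L1 hit, LFB hit, Local RAM hit, ...).
sudo perf mem report --sort=mem,symbol --stdio
```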
Finally, note that I use the -U argument after record, which instructs perf mem to record only user-space events. By default it includes kernel events, which may or may not be useful to you. For the example program, there are many kernel events associated with copying the p array from the binary into the process's writable memory.
Keep in mind that I deliberately arranged the program so that the global array p ends up in the initialized .data section (the binary is ~400 MB!), so that it shows up with the right symbol in the listing. In most cases your process will be accessing dynamically allocated or stack memory, and you will just get a raw address. Whether you can map that address back to a meaningful object depends on whether you track enough information to make that possible.
1 I think this is in reference cycles, but I could be wrong and the kernel may already convert it to nanoseconds.
2 The "local" and "hit" parts here refer to the fact that we hit RAM attached to the current core, i.e., we did not have to go to RAM attached to another socket in a multi-socket NUMA configuration.