Inspired by Meyers, I read about CPU caches and wanted to run an experiment demonstrating the effects he mentions. Here is what I tried:
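#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    typedef uint8_t data_t;
    const uint64_t max = (uint64_t)1 << 30;  /* array size: 2^30 one-byte elements */
    const unsigned cycles = 1000;            /* number of passes over the array */
    const uint64_t step = 63;                /* stride in bytes; compared against 64 */
    volatile data_t acu = 0;                 /* volatile so the reads are not optimized away */
    volatile data_t *arr = malloc(sizeof(data_t) * max);
    if (arr == NULL) {
        perror("malloc");
        return 1;
    }

    /* Touch every element once so the memory is actually backed by pages. */
    for (uint64_t i = 0; i < max; ++i)
        arr[i] = ~i;

    /* Strided reads: this is the part being timed. */
    for (unsigned c = 0; c < cycles; ++c)
        for (uint64_t i = 0; i < max; i += step)
            acu += arr[i];

    printf("%" PRIu64 "\n", max);  /* PRIu64 is the portable format for uint64_t */
    free((void *)arr);
    return 0;
}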
And then simply gcc --std=c99 -O0 test.c && time ./a.out. I checked, and my processor's cache lines are 64 bytes long. By setting step = 64, I was trying to cause more cache misses than with step = 63.
However, step = 63 runs a little faster. I suspect that I am a “victim” of prefetching, because my RAM is being read sequentially.
How can I improve my array-traversal example to demonstrate the cost of cache misses?
Vorac