How to write a C program to measure cache speed?

Write a program that tries to compare (if possible) the access times for data in the cache and in main memory.

If that is possible, how would you measure the speed of each cache level?

+6
3 answers

This generally requires some knowledge of the cache geometry and other aspects of it. It also helps to have some control of the system beyond ordinary user access, and implementation-dependent facilities such as finer-grained timing than the standard C clock mechanism can provide.

Here is an initial approach:

  • Write a routine that takes a pointer to memory, a length, and a number of repetitions, and repeatedly reads all of that memory in consecutive order.
  • Write a routine that takes a pointer to memory, a length, and a number of repetitions, and repeatedly writes to all of that memory in consecutive order.
  • Both routines may have to cast their pointers to volatile to stop the compiler from optimizing away accesses that otherwise have no effect.
  • Allocate a large amount of memory.
  • Call each of the routines above, recording the current time before and after each call, and call them with different lengths to see how the time varies with length.

When you do this, you will generally see fast speeds (bytes read or written per second) for short lengths and slower speeds for longer lengths. The drops in speed occur where the sizes of the different cache levels are exceeded, so you will most likely see the L1 and L2 cache sizes reflected in the data collected this way.
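For concreteness, here is a minimal sketch of the read side of that measurement (the write side is analogous). It assumes a POSIX system with clock_gettime (on some systems you may need to link with -lrt); the buffer size, the doubling of lengths, and names such as read_memory are illustrative choices, not anything prescribed above.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Read every byte of the buffer, "repeat" times over.  The volatile
       qualifier keeps the compiler from removing reads whose results are
       otherwise unused. */
    static unsigned char read_memory(const volatile unsigned char *p,
                                     size_t length, size_t repeat)
    {
        unsigned char sink = 0;
        for (size_t r = 0; r < repeat; r++)
            for (size_t i = 0; i < length; i++)
                sink ^= p[i];
        return sink;
    }

    int main(void)
    {
        size_t max = 64 * 1024 * 1024;     /* 64 MiB, larger than typical caches */
        unsigned char *buffer = malloc(max);
        if (!buffer)
            return 1;
        memset(buffer, 1, max);            /* touch every page up front */

        for (size_t length = 4 * 1024; length <= max; length *= 2) {
            size_t repeat = max / length;  /* keep total work roughly constant */
            struct timespec t0, t1;

            clock_gettime(CLOCK_MONOTONIC, &t0);
            read_memory(buffer, length, repeat);
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double seconds = (t1.tv_sec - t0.tv_sec)
                           + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            printf("%10zu bytes: %.1f MB/s\n",
                   length, (double)length * repeat / seconds / 1e6);
        }

        free(buffer);
        return 0;
    }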

Here are some of the reasons this approach is inadequate:

  • It does not control which instructions are used to read or write memory. The C compiler may well generate load-word and store-word instructions, but many modern processors have instructions that can load and store 16 bytes at a time, and reading and writing may be faster with those instructions than with four-byte word instructions (a sketch using such instructions follows after this list).
  • The cache behaves differently when you access memory sequentially than when you access it randomly. Most caches make some attempt to keep track of when data has been used, so that recently used data is kept in the cache while other data is evicted. The access patterns of real programs usually differ from the sequential operations described above.
  • In particular, sequential writes may be able to fill entire cache lines, so that nothing needs to be read from memory, whereas a real-world usage pattern that writes just one word to a particular location may have to be implemented by reading the cache line from memory and merging in the changed bytes.
  • Competition from other processes on your system will interfere with what is in the cache and with the measurements.
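To illustrate the first point, a read loop using 16-byte SSE2 loads might look like the sketch below. This is only an x86/x86-64 illustration (compile with SSE2 enabled), and whether it actually beats plain word loads depends on the compiler and the processor.

    #include <emmintrin.h>   /* SSE2 intrinsics, x86/x86-64 only */
    #include <stddef.h>

    /* Read the buffer 16 bytes at a time with SSE2 loads; the length is
       assumed to be a multiple of 16 for simplicity. */
    static long long read_memory_sse2(const unsigned char *p, size_t length)
    {
        __m128i acc = _mm_setzero_si128();
        for (size_t i = 0; i < length; i += 16)
            acc = _mm_xor_si128(acc, _mm_loadu_si128((const __m128i *)(p + i)));

        /* Fold the accumulator so the loads cannot be optimized away. */
        long long out[2];
        _mm_storeu_si128((__m128i *)out, acc);
        return out[0] ^ out[1];
    }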
+3

You would need to come up with a heuristic that forces a 100% (or very nearly 100%) cache-miss rate (do you have cache-invalidation code?) and a 100% cache-hit rate. Great, that works for the level 1 cache. Now, how do you do the same for levels 2 and 3?

In all seriousness, it is probably not possible to do this 100% reliably without special hardware and probes attached to the CPU and memory, but here is what I would do:

Write a “bunch” of stuff to one place in memory, enough that you can be sure it stays in the L1 cache, and record the time (the timing calls themselves touch the cache, so be careful). You should do this set of writes without branches, to avoid noise from branch prediction. That is your best-case time. Now, every so often, write a cache line’s worth of data to a random, far-away location in RAM beyond the end of your known region, and record the new time. Hopefully that takes longer.

Keep doing this, recording the times, and hopefully you will see a few groups into which the timings cluster. Each of those groups “may” correspond to L2, L3, and main-memory access times. The problem is that many other things get in the way: the OS can context-switch you and trash your cache, an interrupt can arrive and fire in the middle of your timing, and plenty of other effects can throw the values off. But hopefully there is enough signal in your data to tell whether it works.

This will probably be easier to do on a simpler, embedded-style system, where the OS (if there is one) will not get in your way.
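Here is a rough sketch of that timed-write idea, assuming a POSIX system with clock_gettime and a 64-byte cache line. The buffer sizes, burst count, and names are illustrative guesses; in practice you would unroll the inner loop (to cut branch-prediction noise, as suggested above) and repeat the whole experiment many times before trusting any clusters.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define HOT_BYTES  (4 * 1024)           /* small enough to stay in L1 */
    #define FAR_BYTES  (256 * 1024 * 1024)  /* much larger than the last-level cache */
    #define LINE       64                   /* assumed cache-line size */
    #define BURSTS     1000

    static double elapsed_ns(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    }

    int main(void)
    {
        volatile unsigned char *hot     = malloc(HOT_BYTES);
        volatile unsigned char *faraway = malloc(FAR_BYTES);
        if (!hot || !faraway)
            return 1;

        /* Fault the far buffer's pages in up front so later timings
           measure cache misses rather than page faults. */
        memset((void *)faraway, 0, FAR_BYTES);

        for (int b = 0; b < BURSTS; b++) {
            struct timespec t0, t1;

            /* Baseline burst: writes that should all hit L1. */
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int i = 0; i < HOT_BYTES; i++)
                hot[i] = (unsigned char)i;
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double hot_time = elapsed_ns(t0, t1);

            /* Same burst plus one write to a random far-away cache line. */
            size_t line = (((size_t)rand() << 16) ^ (size_t)rand()) % (FAR_BYTES / LINE);
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int i = 0; i < HOT_BYTES; i++)
                hot[i] = (unsigned char)i;
            faraway[line * LINE] = 1;
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double miss_time = elapsed_ns(t0, t1);

            /* Print the extra cost of the far write; look for clusters. */
            printf("%.0f\n", miss_time - hot_time);
        }

        free((void *)hot);
        free((void *)faraway);
        return 0;
    }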

+5

Take a look at Cachegrind, part of Valgrind:

Cachegrind simulates how your program interacts with a machine's cache hierarchy and (optionally) branch predictor. It simulates a machine with independent first-level instruction and data caches (I1 and D1), backed by a unified second-level cache (L2). This exactly matches the configuration of many modern machines.
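For example (assuming Valgrind is installed; ./your_program stands in for your own binary), a typical run looks like:

    valgrind --tool=cachegrind ./your_program
    cg_annotate cachegrind.out.<pid>

Cachegrind prints overall simulated hit and miss counts when the program exits, and cg_annotate breaks them down per function and source line.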

Also take a look at the related questions on this topic; they cover similar ground.

+2

Source: https://habr.com/ru/post/943733/

