How to demonstrate the effects of instruction cache limitations

My idea was to give an elegant code example that demonstrates the effects of instruction cache limitations. I wrote the following code fragment, which creates a large number of identical functions using template metaprogramming.

    // MAX_FUNCS and WORKLOAD are compile-time constants (defined elsewhere).
    volatile int checksum;
    void (*funcs[MAX_FUNCS])(void);

    // Each instantiation is an identical, non-inlined function.
    template <unsigned t>
    __attribute__ ((noinline)) static void work(void)
    {
        ++checksum;
    }

    // Recursively fill the jump table with pointers to work<0> .. work<MAX_FUNCS - 1>.
    template <unsigned t>
    static void create(void)
    {
        funcs[t - 1] = &work<t - 1>;
        create<t - 1>();
    }

    template <>
    void create<0>(void)
    {
    }

    int main()
    {
        create<MAX_FUNCS>();

        for (unsigned range = 1; range <= MAX_FUNCS; range *= 2)
        {
            checksum = 0;
            for (unsigned i = 0; i < WORKLOAD; ++i)
            {
                funcs[i % range]();
            }
        }

        return 0;
    }

The outer loop varies the number of different functions that are called through the jump table. For each pass of the loop, the time spent on WORKLOAD function calls is measured. Now for the results: the following chart shows the average runtime per function call in relation to the range used. The blue line shows the data measured on a Core i7 machine. The comparative measurement, depicted by the red line, was carried out on a Pentium 4 machine. But when it comes to interpreting these lines, I seem to be struggling somehow ...

[Chart: average runtime per function call vs. range; blue = Core i7, red = Pentium 4]

The only jumps in the piecewise-constant red curve occur exactly where the total memory footprint of all functions within the range exceeds the capacity of one cache level on the machine under test, which has no dedicated instruction cache. For very small ranges (below 4 in this case), however, runtime still increases with the number of functions. This may be related to branch prediction efficiency, but since every function call reduces to an unconditional jump in this case, I am not sure whether there should be any branching penalty at all.

The blue curve behaves quite differently. Runtime is constant for small ranges and then increases logarithmically. For larger ranges, the curve again approaches a constant asymptote. How can the qualitative differences between the two curves be explained accurately?

I am currently using GCC MinGW Win32 x86 v.4.8.1 with g++ -std=c++11 -ftemplate-depth=65536 and no compiler optimization.

Any help would be greatly appreciated. I am also interested in any ideas on how to improve the experiment itself. Thanks in advance!

1 answer

First, let me say that I really like how you approached this problem; it is a genuinely neat solution for intentionally inflating the code footprint. However, there are still a few possible problems with your test:

  • You are also measuring warm-up time. You have not indicated where you placed your timing calls, but if they are only around the inner loop, then the first time you reach range/2 you will still benefit from the warm-up of the previous outer iteration. Instead, measure only warm performance: run each inner iteration several times (add another loop in the middle) and take the timestamp only after 1-2 rounds (see the sketch after this list).

  • You claim to have crossed several cache levels, but your L1 cache is only 32 KB, which is where your graph ends. Even assuming that counts in terms of "range", each function is ~21 bytes (at least on my gcc 4.8.1), so you will reach at most 256 KB, which is only beginning to scratch the size of your L2.

  • You did not specify your CPU model (the i7 has had at least four generations on the market by now: Haswell, IvyBridge, SandyBridge and Nehalem). The differences are quite large, for example an additional uop cache since SandyBridge, with intricate storage rules and conditions. Your baseline also adds complexity: if I remember correctly, the P4 had a trace cache, which could also produce all sorts of performance impacts. You should check for an option to disable these features where possible.

  • Don't forget the TLB: although it probably does not play a role here in such tightly packed code, the number of unique 4 KB pages should not exceed the ITLB (128 entries, i.e. roughly 512 KB of code reach), and even before that you may run into conflicts if your OS does not spread the physical code pages well enough to avoid ITLB collisions.
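
To make the first point concrete, here is a minimal sketch of such a warm measurement (my own illustration, not code from the question; it assumes funcs, MAX_FUNCS and WORKLOAD are set up as in the fragment above, and uses std::chrono for the timestamps):

    #include <chrono>
    #include <cstdio>
    #include <ratio>

    // Assumed to exist as in the question's fragment.
    extern void (*funcs[])(void);

    static void measure(unsigned range, unsigned workload)
    {
        using clk = std::chrono::steady_clock;
        const int rounds = 4;       // rounds 0-1 warm the caches, rounds 2-3 are timed
        clk::time_point start;

        for (int round = 0; round < rounds; ++round)
        {
            if (round == 2)         // caches, branch predictors and ITLB are warm now
                start = clk::now();
            for (unsigned i = 0; i < workload; ++i)
                funcs[i % range]();
        }

        double ns = std::chrono::duration<double, std::nano>(clk::now() - start).count();
        std::printf("range %8u: %6.2f ns/call\n", range, ns / (2.0 * workload));
    }

Calling measure(range, WORKLOAD) from the outer loop, instead of timing the cold first pass, separates the warm-up from the steady-state behaviour you are actually after.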


Source: https://habr.com/ru/post/953108/

