Roofline model: calculating operational intensity

Say I have a toy loop like this

    float x[N];
    float y[N];
    for (int i = 1; i < N-1; i++)
        y[i] = a*(x[i-1] - x[i] + x[i+1]);

And I assume that my cache line is 64 bytes (i.e. big enough). Then I will have (per iteration) basically 2 RAM accesses and 3 FLOP:

  • 1 (cached) read access: load all 3 of x[i-1], x[i], x[i+1]
  • 1 write access: store y[i]
  • 3 FLOP (1 mul, 1 add, 1 sub)

Ergo the operational intensity is

OI = 3 FLOP / (2 * 4 BYTE)

Now, what happens if I do something like this

    float x[N];
    for (int i = 1; i < N-1; i++)
        x[i] = a*(x[i-1] - x[i] + x[i+1]);

Note that there is no y any more. Does this mean that I now have a single RAM access

  • 1 (cached) read / write: load x[i-1], x[i], x[i+1], store x[i]

or still 2 RAM accesses

  • 1 (cached) read: load x[i-1], x[i], x[i+1]
  • 1 (cached) write: store x[i]

The operational intensity OI would be different in the two cases. Can anyone say which one is right, or clarify some things? Thanks

1 answer

Disclaimer: I had never heard of the roofline performance model until now. As far as I can tell, it tries to compute a theoretical bound on the "arithmetic intensity" of an algorithm, which is the number of FLOPs per byte of data accessed. Such a measure may be useful for comparing similar algorithms as the size N grows, but it is not very helpful for predicting real-world performance.

As a rule of thumb, modern processors can execute instructions much faster than they can fetch or store data (this becomes far more pronounced once the data grows larger than the caches). So, contrary to what one might expect, a loop with higher arithmetic intensity may run much faster than a loop with lower arithmetic intensity; what matters most as N scales is the total amount of data touched (this holds as long as memory remains much slower than the processor, as is the case in today's desktop and server systems).
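The memory-bound behaviour described above is what the roofline model condenses into one bound: attainable FLOP/s is the smaller of the peak compute rate and memory bandwidth times OI. Below is a minimal sketch in C. The function names and the machine numbers used in the note after the code are hypothetical, picked only for illustration, and do not describe any real CPU.

```c
#include <assert.h>

/* Roofline bound: attainable FLOP/s = min(peak, bandwidth * OI).
 * peak_flops     : peak compute rate in FLOP/s (assumed, machine-specific)
 * bw_bytes_per_s : sustained memory bandwidth in bytes/s (assumed)
 * oi             : operational intensity in FLOP/byte */
static double roofline(double peak_flops, double bw_bytes_per_s, double oi)
{
    assert(peak_flops > 0.0 && bw_bytes_per_s > 0.0 && oi >= 0.0);
    double mem_bound = bw_bytes_per_s * oi;
    return mem_bound < peak_flops ? mem_bound : peak_flops;
}

/* OI at which a loop stops being memory-bound under this model. */
static double ridge_point(double peak_flops, double bw_bytes_per_s)
{
    assert(peak_flops > 0.0 && bw_bytes_per_s > 0.0);
    return peak_flops / bw_bytes_per_s;
}
```

For a made-up machine with 100 GFLOP/s peak and 20 GB/s bandwidth, roofline(100e9, 20e9, 3.0 / 8.0) puts the first loop from the question far below peak, because its OI of 3/8 sits well under the ridge point of 5 FLOP/byte.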

In short, x86 processors are unfortunately too complex to be described accurately by such a simple model. A memory access goes through several levels of cache (usually L1, L2 and L3) before it reaches RAM. Perhaps all your data fits in L1; then the second time you run your loop(s), there may be no RAM accesses at all.

And it is not just the data cache. Do not forget that the code is also in memory and has to be loaded into the instruction cache. Each read or write also goes to a virtual address, whose translation is supported by the hardware TLB (which in extreme cases can trigger a page fault and, say, make the OS write a page to disk in the middle of your loop). All of this assumes that your program has the hardware entirely to itself (on non-real-time operating systems this is simply not the case, since other processes and threads compete for the same limited resources).

Finally, the execution itself is not done (directly) on memory reads and writes; rather, the data is first loaded into registers (and the result is then stored).

How the compiler allocates registers, whether it attempts loop unrolling or auto-vectorization, its instruction scheduling model (interleaving instructions to avoid data dependencies between them), and so on will also affect the actual throughput of the algorithm.
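To make the point about registers concrete, here is a hedged sketch in C of the in-place loop from the question, written once as given and once with the three-point window held in local variables. The function names are hypothetical, and whether a compiler actually keeps the locals in registers (or produces the same code from the first version on its own) depends on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: the in-place loop as written in the question. */
static void stencil_inplace(float *x, size_t n, float a)
{
    assert(n >= 3);
    for (size_t i = 1; i < n - 1; i++)
        x[i] = a * (x[i - 1] - x[i] + x[i + 1]);
}

/* Same computation with the window carried in locals, so each
 * iteration issues one load (x[i+1]) and one store (x[i]). */
static void stencil_inplace_regs(float *x, size_t n, float a)
{
    assert(n >= 3);
    float prev = x[0];  /* x[i-1]: x[0] at first, then the value just written */
    float cur  = x[1];  /* x[i]: still the old value */
    for (size_t i = 1; i < n - 1; i++) {
        float next = x[i + 1];               /* the only load */
        float out  = a * (prev - cur + next);
        x[i] = out;                          /* the only store */
        prev = out;                          /* updated x[i] becomes x[i-1] */
        cur  = next;
    }
}
```

Note that the in-place loop carries a dependency from one iteration to the next (x[i-1] is the freshly written value), which is one of the things that constrains how the compiler can vectorize or reorder it.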

So in the end, depending on the generated code, the CPU model, the amount of data processed and the state of the various caches, the latency of the algorithm can vary by orders of magnitude. The operational intensity of a loop therefore cannot be determined by inspecting the code (or even the generated assembly) alone, since many other (non-linear) factors are in play.


To answer your actual question, though: as far as I can see from the definition set out here, the second loop would count as a single 4-byte access per iteration on average, so its OI would be Θ(3N FLOP / 4N bytes). Intuitively this makes sense, since the data is already loaded in the cache and the write can modify the cache directly instead of going back to main memory (the data does eventually have to be written back, but that requirement is unchanged from the first loop).
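The arithmetic behind the two OI figures can be sketched in C as follows. This is only the counting used in this thread, not a measurement: it assumes each streamed float array costs 4 bytes of traffic per iteration (one 64-byte line serves 16 consecutive iterations), and it counts the in-place loop as one stream. The helper names are hypothetical.

```c
#include <assert.h>

/* Amortized memory traffic per iteration, in bytes, when the loop
 * streams through the given number of float arrays. */
static int streamed_bytes(int streamed_arrays)
{
    assert(streamed_arrays > 0);
    return streamed_arrays * (int)sizeof(float);
}

/* Operational intensity in FLOP per byte for one loop iteration. */
static double op_intensity(int flops_per_iter, int bytes_per_iter)
{
    assert(bytes_per_iter > 0);
    return (double)flops_per_iter / (double)bytes_per_iter;
}
```

With this accounting, op_intensity(3, streamed_bytes(2)) gives 3/8 FLOP/byte for the first loop (x read, y written) and op_intensity(3, streamed_bytes(1)) gives 3/4 FLOP/byte for the in-place loop.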


Source: https://habr.com/ru/post/1260173/

