Premise
I want to perform a computation over k long data vectors (each of length n) that I receive in main memory, and write the result back to main memory. For simplicity, assume the computation is just:
for (i = 0; i < n; i++)
    v_out[i] = foo(v_1[i], v_2[i], ..., v_k[i])
or maybe
for (i = 0; i < n; i++)
    v_out[i] = bar_k(...bar_2(bar_1(v_1[i]), v_2[i]), ...), v_k[i])
(This is not code, it is pseudocode.) foo() and bar_i() are functions with no side effects. k is a constant (known at compile time); n is known only just before the computation, and it is relatively large: at least several times the size of the entire L2 cache, and possibly more.
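To make the two pseudocode shapes concrete, here is a minimal C sketch for k = 3. Everything specific in it is an assumption made for illustration, not part of the question: the float element type, the bodies of foo() and bar_1() through bar_3(), the wrapper names apply_foo() and apply_bars(), and the restrict qualifiers (which assume the vectors do not overlap).

```c
#include <stddef.h>

/* Hypothetical stand-ins for foo() and bar_i(): cheap and side-effect free. */
static inline float foo(float a, float b, float c) { return a * b + c; }
static inline float bar_1(float a)                 { return a * 2.0f; }
static inline float bar_2(float acc, float b)      { return acc + b; }
static inline float bar_3(float acc, float c)      { return acc * c; }

/* First shape: v_out[i] = foo(v_1[i], v_2[i], v_3[i]).
 * restrict is an assumption: inputs and output do not alias. */
void apply_foo(size_t n,
               const float *restrict v_1,
               const float *restrict v_2,
               const float *restrict v_3,
               float *restrict v_out)
{
    for (size_t i = 0; i < n; i++)
        v_out[i] = foo(v_1[i], v_2[i], v_3[i]);
}

/* Second shape: v_out[i] = bar_3(bar_2(bar_1(v_1[i]), v_2[i]), v_3[i]). */
void apply_bars(size_t n,
                const float *restrict v_1,
                const float *restrict v_2,
                const float *restrict v_3,
                float *restrict v_out)
{
    for (size_t i = 0; i < n; i++)
        v_out[i] = bar_3(bar_2(bar_1(v_1[i]), v_2[i]), v_3[i]);
}
```

Both loops read each input element exactly once and write each output element exactly once, in a single pass over all k vectors.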
, x86_64 (Intel AMD, --, , ). , , foo() ( bar_i()) , .. n ( k x n) invocations foo() ( bar_i()).
, :
- , .
- , .
bar_j(...bar_1(v_1[i])...), L1, , v_ {j + 1} [i]... v_k [i] . L2.- L1 , . L2.
- .
- , .
- ---- v_out ( , , , , ).
:
- . , .
- , , .
- bar_i , w.r.t. v_out.