Out-of-order execution in the CPU means that the CPU can reorder instructions to get better performance, and it also means the CPU has to do some very clever bookkeeping to make that work. There are other processor approaches to parallelism too, such as hyperthreading.
Some clever compilers understand the (in)dependence of instructions to a limited extent and will automatically interleave instruction streams (possibly over a longer window than the processor sees) to make better use of the processor. Deliberate compile-time interleaving of floating-point and integer instructions is another example of this.
Now I have a highly parallel task, and I typically have an aging single-core x86 processor without hyperthreading.
Is there a straightforward way to write the body of my "for" loop for this highly parallel task so that two (or more) iterations execute together, interleaved? (This is slightly different from loop unrolling, as I understand it.)
My task is a "virtual machine" running through a set of instructions, which I have simplified greatly for illustration:
void run(int num) {
    for (int n = 0; n < num; n++) {
        vm_t data(n);
        for (int i = 0; i < data.len(); i++) {
            data.insn(i).parse();
            data.insn(i).eval();
        }
    }
}

Thus, the execution trace may look like this:
data(1) insn(0) parse
data(1) insn(0) eval
data(1) insn(1) parse
...
data(2) insn(1) eval
data(2) insn(2) parse
data(2) insn(2) eval
Now, I would like to be able to do two (or more) iterations explicitly in parallel:
data(1) insn(0) parse
data(2) insn(0) parse \ processor can do OOO as these two flow in
data(1) insn(0) eval /
data(2) insn(0) eval \ OOO opportunity here too
data(1) insn(1) parse /
data(2) insn(1) parse
Profiling the real code with Callgrind (--simulate-cache=yes) shows that eval is the expensive step, apparently because of cache misses. While one iteration's eval is stalled waiting on memory, the parse and eval of another, independent iteration could proceed.

Is there a direct way to express this kind of parallelism in C++? I know the compiler or the out-of-order hardware may already be managing some of this behind the scenes, but I would like to structure the loop so that the opportunity is explicit. How can I do that?
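To make the idea concrete, here is a minimal self-contained sketch of interleaving two outer iterations by hand (software pipelining done manually). Note the `vm_t`/`insn_t` classes here are stand-in stubs with a fixed instruction count, invented just so the example compiles and the schedule can be observed; only the loop structure is the point:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Global trace so the schedule can be observed; the real code does work here.
std::vector<std::string> trace;

// Stub instruction/VM types mirroring the interface sketched in the question.
struct insn_t {
    int vm, idx;
    void parse() { trace.push_back("data(" + std::to_string(vm) + ") insn(" + std::to_string(idx) + ") parse"); }
    void eval()  { trace.push_back("data(" + std::to_string(vm) + ") insn(" + std::to_string(idx) + ") eval"); }
};

struct vm_t {
    int id;
    explicit vm_t(int n) : id(n) {}
    int len() const { return 2; }                  // fixed length for the demo
    insn_t insn(int i) const { return insn_t{id, i}; }
};

// Manual two-way interleave: pair up outer iterations so that independent
// parse/eval calls from different vm_t objects sit next to each other,
// giving the out-of-order core more independent work in flight.
void run_interleaved(int num) {
    int n = 0;
    for (; n + 1 < num; n += 2) {
        vm_t a(n), b(n + 1);
        int common = std::min(a.len(), b.len());
        for (int i = 0; i < common; i++) {
            a.insn(i).parse();
            b.insn(i).parse();   // independent of a's parse: OOO opportunity
            a.insn(i).eval();
            b.insn(i).eval();    // likewise independent of a's eval
        }
        // Drain whichever instruction stream is longer.
        for (int i = common; i < a.len(); i++) { a.insn(i).parse(); a.insn(i).eval(); }
        for (int i = common; i < b.len(); i++) { b.insn(i).parse(); b.insn(i).eval(); }
    }
    if (n < num) {               // leftover odd iteration runs unpaired
        vm_t d(n);
        for (int i = 0; i < d.len(); i++) { d.insn(i).parse(); d.insn(i).eval(); }
    }
}
```

This produces exactly the alternating `data(0)`/`data(1)` schedule shown above, but it duplicates the loop body and needs drain/leftover handling, which is why I am hoping for something cleaner.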