Simply filling an unrolled loop with conditional exits is usually not a win. It defeats most of the instruction scheduling that unrolling is supposed to enable. More often, the compiler checks in advance that the loop will run for at least n more iterations before entering the unrolled section.
To achieve this, the compiler may generate an elaborate preamble and postamble. The preamble aligns the loop data for better vectorization or instruction scheduling, and the postamble handles the leftover iterations that do not divide evenly into the unrolled section of the loop.
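As a rough illustration of that shape, here is a hand-written sketch in C of a loop unrolled by 4, with an up-front check and a remainder loop. The function name `sum_unrolled` and the unroll factor are made up for the example, and the alignment preamble is left out; real compiler output is typically more involved than this.

```c
#include <stddef.h>

/* Hypothetical example: sum an array with the loop unrolled by 4. */
long sum_unrolled(const int *a, size_t n)
{
    long total = 0;
    size_t i = 0;

    /* Guard: enter the unrolled body only while at least 4 elements
     * remain, so the body itself needs no per-element exit tests. */
    for (; n - i >= 4; i += 4) {
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }

    /* Postamble: the 0 to 3 leftover iterations that do not fill
     * a whole unrolled block. */
    for (; i < n; i++)
        total += a[i];

    return total;
}
```

Note that a call with n equal to 0 or 1 still pays for the guard test and never reaches the unrolled body at all.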
In the worst case, the loop runs only zero times or once, or perhaps twice in exceptional circumstances. Then only a small part of the loop is executed, but many extra tests are performed to get there. Worse, the alignment preamble may mean that different branch conditions arise on different calls, causing additional branch misprediction stalls.
All of these costs are meant to be amortized over a large number of iterations, but that does not happen for short loops.
On top of that, you get increased code size, and all of these unrolled loops together contribute to reduced icache efficiency.
And some architectures special-case very short loops, running them from internal buffers without even going to the cache.
And modern architectures do fairly extensive instruction reordering in hardware, even around memory accesses, which means the compiler's reordering of the loop may bring no additional benefit even in the best case.