In which cases will the -funroll-loops option not make the resulting code faster?

Adapted from the GCC manual:

-funroll-loops Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop. -funroll-loops implies -frerun-cse-after-loop. This option makes code larger, and may or may not make it run faster. 

As I understand it, unrolled loops contain fewer branch instructions in the generated code, which I believe is friendlier to the CPU pipeline.

But why "may or may not make it run faster"?
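For concreteness, here is a rough sketch of the transformation (hand-written for illustration, not actual compiler output; the function names and the unroll factor of 4 are made up):

    /* rolled: one compare-and-branch per element */
    void inc_rolled(int *a)
    {
        for (int i = 0; i < 100; i++)
            a[i]++;
    }

    /* roughly what -funroll-loops produces (factor 4): one
       compare-and-branch per four elements; valid here because the
       trip count (100) is known at compile time and divisible by 4 */
    void inc_unrolled(int *a)
    {
        for (int i = 0; i < 100; i += 4) {
            a[i]++;
            a[i + 1]++;
            a[i + 2]++;
            a[i + 3]++;
        }
    }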

+4
6 answers

First of all, it may not make any difference: if your loop condition is "simple" and evaluated many times, the branch predictor should pick it up quickly and predict the branch correctly nearly every time, making the rolled code run almost as fast as the unrolled code.

In addition, on non-pipelined processors the cost of a branch is quite small, so the optimization may buy nothing, and code-size considerations can matter much more (for example, when compiling for a microcontroller; remember that gcc targets everything from AVR micros to supercomputers).

Another case where unrolling cannot speed up a loop is when the loop body is much slower than the loop overhead itself: if, for example, there is a syscall in the loop body, the loop overhead will be insignificant compared to the system call.
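A minimal illustration of that point (the function name is made up): the few cycles spent on the loop's increment, compare, and branch are dwarfed by each write() call, so eliminating them changes nothing measurable:

    #include <unistd.h>

    /* The loop overhead (increment, compare, branch: a few cycles) is
       dwarfed by the write() syscall (hundreds of cycles or more), so
       unrolling this loop buys essentially nothing. */
    void emit_bytes(const char *buf, int n)
    {
        for (int i = 0; i < n; i++)
            write(1, &buf[i], 1);   /* one syscall per iteration */
    }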

As for when it can actually slow your code down: increasing code size can slow it down. If your code no longer fits in the cache / a page / ..., you will get cache misses / page faults / ..., and the processor will have to wait for memory to deliver the code before it can execute it.

+4

The answers so far are very good, but I will add one thing that has not been touched on yet: branch predictor slots are a finite resource. If your loop contains a branch and is not unrolled, it consumes only one branch predictor slot, so it will not evict predictions the processor has made for outer loops, sibling loops, or the caller. However, if the loop body is duplicated many times by unrolling, each copy contains a separate branch that consumes a predictor slot. This kind of performance loss is easy to miss because, like cache-eviction problems, it will not show up in most isolated synthetic loop benchmarks. Instead, it manifests as degraded performance in other code.

As a great example, the fastest strlen on x86 (far better than the best hand-written asm I have seen) is an insanely unrolled loop that simply does:

    if (!s[0]) return s-s0;
    if (!s[1]) return s-s0+1;
    if (!s[2]) return s-s0+2;
    /* ... */
    if (!s[31]) return s-s0+31;

However, this chews through branch predictor slots, so for some real-world workloads some kind of vectorized approach is preferable.
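One branch-light alternative is the word-at-a-time bit trick used by several C libraries (musl's strlen is built on it). The sketch below is illustrative, not the answerer's code, and glosses over strict-aliasing caveats; it spends one loop branch per machine word instead of per byte:

    #include <stddef.h>
    #include <stdint.h>

    #define ONES  ((size_t)-1 / 0xff)   /* 0x0101...01 */
    #define HIGHS (ONES * 0x80)         /* 0x8080...80 */
    /* nonzero iff some byte of x is zero */
    #define HASZERO(x) (((x) - ONES) & ~(x) & HIGHS)

    size_t strlen_word(const char *s)
    {
        const char *p = s;
        /* byte-at-a-time until p is word-aligned */
        for (; (uintptr_t)p % sizeof(size_t); p++)
            if (!*p) return p - s;
        /* one word per iteration: a single loop branch covers
           sizeof(size_t) bytes */
        const size_t *w = (const size_t *)p;
        while (!HASZERO(*w)) w++;
        /* locate the exact zero byte within the final word */
        for (p = (const char *)w; *p; p++) ;
        return p - s;
    }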

+1

Unrolling a loop by peppering it with conditional exits is not usually done; that defeats most of the instruction scheduling that unrolling enables in the first place. More commonly, the compiler checks up front that the loop will run at least n iterations before entering the unrolled section.

To achieve this, the compiler may generate an elaborate preamble and postamble that align the loop data for better vectorization or better instruction scheduling, and that handle the leftover iterations that do not divide evenly into the unrolled section of the loop.
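A hedged sketch of that shape (hand-written; real compiler output also adds alignment preambles and is considerably more involved):

    void scale(float *a, float k, int n)
    {
        int i = 0;
        /* main unrolled section, entered only while at least 4
           iterations remain */
        for (; i + 4 <= n; i += 4) {
            a[i]     *= k;
            a[i + 1] *= k;
            a[i + 2] *= k;
            a[i + 3] *= k;
        }
        /* postamble: the 0-3 leftover iterations */
        for (; i < n; i++)
            a[i] *= k;
    }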

It may turn out (the worst case) that the loop runs only zero or one times, or perhaps twice in exceptional circumstances. Then only a small fraction of the loop body executes, yet many extra tests are performed on the way to it. Worse, the alignment preamble can mean that different branch conditions arise on different calls, causing extra branch-misprediction stalls.

All of these costs are meant to be amortized over a large number of iterations, but for short loops that never happens.

On top of that comes the increased code size, where all these unrolled loops together degrade icache performance.

And some architectures special-case very short loops so they run from internal buffers without even touching the cache (for example, the loop stream detector on Intel CPUs).

And modern architectures do fairly aggressive out-of-order execution, including of memory accesses, so the compiler's reordering of the loop may bring no additional benefit even in the best case.

+1

For example, when the unrolled body of a function becomes larger than the instruction cache. Fetching it from main memory is obviously slower.

0

Say you have a loop with 25 instructions that iterates 1000 times. Fully unrolled, it becomes 25,000 instructions, and the extra resources needed to handle that much code can greatly outweigh the pain caused by branching.

It is also important to note that many kinds of loop branches are quite painless, since processors have become good at branch prediction for the simpler cases. Unrolling eight iterations is probably efficient, for example, but even 50 is probably better left to the CPU. Note too that the compiler is probably better at guessing which is better than you are.
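If you do want to influence the choice per loop rather than hand-unrolling, GCC 8 and later accept a per-loop unroll hint pragma (a sketch; the function name is made up):

    /* #pragma GCC unroll N asks GCC to unroll the following loop by
       (at most) N; without it, -funroll-loops leaves the factor to
       the compiler's heuristics. */
    float sum8(const float *v, int n)
    {
        float s = 0.0f;
    #pragma GCC unroll 8
        for (int i = 0; i < n; i++)
            s += v[i];
        return s;
    }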

0

Unrolling loops should always make the code faster. The tradeoff is between fast code and large code size. Tight loops (relatively small amounts of code executed in the loop body) that run a significant number of times can benefit from unrolling, removing all the loop overhead and letting the pipeline do its job. Loops that go through many iterations may unroll into a large amount of additional code: faster, but perhaps unacceptably larger for the performance gained. Loops with a lot going on in the body may not benefit much from unrolling, since the loop overhead becomes small compared to everything else.

-1

Source: https://habr.com/ru/post/1486722/

