Compiler Hints and Optimization Semantics

I spent the last couple of weeks optimizing a numerical algorithm. Through a combination of precalculation, memory alignment, compiler hints and flags, and trial-and-error experiments, I reduced the execution time by an order of magnitude. I have not yet vectorized explicitly with intrinsics or used multithreading.

Problems of this type often have an initialization procedure, after which many parameters become constant. These can be filter lengths, the expression a switch statement selects on, loop lengths, or iteration increments. If the parameters were known at compile time, the compiler could optimize far more effectively: it would know exactly how to unroll loops, could replace index computations with offsets encoded directly in the instructions, could simplify or eliminate expressions at compile time, and could possibly eliminate switch statements altogether. The most extreme way to solve this problem is to run the initialization routine (at run time), then invoke the compiler on the critical function with some kind of plugin that walks the abstract syntax tree and replaces the parameters with constants, and finally link dynamically against the resulting shared object. If the routine is short, it can instead be compiled dynamically from within the binary using any of a number of tools.
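A cheaper alternative to recompiling at run time is to specialize the hot function for a few expected parameter values and dispatch once on entry. The following is a minimal sketch of that idea, not the author's code: the names fir_generic and fir, the FIR-filter shape of the loop, and the chosen tap counts 4, 8 and 16 are all assumptions made for illustration. It relies on gcc's always_inline attribute so that each call with a literal tap count becomes a separate copy with a constant inner trip count.

```c
#include <assert.h>
#include <stddef.h>

/* Generic kernel (hypothetical example). Because it is always inlined,
   every call site that passes a literal `taps` gets its own copy in which
   the inner trip count is a compile-time constant. */
static inline __attribute__((always_inline))
void fir_generic(const float *restrict in, const float *restrict coeff,
                 float *restrict out, size_t n, size_t taps)
{
    for (size_t i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps; ++k)
            acc += in[i + k] * coeff[k];
        out[i] = acc;
    }
}

/* Dispatch once per call, not once per sample.
   `in` must hold n + taps - 1 samples; `coeff` must hold `taps` values. */
void fir(const float *in, const float *coeff, float *out,
         size_t n, size_t taps)
{
    assert(taps > 0 && "filter needs at least one tap");
    switch (taps) {
    case 4:  fir_generic(in, coeff, out, n, 4);  break;
    case 8:  fir_generic(in, coeff, out, n, 8);  break;
    case 16: fir_generic(in, coeff, out, n, 16); break;
    default: fir_generic(in, coeff, out, n, taps); break; /* unspecialized fallback */
    }
}
```

The default branch keeps the function correct for any tap count, so the specialized cases are purely an optimization. Whether gcc actually unrolls or vectorizes the specialized copies depends on the version and flags, so it is worth inspecting the generated assembly.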

Instead, I rely heavily on alignment, gcc's __builtin_assume_aligned, restrict, manual loop unrolling, and compiler flags to force the compiler to do what I want, given that the parameter values are unknown at compile time. I am wondering what other options are available to me that are at least close to portable. I treat intrinsics as a last resort, since they are not portable and will not work everywhere. In particular, how can I give the compiler (gcc) additional information about loop variables, using language semantics, compiler extensions, or external tools, so that it can optimize better for me? Similarly, is there a way to qualify variables as having a stride such that loads and stores are always aligned, making it easier for automatic vectorization and loop unrolling to kick in?
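For reference, the hints mentioned above can be combined in one function. This is a sketch under stated assumptions: the function name saxpy_aligned8, the 32-byte alignment, and the multiple-of-8 trip count are illustrative choices, not taken from the author's code. __builtin_assume_aligned and __builtin_unreachable are gcc extensions (also accepted by clang), so this is only "close to portable".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* y[i] += a * x[i] (hypothetical example).
   The caller promises: x and y are 32-byte aligned, they do not overlap,
   and n is a multiple of 8. */
void saxpy_aligned8(float *restrict y, const float *restrict x,
                    float a, size_t n)
{
    /* Debug-build checks of those promises; compiled out with -DNDEBUG. */
    assert((uintptr_t)x % 32 == 0 && (uintptr_t)y % 32 == 0);
    assert(n % 8 == 0);

    /* gcc extension: the returned pointers may be assumed 32-byte aligned. */
    float *ya = __builtin_assume_aligned(y, 32);
    const float *xa = __builtin_assume_aligned(x, 32);

    /* Reaching __builtin_unreachable() is undefined behaviour, so the
       optimizer is allowed to assume n % 8 == 0 from here on. */
    if (n % 8 != 0)
        __builtin_unreachable();

    for (size_t i = 0; i < n; ++i)
        ya[i] += a * xa[i];
}
```

With these promises in place the compiler is free to use aligned vector loads and may be able to drop the scalar remainder loop, but neither is guaranteed. Compiling with -O3 -fopt-info-vec and reading the assembly is the only reliable way to see what gcc actually did.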


These problems come up frequently, so I hope there is a more elegant way to solve them. Below are examples of the kinds of things I currently optimize by hand but believe the compiler should be able to do for me. They are not meant as separate questions.

[The three examples that followed did not survive extraction. The recoverable fragments mention SIMD, two cases labelled (A) and (B), gcc, #pragma GCC optimize ("unroll-loops"), N &= ~7 to round N down to a multiple of 8, AVX, and 128-bit XMM registers.]


Source: https://habr.com/ru/post/1661875/
