Machine Code Alignment

I am trying to understand the principles of machine code alignment. I have an assembler implementation that generates machine code at run time. I align every branch target to 16 bytes, but this does not seem to be the best choice: I have noticed that if I remove the alignment, the same code sometimes runs faster. I suspect this is related to the cache line width: some instructions are split across a cache line boundary, and the CPU stalls because of it. So when alignment bytes are inserted in one place, they shift the following instructions further along and can push one of them across a cache line boundary ...
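To make the two quantities in the question concrete, here is a minimal sketch. It assumes a 64-byte cache line (which is what Core i7 parts use) and byte offsets counted from a cache-line-aligned code buffer; the function names are made up for illustration.

```python
def padding_for(offset: int, alignment: int = 16) -> int:
    """Number of padding bytes needed to round `offset` up to a multiple of `alignment`."""
    return (-offset) % alignment


def crosses_line(offset: int, length: int, line_size: int = 64) -> bool:
    """True if an instruction of `length` bytes starting at `offset` spans two cache lines."""
    first_line = offset // line_size
    last_line = (offset + length - 1) // line_size
    return first_line != last_line


# A 5-byte instruction at offset 59 occupies bytes 59..63 and stays in one line.
print(crosses_line(59, 5))   # False
# Three bytes of padding inserted earlier move it to offset 62: bytes 62..66 now straddle the boundary.
print(crosses_line(62, 5))   # True
print(padding_for(13))       # 3
```

This only shows how padding in one place can create a split further on; whether a split actually costs cycles depends on the microarchitecture.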

I was hoping to implement an automatic alignment procedure that processes the code as a whole and inserts alignment according to the characteristics of the CPU (cache line width, 32/64-bit mode, etc.) ...
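One possible shape for such a pass, as a rough sketch only: walk the instruction list once, and pad a branch target only when the padding stays under a budget. The `Insn` record, the `layout` function and the `max_pad` threshold are all hypothetical, and the sketch ignores that padding can change the encoded size of jumps, which a real pass would have to iterate on.

```python
from dataclasses import dataclass


@dataclass
class Insn:
    length: int              # encoded size in bytes
    is_target: bool = False  # True if some branch jumps to this instruction


def layout(insns, alignment=16, max_pad=8):
    """Assign an offset to each instruction, padding branch targets
    only when the padding needed is at most `max_pad` bytes.
    Returns (offsets, total_padding)."""
    offsets = []
    pos = 0
    total_pad = 0
    for insn in insns:
        if insn.is_target:
            pad = (-pos) % alignment
            if pad <= max_pad:   # skip alignment that would waste too many bytes
                pos += pad
                total_pad += pad
        offsets.append(pos)
        pos += insn.length
    return offsets, total_pad


code = [Insn(5), Insn(3, is_target=True), Insn(7), Insn(2, is_target=True)]
print(layout(code))  # ([0, 5, 8, 16], 1)
```

The padding budget plays the same role as the optional maximum-skip argument of the GNU assembler's `.p2align` directive; the right threshold for a given CPU would have to be measured.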

Can someone give me some hints about such a procedure? As an example, the target could be an Intel Core i7 on a 64-bit platform.

Thanks.

+4
4 answers

I cannot really answer your question, because it is such a vast and complicated topic. There are probably many more mechanisms in play than just the cache line size.

However, I would like to point you to Agner Fog's website and the optimization manuals for compiler writers that you can find there. They contain a wealth of information on exactly these topics: cache lines, branch prediction, and data/code alignment.

+3

Paragraph (16-byte) alignment is usually the best. However, it can cause some "local" (short) JMP instructions to stop being local because of the code bloat. It may also result in less code fitting in the cache. I would align only the major code segments; I would not align every tiny subroutine or JMP target.
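The "stops being local" effect can be checked mechanically. On x86, a short jump carries a signed 8-bit displacement measured from the end of the jump instruction, so it reaches only -128..+127 bytes; beyond that the assembler must switch to the longer near form (2 bytes versus 5 for an unconditional JMP in 32/64-bit mode). A small sketch, with a hypothetical helper name:

```python
def fits_rel8(branch_end: int, target: int) -> bool:
    """True if a jump whose encoding ends at `branch_end` (the offset of the
    byte after the jump) can reach `target` with a signed 8-bit displacement."""
    return -128 <= target - branch_end <= 127


# A 2-byte short jump at offset 0 ends at offset 2.
print(fits_rel8(2, 120))  # True: displacement 118
# Alignment padding in between moves the target to offset 136.
print(fits_rel8(2, 136))  # False: displacement 134 needs the near form
```

Because switching a jump to the longer form moves everything after it, an assembler that aligns aggressively generally has to re-run its jump sizing until the layout stops changing.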

+2

Not an expert, however ... Branches to locations that are not expected to be in the instruction cache should benefit most from alignment, because a whole cache line of instructions will be read to fill the pipeline. Given that, forward branches will benefit on the first run of a function. Backward branches (for example, "for" and "while" loops) probably will not benefit, because the branch target and the instructions following it have already been read into the cache. Do follow the links in Martin's answer.

+1

As mentioned earlier, this is a very complex area. Agner Fog's site seems like a good place to visit. As for the complexity, I looked at the paper by Torbjörn Granlund, "Improved division by invariant integers", and in the code he uses to illustrate his new algorithm, the first instruction at (I think) the main label is nop, i.e. no operation. According to the comment, it improves performance significantly. Go figure.

+1

Source: https://habr.com/ru/post/1342605/