Summary: I don't think it will be worse than you might expect (no special penalties due to this corner case), but it's still usually not worth doing, because of split caches / TLBs and because it offers few other advantages. To optimize for the cold-cache case, consider using immediate data instead (e.g. store the LUT to the stack before use, or use an immediate bitmap with the bt instruction).
Ideally you can place your constants with other constants used by code that often runs before or after this code. Compilers can use profile-guided optimization to find hot data and group it together (at least Intel VTune's help suggests this can reduce the overall dTLB miss rate).
Possible benefit: an L2 hit for the data load, or at least DRAM-page locality, if the function isn't tiny and the caches were cold to start with.
The main downside is cache / TLB efficiency. Data in code lines / pages is pollution of the L1I cache and iTLB, and code in data lines / pages is pollution of the L1D cache and dTLB.
The first rule of caching is that caches work. Code that runs frequently (and its data) will often be hot in cache. Code that doesn't run frequently is usually not important for performance. Trying to optimize for the worst case this way can end up making the best case less likely (more L1I / L1D misses and / or more TLB misses from including more lines / pages in both the code and data footprints).
L2 and outer caches are unified, but split L1 caches (and split L1 TLBs) are what any sane microarchitecture uses, for multiple reasons (physical proximity on the die to the front-end or to the execution units, total number of read / write ports, etc.), but especially on x86, where there is almost no overlap between code and data in compiler-generated code. All modern x86 designs also have a unified L2 TLB that handles misses from either the L1iTLB or the L1dTLB. (Intel's optimization guide calls it the STLB, for Second-level TLB.)
But unlike the L2 cache, I think Intel's STLB is a victim cache for both the iTLB and dTLB. (I don't remember where I read this, and can't find a source for it.) If my memory is correct, an L1TLB miss that hits in the STLB swaps the entries, so nothing is evicted or duplicated. On a miss at both levels, the page walk only loads the L1iTLB with the new entry. (I think the evicted entry goes into the STLB, and the LRU entry from that set in the STLB is evicted.)
So if I'm right about the TLB behaviour on Intel CPUs, a dTLB miss from movzx eax, byte [lut + eax] will also miss in the STLB (if the caches were cold to start with), triggering another page walk even though the same page must already be hot in the iTLB, since it holds the code being executed. At least the page-table entries will be hot in the L1D cache (and in the page walker's internal paging-structure caches).
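For concreteness, here's a hypothetical NASM-style sketch of the kind of layout being discussed (not the asker's exact code), with a byte LUT placed in .text right after the function that reads it:

    section .text
    lookup_popcount:                     ; index in edi (0..15), result in eax
        movzx  eax, byte [lut + rdi]     ; data load that touches a code page
        ret                              ; assumes the index is already zero-extended into rdi
    lut:                                 ; table bytes live right after the ret
        db 0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4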
You could test this behaviour by making code that jumps from page to page, loading itself as data, e.g. repeating this block: here: mov eax, [rip+0] / jmp here+4096 / align 4096. Then look at perf counters for STLB misses from data loads (not from code fetch). If the TLBs do behave that way, code/data locality in terms of 4k or 2M pages is much less valuable than it otherwise would be, but still no worse than keeping them completely separate (apart from the pollution argument that there could have been useful code in that space, reducing the total number of code pages touched).
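A rough, untested NASM sketch of such a test; the %rep / %assign scaffolding and the page count are my own additions for illustration:

    %assign i 0
    %rep 64                          ; 64 pages is an arbitrary choice
        align 4096
    page %+ i:
        mov   eax, [rel page %+ i]   ; load this page's own code bytes as data
        %assign i i+1
        jmp   page %+ i              ; jump to the copy of this block in the next page
    %endrep
        align 4096
    page %+ i:
        ret                          ; the last jmp lands here

Running that in a loop while watching the dTLB / STLB load-miss perf events (e.g. the dtlb_load_misses group on Intel) should show whether the loads still trigger page walks even though every page is hot in the iTLB.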
If your function + data isn't all contained in the same cache line, a load early in the function could result in outstanding misses (to L2) for the same line from both L1D (demand load) and L1I (speculative code fetch). I don't know if that's a problem on any x86 microarchitectures. I'd guess it's probably no worse than usual, and hopefully better than outstanding misses for two different lines. I'd guess the hardware doesn't trigger any slow corner-case handling for this, but I haven't tested it.
If the end of the function + data is in the next page, you could even get an iTLB miss and a dTLB miss in parallel for the same page, from a demand load + code fetch.
However, I think data loaded from the same cache line as the currently-executing code will typically hit in L2 (although it's possible it's still hot in L1I but has been evicted from L2, and maybe even from L3 on CPUs like Skylake-AVX512 where L3 isn't inclusive). It may sometimes be worth the bloat to both the data and code working sets that comes from mixing them in the same line.
Non-x86:
ARM compilers (and assemblers, for the ldr r0, =constant pseudo-instruction) use literal pools to load constants larger than 16 bits with small PC-relative loads. I think these often end up on the same page, and sometimes in the same cache line, as code. Obviously ARM microarchitectures are designed to run such code efficiently (apart from the wasted I-cache / D-cache footprint, which is unavoidable). But presumably the code-size / instruction-count benefit is usually worth it. I'm not sure why this is common on ARM but not in other RISC ISAs (at least I think it's not common on MIPS / PowerPC). Modern ARM has good support for building arbitrary 32-bit constants with two instructions, and many bit patterns can be created with a single instruction (using the barrel shifter with an immediate mov or mvn).
But there's no reason to expect an x86 microarchitecture to take special care to handle such cases more efficiently than the default, because this pattern isn't common on x86. It isn't used by compilers, and the only RIP-relative addressing mode uses a rel32 displacement, so there isn't even a code-size advantage to placing data very close to the code. Only locality is in its favour, for L3 / L2 (and DRAM pages).
That doesn't mean we should expect it to be slow, only that we can't infer x86 behaviour from the fact that ARM CPUs need to support it efficiently. ARM CPUs that use / support paging might have a TLB allocation policy that favours this pattern, e.g. allocating an L2 TLB entry on an iTLB miss. (If they use a multi-level TLB at all.)
For example, could the CPU's fetch mechanism be confused by fetching past the ret in my example and trying to interpret the lookup table as (meaningless) instructions?
Speculative execution past a ret usually isn't a problem, it seems.
I tried to test this once (with a fairly poor test that I didn't put much effort into), but I couldn't find any effect. Maybe my test wasn't big enough to defeat branch prediction for the ret, or maybe speculative execution simply doesn't continue past a ret the way it does for other indirect branches. (Even if a call instruction is the first instruction of a function, and the callee is laid out contiguously right before it, the correct return address is the instruction after the call, not the call itself. So speculatively executing the instructions following a ret can only be useful when something pushes a fake return address onto the stack.)
If the last instruction before your data was an indirect jmp, then it would make sense to worry about how the data decodes as instructions. You could block speculative execution by placing an int3 or ud2 after it, before your data, as the Intel Optimization Guide recommends:
3.4.1.6 Branch Type Selection
blah blah fall-through is the default prediction for indirect branches (but they don't mention ret). Bogus instructions can slow down branch recovery.
Also, data immediately following indirect branches may appear as branches to the branch prediction hardware, which can branch off to execute other data pages. This can lead to subsequent self-modifying code problems.
Assembly/Compiler Coding Rule 14. (M impact, L generality) When indirect branches are present, try to put the most likely target of an indirect branch immediately following the indirect branch. Alternatively, if indirect branches are common but they cannot be predicted by branch prediction hardware, then follow the indirect branch with a ud2 instruction, which will stop the processor from decoding down the fall-through path.
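For example, a minimal hypothetical fragment (labels invented for illustration) where a ud2 fences off a table stored right after the indirect jmp that uses it:

    dispatch:
        jmp   [jump_table + rax*8]   ; indirect branch; static prediction is fall-through
        ud2                          ; stop the front-end from decoding the table bytes as code
    jump_table:
        dq    case0, case1, case2, case3
    case0: ret
    case1: ret
    case2: ret
    case3: ret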
Read-only access should be fine for the pipeline itself. Only write access near (maybe within 2k or 4k of) EIP/RIP triggers self-modifying-code machine nukes / pipeline clears. (So obviously don't do this for non-const static data, which you normally can't anyway, since code pages are usually mapped read / exec but not writeable.)
If your LUT is small enough, use it as immediate data instead of a load
If cold-cache performance matters, you can store your LUT to the stack with mov r64, imm64 / mov [m64], r64 instruction pairs. (Or maybe mov r/m32, imm32.)
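A minimal sketch of that (a hypothetical nibble-popcount table, the same one as in the earlier layout sketch; SysV x86-64, using the red zone below rsp):

    popcount_nibble:                        ; index in edi (0..15), result in eax
        mov     rax, 0x0302020102010100     ; popcounts of 0..7, packed as 8 bytes
        mov     [rsp-16], rax               ; red zone is fine for a leaf function (not on Windows)
        mov     rax, 0x0403030203020201     ; popcounts of 8..15
        mov     [rsp-8], rax
        movzx   eax, byte [rsp-16+rdi]      ; assumes the index is already zero-extended into rdi
        ret

mov r64, imm64 is 10 bytes, so this costs code size, but on a cold call the table arrives as immediate data in the instruction stream instead of needing a load from a possibly-cold data line.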
Immediate bitmaps are great to set up for the bt instruction. As @Ross points out, you could do:
    mov   eax, 0x5555
    bt    eax, edi
    setc  al
Or with the bitmap as an immediate operand to a test instruction:
    xor   eax, eax
    bts   eax, edi        ; 1U << arg
    test  eax, 0x5555
    setnz al              ; ZF clear if the selected bit is set in the mask
Compilers will use this trick for switch when many case labels run the same code, as in this case (on Godbolt with gcc and clang).
Another example: a vowel / consonant bitmap in a code-golf answer for classifying strings according to whether their vowel / consonant pattern is palindromic.
In really hot functions it's often better to do a load instead of using mov-immediate, especially if it saves multiple mov instructions. But even saving one fused-domain uop by using a memory source operand for an ALU instruction can be worth it, so there's a trade-off between the hot-cache and cold-cache cases. (But never use bt with a memory operand: its performance is garbage because of the crazy-CISC semantics of indexing a bit string instead of wrapping within the dword or qword selected by the addressing mode, the way it does with a register destination.)
Or just compute instead of using a LUT at all: test eax,eax / setp al, since parity (of the low byte only) is supported in hardware on x86. Other architectures with a hardware popcnt could use it and take the low bit for even / odd parity.
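For instance, parity of a full 32-bit value can also be computed on x86 with popcnt where it's available, with no table at all (a trivial sketch):

    parity32:                   ; returns 1 in eax if edi has an odd number of set bits
        popcnt  eax, edi
        and     eax, 1
        ret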
But for other problems, a LUT or a small vector constant can save a lot of work. (Maybe stored compressed, to be loaded with a broadcast load such as movddup or vpbroadcastd, or expanded per element with pmovsx / pmovzx.)
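As a sketch of that last idea (hypothetical constants; VEX encodings used throughout, so AVX2 assumed):

    section .rodata
    bytes4:  db 1, 2, 3, 4              ; stored as 4 bytes, used as 4 dwords
    dword1:  dd 0x3f800000              ; float 1.0, broadcast to every lane
    qword1:  dq 0x4000000040000000      ; a pair of 2.0f floats, duplicated as one 64-bit chunk

    section .text
    load_constants:
        vpmovzxbd    xmm0, [bytes4]     ; expand 4 bytes to 4 packed dwords
        vpbroadcastd ymm1, [dword1]     ; broadcast one dword to all 8 lanes
        vmovddup     xmm2, [qword1]     ; load 8 bytes, duplicate into both qword halves
        vzeroupper
        ret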