Summary: I don't think it will be worse than you might expect (no special penalties due to this corner case), but it's still usually not worth doing, because of split caches / TLBs and because it offers few other advantages. To optimize for the cold-cache case, consider using immediate data instead (e.g. store the LUT to the stack before use, or use an immediate bitmap with the bt instruction).
Ideally you can place your constants with other constants used by code that often runs before or after this code. Compilers can use profile-guided optimization to find hot data and group it together (at least Intel VTune's help suggests this can reduce the overall dTLB miss rate).
Possible benefit: an L2 hit for the data load, or at least DRAM-page locality, if the function isn't tiny and the caches were cold to start with.
The main downside is cache / TLB efficiency. Data in code lines / pages is pollution of the L1I cache and iTLB, and code in data lines / pages is pollution of the L1D cache and dTLB.
The first rule of caching is that caches work. Code that runs frequently (and its data) will often be hot in cache. Code that doesn't run frequently is usually not important for performance. Trying to optimize for the worst case this way can end up making the best case less likely (more L1I / L1D misses and / or more TLB misses from including more lines / pages in both the code and data footprints).
L2 and outer caches are unified, but split L1 caches (and split L1 TLBs) are what any sane microarchitecture uses, for multiple reasons (physical proximity on the die to the front-end or to the execution units, total number of read / write ports, etc.), but especially on x86, where there is almost no overlap between code and data in compiler-generated code. All modern x86 designs also have a unified L2 TLB that handles misses from either the L1iTLB or the L1dTLB. (Intel's optimization guide calls it the STLB, for Second-level TLB.)
But unlike the L2 cache, I think Intel's STLB is a victim cache for both the iTLB and dTLB. (I don't remember where I read this, and can't find a source for it.) If my memory is correct, an L1TLB miss that hits in the STLB swaps the entries, so nothing is evicted or duplicated. On a miss at both levels, the page walk only loads the L1iTLB with the new entry. (I think the evicted entry goes into the STLB, and the LRU entry from that set in the STLB is evicted.)
So if I'm right about the TLB behaviour on Intel CPUs, a dTLB miss from movzx eax, byte [lut + eax] will also miss in the STLB (if the caches were cold to start with), triggering another page walk even though the same page must already be hot in the iTLB, since it holds the code being executed. At least the page-table entries will be hot in the L1D cache (and in the page walker's internal paging-structure caches).
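For concreteness, here's a hypothetical NASM-style sketch of the kind of layout being discussed (not the asker's exact code), with a byte LUT placed in .text right after the function that reads it:

    section .text
    lookup_popcount:                     ; index in edi (0..15), result in eax
        movzx  eax, byte [lut + rdi]     ; data load that touches a code page
        ret                              ; assumes the index is already zero-extended into rdi
    lut:                                 ; table bytes live right after the ret
        db 0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4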
You could test this behaviour by making code that jumps from page to page, loading itself as data, e.g. repeating this block: here: mov eax, [rip+0] / jmp here+4096 / align 4096. Then look at perf counters for STLB misses from data loads (not from code fetch). If the TLBs do behave that way, code/data locality in terms of 4k or 2M pages is much less valuable than it otherwise would be, but still no worse than keeping them completely separate (apart from the pollution argument that there could have been useful code in that space, reducing the total number of code pages touched).
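A rough, untested NASM sketch of such a test; the %rep / %assign scaffolding and the page count are my own additions for illustration:

    %assign i 0
    %rep 64                          ; 64 pages is an arbitrary choice
        align 4096
    page %+ i:
        mov   eax, [rel page %+ i]   ; load this page's own code bytes as data
        %assign i i+1
        jmp   page %+ i              ; jump to the copy of this block in the next page
    %endrep
        align 4096
    page %+ i:
        ret                          ; the last jmp lands here

Running that in a loop while watching the dTLB / STLB load-miss perf events (e.g. the dtlb_load_misses group on Intel) should show whether the loads still trigger page walks even though every page is hot in the iTLB.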
If your function + data isn't all contained in the same cache line, a load early in the function could result in outstanding misses (to L2) for the same line from both L1D (demand load) and L1I (speculative code fetch). I don't know if that's a problem on any x86 microarchitectures. I'd guess it's probably no worse than usual, and hopefully better than outstanding misses for two different lines. I'd guess the hardware doesn't trigger any slow corner-case handling for this, but I haven't tested it.
If the end of the function + data is in the next page, you could even get an iTLB miss and a dTLB miss in parallel for the same page, from a demand load + code fetch.
However, I think data loaded from the same cache line as the currently-executing code will typically hit in L2 (although it's possible it's still hot in L1I but has been evicted from L2, and maybe even from L3 on CPUs like Skylake-AVX512 where L3 isn't inclusive). It may sometimes be worth the bloat to both the data and code working sets that comes from mixing them in the same line.
Non-x86:
ARM compilers (and assemblers, for the ldr r0, =constant pseudo-instruction) use literal pools to load constants larger than 16 bits with small PC-relative loads. I think these often end up on the same page, and sometimes in the same cache line, as code. Obviously ARM microarchitectures are designed to run such code efficiently (apart from the wasted I-cache / D-cache footprint, which is unavoidable). But presumably the code-size / instruction-count benefit is usually worth it. I'm not sure why this is common on ARM but not in other RISC ISAs (at least I think it's not common on MIPS / PowerPC). Modern ARM has good support for building arbitrary 32-bit constants with two instructions, and many bit patterns can be created with a single instruction (using the barrel shifter with an immediate mov or mvn).
But there's no reason to expect an x86 microarchitecture to take special care to handle such cases more efficiently than the default, because this pattern isn't common on x86. It isn't used by compilers, and the only RIP-relative addressing mode uses a rel32 displacement, so there isn't even a code-size advantage to placing data very close to the code. Only locality is in its favour, for L3 / L2 (and DRAM pages).
That doesn't mean we should expect it to be slow, only that we can't infer x86 behaviour from the fact that ARM CPUs need to support it efficiently. ARM CPUs that use / support paging might have a TLB allocation policy that favours this pattern, e.g. allocating an L2 TLB entry on an iTLB miss. (If they use a multi-level TLB at all.)
For example, could the CPU's fetch mechanism be confused by fetching past the ret in my example and trying to interpret the lookup table as (meaningless) instructions?
Speculative execution past a ret usually isn't a problem, it seems.
I tried to test this once (with a fairly poor test that I didn't put much effort into), but I couldn't find any effect. Maybe my test wasn't big enough to defeat branch prediction for the ret, or maybe speculative execution simply doesn't continue past a ret the way it does for other indirect branches. (Even if a call instruction is the first instruction of a function, and the callee is laid out contiguously right before it, the correct return address is the instruction after the call, not the call itself. So speculatively executing the instructions following a ret can only be useful when something pushes a fake return address onto the stack.)
If the last instruction before your data was an indirect jmp, then it would make sense to worry about how the data decodes as instructions. You could block speculative execution by placing an int3 or ud2 after it, before your data, as the Intel Optimization Guide recommends:
3.4.1.6 Branch Type Selection
blah blah fall-through is the default prediction for indirect branches (but they don't mention ret). Bogus instructions can slow down branch recovery.
Also, data immediately following indirect branches may appear as branches to the branch prediction hardware, which can branch off to execute other data pages. This can lead to subsequent self-modifying code problems.
Assembly/Compiler Coding Rule 14. (M impact, L generality) When indirect branches are present, try to put the most likely target of an indirect branch immediately following the indirect branch. Alternatively, if indirect branches are common but they cannot be predicted by branch prediction hardware, then follow the indirect branch with a ud2 instruction, which will stop the processor from decoding down the fall-through path.
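For example, a minimal hypothetical fragment (labels invented for illustration) where a ud2 fences off a table stored right after the indirect jmp that uses it:

    dispatch:
        jmp   [jump_table + rax*8]   ; indirect branch; static prediction is fall-through
        ud2                          ; stop the front-end from decoding the table bytes as code
    jump_table:
        dq    case0, case1, case2, case3
    case0: ret
    case1: ret
    case2: ret
    case3: ret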
Read-only access should be fine for the pipeline itself. Only write access near (maybe within 2k or 4k of) EIP/RIP triggers self-modifying-code machine nukes / pipeline clears. (So obviously don't do this for non-const static data, which you normally can't anyway, since code pages are usually mapped read / exec but not writeable.)
If your LUT is small enough, use it as immediate data instead of a load
If cold-cache performance matters, you can store your LUT to the stack with mov r64, imm64 / mov [m64], r64 instruction pairs. (Or maybe mov r/m32, imm32.)
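A minimal sketch of that (a hypothetical nibble-popcount table, the same one as in the earlier layout sketch; SysV x86-64, using the red zone below rsp):

    popcount_nibble:                        ; index in edi (0..15), result in eax
        mov     rax, 0x0302020102010100     ; popcounts of 0..7, packed as 8 bytes
        mov     [rsp-16], rax               ; red zone is fine for a leaf function (not on Windows)
        mov     rax, 0x0403030203020201     ; popcounts of 8..15
        mov     [rsp-8], rax
        movzx   eax, byte [rsp-16+rdi]      ; assumes the index is already zero-extended into rdi
        ret

mov r64, imm64 is 10 bytes, so this costs code size, but on a cold call the table arrives as immediate data in the instruction stream instead of needing a load from a possibly-cold data line.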
Immediate bitmaps are great to set up for the bt instruction. As @Ross points out, you could do:
    mov   eax, 0x5555
    bt    eax, edi
    setc  al
Or with the bitmap as an immediate operand to a test instruction:
    xor   eax, eax
    bts   eax, edi        ; 1U << arg
    test  eax, 0x5555
    setnz al              ; ZF clear if the selected bit is set in the mask
Compilers will use this trick for switch when many case labels run the same code, as in this case (on Godbolt with gcc and clang).
Another example: a vowel / consonant bitmap in a code-golf answer for classifying strings according to whether their vowel / consonant pattern is palindromic.
In really hot functions it's often better to do a load instead of using mov-immediate, especially if it saves multiple mov instructions. But even saving one fused-domain uop by using a memory source operand for an ALU instruction can be worth it, so there's a trade-off between the hot-cache and cold-cache cases. (But never use bt with a memory operand: its performance is garbage because of the crazy-CISC semantics of indexing a bit string instead of wrapping within the dword or qword selected by the addressing mode, the way it does with a register destination.)
Or just compute instead of using a LUT at all: test eax,eax / setp al, since parity (of the low byte only) is supported in hardware on x86. Other architectures with a hardware popcnt could use it and take the low bit for even / odd parity.
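For instance, parity of a full 32-bit value can also be computed on x86 with popcnt where it's available, with no table at all (a trivial sketch):

    parity32:                   ; returns 1 in eax if edi has an odd number of set bits
        popcnt  eax, edi
        and     eax, 1
        ret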
But for other problems, a LUT or a small vector constant can save a lot of work. (Maybe stored compressed, to be loaded with a broadcast load such as movddup or vpbroadcastd, or expanded per element with pmovsx / pmovzx.)
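As a sketch of that last idea (hypothetical constants; VEX encodings used throughout, so AVX2 assumed):

    section .rodata
    bytes4:  db 1, 2, 3, 4              ; stored as 4 bytes, used as 4 dwords
    dword1:  dd 0x3f800000              ; float 1.0, broadcast to every lane
    qword1:  dq 0x4000000040000000      ; a pair of 2.0f floats, duplicated as one 64-bit chunk

    section .text
    load_constants:
        vpmovzxbd    xmm0, [bytes4]     ; expand 4 bytes to 4 packed dwords
        vpbroadcastd ymm1, [dword1]     ; broadcast one dword to all 8 lanes
        vmovddup     xmm2, [qword1]     ; load 8 bytes, duplicate into both qword halves
        vzeroupper
        ret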