Does a function with instructions before the entry-point label cause problems for anything (linking)?

This is really a linker / object-file question, but I'm tagging it assembly since compilers never do this. (Although maybe they could!)

Consider this function, where I want to handle one special case with a block of code that is in the same I-cache line as the function entry point. To avoid a jump in the usual fast path, is it safe (with respect to linking / shared libraries / other tools I haven't thought of) to put the code for it before the function's global symbol?

I know this is silly / overkill, see below. Mostly I was just curious. Regardless of whether this technique is useful for making code that actually runs faster in practice, I think it's an interesting question.

    .globl __nextafter_pjc        // double __nextafter_pjc(double x, double y)
    .p2align 6  // unrealistic 64B alignment, just for the sake of argument

    // GNU as local labels have the form  .L...
    .Lequal_or_unordered:
        jp      .Lunordered
        movaps  %xmm1, %xmm0    # ISO C11 requires returning y, not x. (matters for -0.0 == +0.0)
        ret

    ######### Function entry point / global symbol here #############
    // .p2align something // tuning for Sandybridge, maybe best to just leave this unaligned, since it's only 6B from the alignment boundary
    __nextafter_pjc:
        ucomisd %xmm1, %xmm0
        je      .Lequal_or_unordered

        xorps   %xmm3, %xmm3
        comisd  %xmm3, %xmm0    // x == +/-0.0 can be a special case: the sign bit may change
        je      .Lx_zero

        movq    %xmm0, %rax
        ...  // some mostly-branchless bit-ninjutsu that I have no idea how I'd get gcc to emit from C
        ret

    .Lx_zero:
        ...
        ret
    .Lunordered:
        ...
        ret

(BTW, I was messing around with asm for nextafter because I was curious how glibc's version was implemented. It turns out the current implementation compiles to some really nasty code with a ton of branches. For example, checking both inputs for NaN should be done with an FP compare, because that is super fast, especially in the non-NaN case.)


In disassembly output, the instructions before the label are grouped with the previous function's instructions, e.g.

    0000000000400ad0 <frame_dummy>:
    ...
      400af0:       5d                      pop    %rbp
      400af1:       e9 7a ff ff ff          jmpq   400a70 <register_tm_clones>
      400af6:       66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
      400afd:       00 00 00
      400b00:       7a 56                   jp     400b58 <__nextafter_pjc+0x52>
      400b02:       0f 28 c1                movaps %xmm1,%xmm0
      400b05:       c3                      retq

    0000000000400b06 <__nextafter_pjc>:
      400b06:       66 0f 2e c1             ucomisd %xmm1,%xmm0
      400b0a:       74 f4                   je     400b00 <frame_dummy+0x30>
      400b0c:       0f 57 db                xorps  %xmm3,%xmm3
      400b0f:       66 0f 2f c3             comisd %xmm3,%xmm0
      400b13:       74 4b                   je     400b60 <__nextafter_pjc+0x5a>
      400b15:       66 48 0f 7e c0          movq   %xmm0,%rax
      ...

Note that the 4th instruction in the main body, comisd, starts at 400b0f (and is not fully contained in the first 16B-aligned block that contains the function entry point). So doing exactly this may not be optimal for instruction fetch and decode on the fast path with no taken branches. This is just an example, though.

So this works, even at the start of a file. It confuses objdump and is not ideal in gdb (but that's not a big issue). ELF object files don't record symbol sizes anyway, so nm --print-size doesn't do anything. (And nm --size-sort --print-size, which tries to calculate symbol sizes, weirdly didn't include my function.)

I don't know much about Windows object files. Could anything worse happen there?

I'm a little worried about correctness here: does any tool ever try to copy an individual function out of an object file by taking the bytes from its symbol address to the next symbol address? Regular library archives (ar for static libraries) and linkers copy entire object files, right? Otherwise they couldn't be sure they were copying all the necessary static data.


This function is probably not called very often, and we want to minimize cache pollution (I$, uop cache, branch predictors). If anything, optimize for the uncached case with cold branch predictors.

This is probably silly, because the uncached case is likely to happen infrequently. However, if many functions are optimized this way, the total cache footprint will decrease, and perhaps they will all fit in cache.

Note that recent Intel CPUs don't do static branch prediction at all, so there is no reason to prefer forward branches for branches that are normally not taken.

Instead of defaulting to backward-taken / forward-not-taken for "unknown" branches that aren't in the BHT, my understanding of Agner Fog's microarch doc (branch prediction chapter) is that they don't check whether a branch is "new" or not. They simply use whatever entry is already in the BHT, without clearing it. This may not be exactly true for Nehalem, though.

1 answer

There is an easy way to make this look completely normal: put a non-global label before the code. This makes it look like (or actually be) a static helper function.

Non-global functions can call each other using whatever calling convention they want. C compilers can even produce code like this with link-time / whole-program optimization, or just by optimizing static functions within a compilation unit. Jumps (rather than calls) to another function are already used for tail-call optimization.
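As an illustration of the tail-call point, here is a minimal sketch of a global function that ends by jumping to a non-global helper. It assumes GNU as targeting x86-64 ELF; the names wrapper_pjc and helper_pjc and the arithmetic are hypothetical, chosen only to have something to jump to.

```asm
        .text
        .globl  wrapper_pjc                # hypothetical name, not from the question
        .type   wrapper_pjc, @function
wrapper_pjc:                               # double wrapper_pjc(double x)
        addsd   %xmm0, %xmm0               # x = x + x
        jmp     helper_pjc                 # tail call: the helper's ret returns to our caller
        .size   wrapper_pjc, .-wrapper_pjc

        .type   helper_pjc, @function      # no .globl, so the symbol stays local to this object file
helper_pjc:                                # hypothetical helper: returns sqrt(x)
        sqrtsd  %xmm0, %xmm0
        ret
        .size   helper_pjc, .-helper_pjc
```

The jmp is resolved at assembly time because both labels are in the same section, so the linker never sees a relocation for it.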

The "helper function" code can jump into the main function somewhere other than its entry point. I'm sure that's not a problem for linkers either. It would only break if a linker changed the distance between the helper and the main function (by inserting something between them) without adjusting the relative jumps that cross the gap it widened. I don't think any linker would insert anything like that in the first place, and doing so without fixing up the branches would clearly be a bug.

I'm not sure whether there are any pitfalls in generating .size ELF metadata. I think I've read that it matters for functions that will be linked into shared libraries.

The following should work fine with any tool that deals with object files:

    .globl __nextafter_pjc        // double __nextafter_pjc(double x, double y)
    .p2align 6  // unrealistic 64B alignment, just for the sake of argument

    nextafter_helper:  # not a local label, but not .globl either
    .Lequal_or_unordered:
        jp      .Lunordered
        movaps  %xmm1, %xmm0    # ISO C11 requires returning y, not x. (matters for -0.0 == +0.0)
        ret

    ######### Function entry point / global symbol here #############
    // .p2align something?
    __nextafter_pjc:
        ucomisd %xmm1, %xmm0
        je      .Lequal_or_unordered
        ...
        ret

We don't need both a plain label and a "local" label, but using different labels for the different purposes means fewer modifications are needed when changing things. (For example, you can move the .Lequal_or_unordered block somewhere else without renaming it back to a .L name and changing all the jumps that targeted it.) nextafter_equal_or_unordered would work as a single name.
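On the .size question, a minimal sketch of how .type and .size metadata could be attached to both symbols, assuming GNU as targeting x86-64 ELF. The body is a placeholder (the main path just returns x, and the NaN handling is dropped so the file assembles on its own), and ending each symbol's size where shown is my assumption rather than a tested recommendation.

```asm
        .text
        .globl  __nextafter_pjc
        .type   __nextafter_pjc, @function
        .type   nextafter_helper, @function   # non-global helper symbol

        .p2align 6
nextafter_helper:
.Lequal_or_unordered:
        movaps  %xmm1, %xmm0                  # return y
        ret
        .size   nextafter_helper, .-nextafter_helper

__nextafter_pjc:
        ucomisd %xmm1, %xmm0
        je      .Lequal_or_unordered          # backward jump into the helper
        ret                                   # placeholder: returns x unchanged
        .size   __nextafter_pjc, .-__nextafter_pjc
```

With sizes recorded this way, the bytes before the entry point belong to nextafter_helper, so size-aware tools should see two adjacent functions rather than stray instructions after the previous one.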


Source: https://habr.com/ru/post/1247317/

