How to set function arguments in assembly at run time in a 64-bit application on Windows?

I am trying to set the arguments of a generic function from assembly code. The arguments of this function, which lives in a DLL, are not known at compile time. At run time, a pointer to the function is obtained with GetProcAddress, but its arguments are still unknown. At run time I can determine the arguments, both value and type, from a data file (not from a header file or anything else that can be included or compiled). I found a good example of how to solve this for 32-bit ( C Pass arguments as a list of void pointers for the imported function from LoadLibrary () ), but that example does not work for 64-bit, because there you cannot just fill the stack: you have to fill the registers. So I tried to use assembly code to populate the registers, but I have not succeeded so far. I call the assembly code from C. I am using VS2015 and MASM (64-bit). The C code below works fine, but the assembly code does not. So what is wrong with the assembly code? Thanks in advance.

C code:

    ...
    void fill_register_xmm0(double); // proto of assembly function
    ...
    // code determining the pointer to a func returned by the GetProcAddress()
    ...
    double dVal = 12.0;
    int v;
    fill_register_xmm0(dVal);
    v = func->func_i(); // integer function that will use the dVal
    ...

Assembly code in a separate .asm file (MASM syntax):

    TITLE fill_register_xmm0
    .code
    option prologue:none ; turn off default prologue creation
    option epilogue:none ; turn off default epilogue creation
    fill_register_xmm0 PROC variable: REAL8 ; REAL8=equivalent to double or float64
        movsd xmm0, variable ; fill value of variable into xmm0
        ret
    fill_register_xmm0 ENDP
    option prologue:PrologueDef ; turn on default prologue creation
    option epilogue:EpilogueDef ; turn on default epilogue creation
    END
+2
2 answers

So, you need to call a function (in a DLL), but only at run time can you determine the number and types of its parameters. You then need to place the parameters on the stack and/or in registers, depending on the ABI / calling convention.

I would use the following approach: some component of your program determines the number and types of the parameters. Suppose it builds a list of {type, value}, {type, value}, ...

You then pass this list to a function that prepares the ABI call. This will be an assembler function. For a stack-based ABI (32-bit), it simply pushes the parameters onto the stack. For a register-based ABI, it can prepare the register values and save them as local variables ( add sp,nnn ), and after all the parameters have been prepared (possibly using registers needed for the call, hence saving them first), it loads the registers (a series of mov instructions) and executes the call instruction.
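The {type, value} list could be sketched in C roughly as follows. This is only an illustration: the names arg_type, arg_desc and pack_args are hypothetical, and the packing into 8-byte slots assumes the Windows x64 layout where every argument occupies one 8-byte slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tagged-argument record; all names are illustrative only. */
typedef enum { ARG_INT32, ARG_INT64, ARG_DOUBLE, ARG_POINTER } arg_type;

typedef struct {
    arg_type type;
    union {
        int32_t i32;
        int64_t i64;
        double  f64;
        void   *ptr;
    } value;
} arg_desc;

/* Copy each argument into its own 8-byte slot (unused high bytes are zero).
   Returns false if the slot buffer is too small for all arguments. */
bool pack_args(const arg_desc *args, size_t count,
               uint64_t *slots, size_t max_slots)
{
    assert(args != NULL && slots != NULL);
    if (count > max_slots)
        return false;

    for (size_t i = 0; i < count; i++) {
        switch (args[i].type) {
        case ARG_INT32:
            slots[i] = (uint64_t)(uint32_t)args[i].value.i32;  /* low 32 bits */
            break;
        case ARG_INT64:
            slots[i] = (uint64_t)args[i].value.i64;
            break;
        case ARG_DOUBLE:
            /* copy the raw bit pattern of the double, no conversion */
            memcpy(&slots[i], &args[i].value.f64, sizeof(double));
            break;
        case ARG_POINTER:
            slots[i] = (uint64_t)(uintptr_t)args[i].value.ptr;
            break;
        }
    }
    return true;
}
```

The resulting slot array is what an assembler routine would then consume; the routine itself still has to be written in asm, since C cannot choose which registers get loaded.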

+1

The Windows x86-64 calling convention is quite simple and allows you to write a wrapper function that knows nothing of types. Just load the first 32 bytes of the arguments into the registers and copy the rest onto the stack.


You definitely need to make the function call from asm; it cannot work reliably to make a bunch of function calls such as fill_register_xmm0 and hope that the compiler does not clobber any of those registers. The C compiler emits instructions that use registers as part of its normal job, including when passing arguments to functions like fill_register_xmm0 .

The only alternative would be to write a C statement with a function call that has all the arguments with the correct types, so that the compiler emits code for a normal function call. If there are only a few possible combinations of arguments, putting such calls in if() blocks might be good.
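A minimal sketch of that pure-C alternative, assuming only three possible signatures. The names generic_fn, signature, call_args, call_by_signature and sample_scale are made up for illustration; in the real program the pointer would come from GetProcAddress.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the untyped pointer returned by GetProcAddress (assumption). */
typedef void (*generic_fn)(void);

/* Hypothetical set of signatures the data file is allowed to describe. */
typedef enum { SIG_INT_VOID, SIG_INT_DOUBLE, SIG_INT_INT_DOUBLE } signature;

typedef struct {
    int    i;
    double d;
} call_args;

/* Cast the pointer to the matching prototype so the compiler itself
   emits a normal call with the correct registers for each case. */
int call_by_signature(generic_fn fn, signature sig, const call_args *a)
{
    assert(fn != NULL && a != NULL);
    switch (sig) {
    case SIG_INT_VOID:
        return ((int (*)(void))fn)();
    case SIG_INT_DOUBLE:
        return ((int (*)(double))fn)(a->d);
    case SIG_INT_INT_DOUBLE:
        return ((int (*)(int, double))fn)(a->i, a->d);
    }
    return -1;  /* unknown signature */
}

/* Example target standing in for a DLL export with signature int(double). */
int sample_scale(double x)
{
    return (int)(x * 2.0);
}
```

Calling call_by_signature((generic_fn)sample_scale, SIG_INT_DOUBLE, &args) reaches sample_scale through a correctly typed call. This stops scaling once the number of signature combinations grows, which is where the asm wrapper becomes the better option.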

And BTW, movsd xmm0, variable probably assembles to movsd xmm0, xmm0 , because the first function arg is passed in XMM0 if it is FP.


In C, prepare a buffer with arguments (as in the 32-bit case).

Each one needs to be padded to 8 bytes, if it isn't already. See the MS docs for x86-64 __fastcall . (Note that x86-64 __vectorcall passes __m128 args by value in registers, but for __fastcall it is strictly true that the args form an array of 8-byte values after the register args. And storing those into the shadow space creates a full array of all the args.)

Any argument that does not fit in 8 bytes, or whose size is not 1, 2, 4, or 8 bytes, must be passed by reference. There is no attempt to spread a single argument across multiple registers.
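When building the buffer from run-time type information, that size rule reduces to a one-line check. A small sketch (the enum and function names are hypothetical):

```c
#include <stddef.h>

/* How one argument travels under the Windows x64 convention. */
typedef enum { PASS_BY_VALUE, PASS_BY_REFERENCE } pass_kind;

/* Only objects of exactly 1, 2, 4 or 8 bytes go directly into an 8-byte slot;
   everything else is copied somewhere and a pointer to it is passed instead. */
pass_kind how_to_pass(size_t size_in_bytes)
{
    switch (size_in_bytes) {
    case 1:
    case 2:
    case 4:
    case 8:
        return PASS_BY_VALUE;
    default:
        return PASS_BY_REFERENCE;
    }
}
```

For a by-reference argument, the slot in the buffer then holds the address of a caller-owned copy rather than the value itself.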

But the key thing that makes variadic functions easy in the Windows convention also works here: the register used for the second arg does not depend on the type of the first. That is, if an FP arg is the first arg, it also uses up an integer register arg-passing slot. So you can have at most 4 register args, not 4 integer and 4 FP.

If the fourth arg is an integer, it goes in R9 , even if it is the first integer arg. This is unlike the x86-64 System V calling convention, where the first integer arg goes in rdi no matter how many earlier FP args are in registers and/or on the stack.
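The positional rule can be written down as a small lookup. This sketch only returns register names for illustration; the function name is made up.

```c
#include <stdbool.h>
#include <stddef.h>

/* Windows x64: the argument's position alone selects the slot.
   The type only decides whether the integer or the XMM register
   of that slot is used. Arguments past the fourth go on the stack. */
const char *win64_arg_register(size_t index, bool is_fp)
{
    static const char *const int_regs[4] = { "rcx", "rdx", "r8", "r9" };
    static const char *const fp_regs[4]  = { "xmm0", "xmm1", "xmm2", "xmm3" };

    if (index >= 4)
        return "stack";
    return is_fp ? fp_regs[index] : int_regs[index];
}
```

For a signature like f(double, double, double, int), index 3 with is_fp == false gives "r9", matching the case described above.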

Thus, the asm wrapper that calls the function can load the first 8 bytes into both an integer register and an FP register, and likewise for each of the next three 8-byte slots! (Variadic functions already require this, so the callee does not need to know whether to store the integer or the FP register to form that arg array. MS optimized the calling convention for simplicity of variadic callee functions, at the expense of efficiency for functions with a mix of integer and FP args.)

The C side, which puts all the args into a buffer, might look like this:

    #include <stdalign.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    // asm wrapper: buffer, number of arg bytes used, function to call
    int asmwrapper(const char *argbuf, size_t arg_bytes, void (*funcpointer)(void));

    void somefunc(void)
    {
        alignas(16) uint64_t argbuf[256/8];  // or char argbuf[256]. But if you choose not to use alignas, then uint64_t will still give 8-byte alignment
        char *argp = (char *)argbuf;
        char *argend = (char *)argbuf + sizeof(argbuf);

        for ( ; argp < argend; argp += 8) {
            if (figure_out_an_arg()) {
                int foo = get_int_arg();
                memcpy(argp, &foo, sizeof(foo));
            } else if (bar) {
                double foo = get_double_arg();
                memcpy(argp, &foo, sizeof(foo));
            }
            // else ... memcpy whatever size,
            // or allocate space to pass by ref and memcpy a pointer
        }
        if (argp == argend) {
            // error, ran out of space for args
        }

        asmwrapper((const char *)argbuf, (size_t)(argp - (char *)argbuf), funcpointer);
    }

Unfortunately, I don't think we can directly use argbuf on the stack as the args + shadow space for the function call. We have no way to stop the compiler from putting something valuable below argbuf , which is what would let us simply set rsp to the bottom of it (and save the return address somewhere, perhaps at the top of argbuf , by reserving some space there for use by the asm).

Anyway, just copying the whole buffer will work. Or, in fact, load the first 32 bytes into registers (both integer and FP) and just copy the rest. Shadow space does not need to be initialized.
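The stack bookkeeping this implies can be stated in C, purely to illustrate the arithmetic the wrapper has to do in asm. The function names are hypothetical, and the 16-byte rounding assumes the stack pointer is already 16-byte aligned at that point.

```c
#include <stddef.h>

#define SHADOW_BYTES 32u  /* home space for the four register args */

/* Bytes to reserve on the stack before the call: at least the shadow
   space, rounded up so the stack stays 16-byte aligned. */
size_t stack_reserve_bytes(size_t arg_bytes)
{
    size_t need = arg_bytes > SHADOW_BYTES ? arg_bytes : SHADOW_BYTES;
    return (need + 15u) & ~(size_t)15u;
}

/* Bytes that really have to be copied: everything past the first 32,
   because those first 32 bytes travel in registers. */
size_t stack_copy_bytes(size_t arg_bytes)
{
    return arg_bytes > SHADOW_BYTES ? arg_bytes - SHADOW_BYTES : 0;
}
```

For example, 40 bytes of args (five slots) means reserving 48 bytes and copying only the last 8, while anything up to 32 bytes needs just the uninitialized shadow space.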

argbuf could be a VLA if you knew in advance how big it needed to be, but 256 bytes is pretty small. Reading past the end of it can't be a problem: it cannot sit at the end of a page followed by unmapped memory, because the stack frame of our parent function definitely takes up some space.

    ;; NASM syntax.  For MASM just rename the local labels and add whatever PROC / ENDP is needed.
    ;; UNTESTED

    ;; rcx: argbuf
    ;; rdx: length in bytes of the args.  0..256, zero-extended to 64 bits
    ;; r8 : function pointer

    ;; reserve rdx bytes of space for arg passing
    ;; load first 32 bytes of argbuf into integer and FP arg-passing registers
    ;; copy the rest as stack-args above the shadow space
    global asmwrapper
    asmwrapper:
        push    rbp
        mov     rbp, rsp        ; so we can efficiently restore the stack later

        mov     r10, r8         ; move function pointer to a volatile but non-arg-passing register

        ; load *both* xmm0-3 and rcx,rdx,r8,r9 from the first 32 bytes of argbuf
        ; regardless of types or whether there were that many arg bytes
        ; All bytes are loaded into registers early, some reg->reg transfers are done later
        ; when we're done with more registers.

        ; movsd   xmm0, [rcx]
        ; movsd   xmm1, [rcx+8]
        movaps  xmm0, [rcx]     ; 16-byte alignment required for argbuf.  Use movups to allow misalignment if you want
        movhlps xmm1, xmm0      ; use some ALU instructions instead of just loads
        ; rcx,rdx can't be set yet, still in use for wrapper args

        movaps  xmm2, [rcx+16]  ; it's ok to leave garbage in the high 64-bits of an XMM passing a float or double.
        ;movhlps xmm3, xmm2     ; the copyloop uses xmm3: do this later
        movq    r8, xmm2
        mov     r9, [rcx+24]

        mov     eax, 32
        cmp     edx, eax
        jbe     .small_args     ; no copying needed, just shadow space

        sub     rsp, rdx
        and     rsp, -16        ; reserve extra space, realigning the stack by 16

        ; rax=32 on entry, start copying just above shadow space (which doesn't need to be copied)
    .copyloop:                  ; do {
        movaps  xmm3, [rcx+rax]
        movaps  [rsp+rax], xmm3 ; indexed addressing modes aren't always optimal, but this loop only runs a couple times.
        add     eax, 16
        cmp     eax, edx
        jb      .copyloop       ; } while(bytes_copied < arg_bytes);

    .done_arg_copying:
        ; xmm0,xmm1 have the first 2 qwords of args
        movq    rcx, xmm0       ; RCX NO LONGER POINTS AT argbuf
        movq    rdx, xmm1

        ; xmm2 still has the 2nd 16 bytes of args
        ;movhlps xmm3, xmm2     ; don't use: false dependency on old value and we just used it.
        pshufd  xmm3, xmm2, 0xee ; xmm3 = high 64 bits of xmm2.  (0xee = _MM_SHUFFLE(3,2,3,2))
        ; movq   xmm3, r9       ; nah, can be multiple uops on AMD
        ; r8,r9 set earlier

        call    r10
        leave                   ; restore RSP to its value on entry
        ret

    ; could handle this branchlessly, but copy loop still needs to run zero times
    ; unless we bump up the min arg_bytes to 48 and sometimes copy an unnecessary 16 bytes
    ; As much work as possible is before the first branch, so it can happen while a mispredict recovers
    .small_args:
        sub     rsp, rax        ; reserve shadow space
        ;rsp still aligned by 16 after push rbp
        jmp     .done_arg_copying

    ;byte count.  This wrapper is 82 bytes; would be nice to fit it in 80 so we don't waste 14 bytes before the next function.
    ;eg maybe mov rcx, [rcx] instead of movq rcx, xmm0
    ;mov eax, $-asmwrapper
    align 16

It assembles ( on Godbolt with NASM ), but I have not tested it.

It should perform fairly well, but if you get branch mispredicts around the cutoff from <= 32 bytes to > 32 bytes, change the branching so that it always copies an extra 16 bytes. (Uncomment the cmp / cmovb in the Godbolt version, but the copy loop still has to start 32 bytes into each buffer.)

If you often pass very few args, the 16-byte loads may hit a store-forwarding stall from two narrow stores to one wide reload, which costs about 8 extra cycles of latency. This is not normally a throughput problem, but it can increase the latency before the called function can access its args. If out-of-order execution cannot hide that, it is worth using more load uops to load each 8-byte arg separately. (Especially into integer registers, and then from there into XMM, if the args are mostly integer. That will have lower latency than mem → xmm → integer.)

If you have more than a couple of args, hopefully the first few have already committed to L1d and no longer need store forwarding by the time the asm wrapper runs. Or there is enough copying of later args that the first 2 args finish their load + ALU chain early enough not to delay the critical path inside the called function.

Of course, if performance were a huge issue, you would write the code that figures out the args in asm, so you would not need this copying at all, or you would use a library interface with a fixed function signature that a C compiler can call directly. I tried to make this cost as little as possible on modern Intel / AMD CPUs ( http://agner.org/optimize/ ), but I did not benchmark or tune it, so it could probably be improved with some time spent profiling it, especially for some real-world use case.

If you know that FP args are not possible among the first 4, you can simplify it by loading only the integer registers.

+1

Source: https://habr.com/ru/post/983696/

