The Windows x86-64 calling convention is quite simple and allows you to write a wrapper function that knows nothing of types. Just load the first 32 bytes of the arguments into the registers and copy the rest onto the stack.
You definitely need to make the function call from asm; it can't work reliably to make a bunch of separate function calls like fill_register_xmm0 and hope that the compiler doesn't clobber any of those registers. The C compiler emits instructions that use registers as part of its normal job, including when passing arguments to functions like fill_register_xmm0.
The only alternative would be to write a C statement with a function call that has all arguments of the correct type, to force the compiler to emit code for a normal function call. If there are only a few possible combinations of arguments, putting those calls in if() blocks might be good.
And BTW, movsd xmm0, variable probably assembles to movsd xmm0, xmm0, because the first function arg is passed in XMM0 if it's FP.
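A minimal sketch of that if()-style alternative, written as a switch over a small, fixed set of signatures. The enum signature tags, the union arg type, call_typed and the three example targets are all made up for illustration, and the sketch assumes every target returns double. Each case casts the type-erased pointer back to its real type, so the compiler emits an ordinary call with the arguments in the right registers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical set of signatures this dispatcher supports. */
enum signature { SIG_INT_INT, SIG_DOUBLE, SIG_INT_DOUBLE };

/* One argument value; the signature tag says which member is live. */
union arg { int i; double d; };

typedef void (*generic_fn)(void);   /* type-erased function pointer */

/* Call fn with properly typed arguments, one ordinary call per signature.
   Assumption for this sketch: every target returns double. */
double call_typed(enum signature sig, generic_fn fn, const union arg *a)
{
    assert(fn != NULL && a != NULL);
    switch (sig) {
    case SIG_INT_INT:
        return ((double (*)(int, int))fn)(a[0].i, a[1].i);
    case SIG_DOUBLE:
        return ((double (*)(double))fn)(a[0].d);
    case SIG_INT_DOUBLE:
        return ((double (*)(int, double))fn)(a[0].i, a[1].d);
    }
    return 0.0;   /* not reached for valid tags */
}

/* Example targets, invented for this sketch. */
double add_ints(int x, int y)  { return (double)(x + y); }
double half(double x)          { return x / 2.0; }
double scale(int n, double x)  { return n * x; }
```

This only scales to a handful of signatures; beyond that the asm wrapper is the practical option.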
In C, prepare a buffer with arguments (as in the 32-bit case).
Each one needs to be padded to 8 bytes if it isn't already. See MS's docs for x86-64 __fastcall. (Note that x86-64 __vectorcall passes __m128 args by value in registers, but for __fastcall it's strictly true that the args form an array of 8-byte values, after the register args. And storing those into the shadow space creates a full array of all the args.)
Any argument that doesn't fit in 8 bytes, or isn't 1, 2, 4, or 8 bytes, must be passed by reference. There is no attempt to spread a single argument across multiple registers.
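The size rule above can be sketched in C as follows. passes_by_value and fill_slot are hypothetical helper names, not part of any API. Zeroing the unused high bytes of a slot is only done here to keep the sketch deterministic, and a real caller would pass a pointer to a temporary copy that it owns rather than to the original object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if an object of this size travels by value in one 8-byte slot. */
int passes_by_value(size_t size)
{
    return size == 1 || size == 2 || size == 4 || size == 8;
}

/* Fill one 8-byte argument slot (hypothetical helper).
   By-value objects are copied into the start of the slot, which is the
   low-order end on little-endian x86-64. Everything else goes by reference:
   the slot holds the object's address. */
void fill_slot(uint64_t *slot, const void *obj, size_t size)
{
    assert(slot != NULL && obj != NULL);
    *slot = 0;                               /* padding, for determinism only */
    if (passes_by_value(size)) {
        memcpy(slot, obj, size);
    } else {
        *slot = (uint64_t)(uintptr_t)obj;    /* pointer to the object */
    }
}
```

For example, a 4-byte int occupies the first 4 bytes of its slot, while a 16-byte struct leaves only its address in the slot.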
But the key thing that makes variadic functions easy in the Windows convention also works here: the register used for the 2nd arg doesn't depend on the type of the first. That is, if an FP arg is the first arg, it uses up an integer-register arg-passing slot. So you can only have up to 4 register args, not 4 integer and 4 FP.
If the 4th arg is an integer, it goes in R9, even if it's the first integer arg. That's unlike the x86-64 System V calling convention, where the first integer arg goes in rdi no matter how many earlier FP args are in registers and/or on the stack.
So the asm wrapper that calls the function can load the first 8 bytes into both RCX and XMM0, the next 8 bytes into both RDX and XMM1, and so on for the first four slots! (Variadic functions already require this, so that a callee doesn't need to know whether to store the integer or the FP register to form that arg array. MS optimized the calling convention for simplicity of variadic callee functions, at the expense of efficiency for functions with a mix of integer and FP args.)
The C side that puts all the args into a buffer might look like this:
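To make the position rule concrete, here is a small C lookup of which register holds a given argument. The function name win64_arg_register and its is_fp flag are invented for this sketch; the register names themselves are the ones the Windows x64 convention uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Register that holds argument number `index` (0-based) in the Windows x64
   convention, or NULL if the argument goes on the stack. The position alone
   picks the slot; the type only picks the integer or XMM register of it. */
const char *win64_arg_register(size_t index, int is_fp)
{
    static const char *const int_regs[4] = { "rcx", "rdx", "r8", "r9" };
    static const char *const fp_regs[4]  = { "xmm0", "xmm1", "xmm2", "xmm3" };
    if (index >= 4)
        return NULL;
    return is_fp ? fp_regs[index] : int_regs[index];
}

/* Example: in f(double, double, double, int) the int is the 4th arg,
   so it lands in r9 even though it is the first integer arg. */
void example(void)
{
    assert(strcmp(win64_arg_register(3, 0), "r9") == 0);
    assert(strcmp(win64_arg_register(0, 1), "xmm0") == 0);
}
```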
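As a rough C model of that step (not the wrapper itself, which has to be asm), the sketch below copies each of the first four slots into both a simulated integer register and a simulated XMM register without looking at types. struct arg_regs, load_arg_regs and xmm_as_double are hypothetical names, and the XMM registers are modelled by their low 64 bits only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simulated register-argument slots: each slot exists in both files. */
struct arg_regs {
    uint64_t gpr[4];   /* rcx, rdx, r8, r9 */
    uint64_t xmm[4];   /* low 64 bits of xmm0..xmm3 */
};

/* Copy the first four 8-byte slots of argbuf into both register files.
   No type information is needed: the callee reads whichever file matches
   the type it expects in that position. */
void load_arg_regs(struct arg_regs *r, const uint64_t *argbuf)
{
    assert(r != NULL && argbuf != NULL);
    for (size_t i = 0; i < 4; i++) {
        r->gpr[i] = argbuf[i];
        r->xmm[i] = argbuf[i];
    }
}

/* What a callee expecting a double in slot i would see. */
double xmm_as_double(const struct arg_regs *r, size_t i)
{
    double d;
    memcpy(&d, &r->xmm[i], sizeof d);
    return d;
}
```

With a double in slot 0 and an integer in slot 1, the callee finds the double in the XMM0 model and the integer in the RDX model, and simply ignores the other copy of each.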
    #include <stdalign.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    // asm wrapper: passes argbytes bytes of args from argbuf to funcpointer
    int asmwrapper(const char *argbuf, size_t argbytes, void (*funcpointer)(...));

    void somefunc() {
        alignas(16) uint64_t argbuf[256/8];  // or char argbuf[256].  But if you choose not to use alignas, then uint64_t will still give 8-byte alignment
        char *argp = (char*)argbuf;
        char *const argend = (char*)argbuf + sizeof(argbuf);  // one past the last slot

        for( ; argp < argend ; argp += 8) {   // one 8-byte slot per arg
            if (figure_out_an_arg()) {
                int foo = get_int_arg();
                memcpy(argp, &foo, sizeof(foo));
            } else if(bar) {
                double foo = get_double_arg();
                memcpy(argp, &foo, sizeof(foo));
            } else
                ... memcpy whatever size
                // or allocate space to pass by ref and memcpy a pointer
            // (leave the loop once there are no more args)
        }
        if (argp == argend) {
            // error, ran out of space for args
        }

        asmwrapper((const char*)argbuf, (size_t)(argp - (char*)argbuf), funcpointer);
    }
Unfortunately, I don't think we can directly use argbuf on the stack as the args + shadow space for a function call. We have no way to stop the compiler from putting something valuable below argbuf. If we could, we could just set rsp to the bottom of it (and save the return address somewhere, maybe at the top of argbuf, by reserving some space for use by asm).
Anyway, just copying the whole buffer will work. Or actually, load the first 32 bytes into registers (both integer and FP) and copy only the rest. The shadow space doesn't need to be initialized.
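A small C sketch of the arithmetic the wrapper has to do for this. The function names and SHADOW_BYTES / STACK_ALIGN are made up here, and the sketch assumes rsp is 16-byte aligned before the space is reserved. It reserves 32 bytes of shadow space even when there are fewer args, keeps rsp 16-byte aligned at the call, and copies only the bytes past the first 32.

```c
#include <assert.h>
#include <stddef.h>

#define SHADOW_BYTES ((size_t)32)   /* home space for the 4 register args */
#define STACK_ALIGN  ((size_t)16)   /* rsp alignment required at the call */

/* Bytes to reserve below the current rsp for one call: shadow space plus any
   stack args, rounded up to a multiple of 16. Assumes rsp is already 16-byte
   aligned before the reservation. */
size_t win64_call_frame_bytes(size_t argbytes)
{
    assert(argbytes % 8 == 0);   /* args are whole 8-byte slots */
    size_t total = argbytes > SHADOW_BYTES ? argbytes : SHADOW_BYTES;
    return (total + STACK_ALIGN - 1) & ~(STACK_ALIGN - 1);
}

/* Bytes that actually need copying to the stack: the first 32 bytes travel
   in registers, and the shadow space itself can stay uninitialized. */
size_t win64_stack_copy_bytes(size_t argbytes)
{
    return argbytes > SHADOW_BYTES ? argbytes - SHADOW_BYTES : 0;
}
```

With 5 args (40 bytes), that reserves 48 bytes and copies 8 of them, to rsp+32.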
argbuf could be a VLA if you knew ahead of time how big it needed to be, but 256 bytes is pretty small. It's not like reading past the end of it can be a problem: it can't be at the end of a page with unmapped memory after it, because our parent function's stack frame definitely takes up some space.
The asm wrapper assembles (on Godbolt with NASM), but I haven't tested it.
It should perform fairly well, but if you get mispredicts around the cutoff from <= 32 bytes to > 32 bytes, change the branching so that it always copies an extra 16 bytes. (Uncomment the cmp / cmovb in the Godbolt version, but the copy loop will still need to start at 32 bytes into each buffer.)
If you often pass very few args, the 16-byte loads might hit a store-forwarding stall from two narrow stores to one wide reload, which costs about 8 extra cycles of latency. This isn't normally a throughput problem, but it can increase the latency before the called function can access its args. If out-of-order execution can't hide that, then it's worth using more load uops to load each 8-byte arg separately. (Especially into integer registers, and then from there to XMM, if the args are mostly integer. That will have lower latency than mem -> xmm -> integer.)
If you have more than a couple of args, though, hopefully the first few have committed to L1d and no longer need store forwarding by the time the asm wrapper runs. Or there's enough copying of later args that the first 2 args finish their load + ALU chain early enough not to delay the critical path inside the called function.
Of course, if performance was a huge issue, you'd write the code that figures out the args in asm, so you wouldn't need this copying, or use a library interface with a fixed function signature that a C compiler can call directly. I did try to make this suck as little as possible on modern Intel / AMD mainstream CPUs ( http://agner.org/optimize/ ), but I didn't benchmark or tune it, so it could probably be improved with some time spent profiling it, especially for a real use case.
If you know that FP args aren't a possibility for the first 4, you can simplify by just loading the integer registers.