Spilling to floating point registers instead of the stack

I have a function that should be as fast as possible and uses only integer operations. It runs on the AMD64 architecture, and I need to do some push / pops to have enough registers to work with. Now I wonder: the x64 ABI states that the first four floating point registers (XMM0, XMM1, XMM2 and XMM3) are volatile and do not need to be preserved across function calls.

So I figured I could keep the 64-bit registers I need to save in the lower 64 bits of these registers (i.e. MM0, MM1, ...) via movq (MMX or SSE instruction set) instead of using the stack, sparing me a few loads / stores. I also would not need to clear the FPU state with EMMS, which would defeat the purpose, since I am not actually operating on the floating point registers but only using them as storage (and, in any case, the x87 unit is almost never used under x64, as it has essentially been replaced by SSE).

I made the change and it works (no glitches, and I observed a 4% performance improvement), but I am wondering whether this "hack" is really sound or whether it has side effects I might have missed (such as corrupting the FPU state even though I do not use it, that kind of thing). And will a load / store to an FPU register always be faster than a load / store to memory on any current architecture?

And yes, this optimization really is needed. And frankly, it is not something that would seriously hurt the maintainability of the code; a single-line comment would be enough to explain the trick. So if I can shave off a couple of clock cycles per byte for free without any unforeseen consequences, I will gladly take them :)

Thanks.

+6
1 answer

The EMMS instruction is only needed to clear the x87 state after MMX operations. SSE instructions do not require it, so there is no conflict here.
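To make the SSE form of the trick concrete, here is a minimal sketch in C using SSE2 intrinsics rather than hand-written assembly. The helper names `xmm_roundtrip` and `xmm_roundtrip_self_check` are hypothetical and are not from the question. On x86-64, `_mm_cvtsi64_si128` and `_mm_cvtsi128_si64` correspond to the `movq` between a general-purpose register and an XMM register, and no EMMS is involved because no MMX register is touched. On other targets the sketch simply falls back to a plain copy.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>          /* SSE2 intrinsics */
#define HAVE_SSE2_MOVQ 1
#else
#define HAVE_SSE2_MOVQ 0        /* assumption: non-x86-64 builds just copy */
#endif

/* Hypothetical helper: park a 64-bit integer in the low half of an XMM
 * value and read it back, instead of a push / pop pair on the stack. */
static uint64_t xmm_roundtrip(uint64_t value)
{
#if HAVE_SSE2_MOVQ
    /* GPR -> XMM (low 64 bits; the upper 64 bits are zeroed). */
    __m128i parked = _mm_cvtsi64_si128((long long)value);
    /* XMM -> GPR. */
    return (uint64_t)_mm_cvtsi128_si64(parked);
#else
    return value;
#endif
}

/* Hypothetical self-check: the value must survive the round trip. */
static void xmm_roundtrip_self_check(void)
{
    assert(xmm_roundtrip(0x0123456789ABCDEFULL) == 0x0123456789ABCDEFULL);
}
```

Note that an optimizing compiler is free to remove the round trip entirely, so this only illustrates the instruction pair; in hand-written assembly the two moves would bracket the code that needs the extra general-purpose register.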

Of course, you should keep in mind that different compilers and operating systems use different calling conventions, and some may handle these four registers differently.

With that caveat, I do not see a problem with this approach. You are using the registers exactly as the ABI allows.

And assuming this is written in assembly, there is no need to consider whether it interferes with compiler optimization (a C / C++ function that drops into inline assembly and starts naming specific registers makes it much harder for the compiler to optimize the code).

+3

Source: https://habr.com/ru/post/920489/

