I have a function that needs to be as fast as possible and uses only integer operations. It targets the x86-64 (AMD64) architecture, and I have to do some push/pops to free up enough registers to work with. Now, the x64 ABI states that the first four floating-point registers (XMM0, XMM1, XMM2 and XMM3) are volatile and need not be preserved across function calls.
So I figured I could stash the 64-bit general-purpose registers I need to preserve in the low 64 bits of these XMM registers via movq (the SSE2 form, not the MMX form that writes MM0, MM1, ..., since those alias the x87 state) instead of spilling to the stack, saving several loads/stores. Moreover, I would not need to reset the FPU state with EMMS afterwards - which would defeat the purpose - since I never actually touch the x87/MMX registers; I use the XMM registers purely as storage (and in any case the x87 unit is almost never used under x64, having been essentially replaced by SSE).
I made the change and it works (no glitches, and a measured ~4% performance increase), but I am wondering whether this "hack" is really safe, or whether it has some subtle side effect I might have missed (FPU state corruption even though I don't use the FPU, something like that). And will a load/store to an XMM register always be faster than a load/store to memory on any current architecture?
And yes, this optimization is really needed. Frankly, it is not something that would seriously hurt the cost of maintaining the code; a single-line comment is enough to explain the trick. So if I can shave off a couple of cycles per byte for free, without any unforeseen consequences, I will gladly take them :)
Thanks.