GCC-generated FPU operations when casting unsigned integers to float

I want to perform division on the FPU in C, starting from integer values:

    float foo;
    uint32_t *ptr1, *ptr2;

    foo = (float)*(ptr1) / (float)*(ptr2);

Disassembled into NASM syntax (from an object file compiled with GCC), it looks like this:

        mov      rax, QWORD [ptr1]
        mov      eax, DWORD [rax]
        mov      eax, eax
        test     rax, rax
        js       ?_001
        pxor     xmm0, xmm0
        cvtsi2ss xmm0, rax
        jmp      ?_002
    ?_001:
        mov      rdx, rax
        shr      rdx, 1
        and      eax, 01H
        or       rdx, rax
        pxor     xmm0, xmm0
        cvtsi2ss xmm0, rdx
        addss    xmm0, xmm0
    ?_002:
        mov      rax, QWORD [ptr2]
        ; ... the same pattern repeats for ptr2

What does this "black magic" under ?_001 mean? Isn't cvtsi2ss alone sufficient to convert an integer to a float?

+5
2 answers

In general, cvtsi2ss does the trick — it converts a scalar integer (some sources read the mnemonic as "doubleword integer to scalar single", but "scalar integer" matches the naming of the other vector conversions) to a scalar single (float). But it expects a signed integer.
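To see why the signedness matters, here is a quick C sketch (the test value is my own illustration): if the compiler naively used a 32-bit signed conversion, any value above 0x7FFFFFFF would come out negative.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t big = 3000000000u;          /* sign bit is set when viewed as int32_t */
        float wrong = (float)(int32_t)big;   /* what a bare 32-bit cvtsi2ss would compute */
        float right = (float)big;            /* what the compiler must actually produce */
        printf("%f vs %f\n", wrong, right);  /* -1294967296.000000 vs 3000000000.000000 */
        return 0;
    }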

So this code

        mov      rdx, rax
        shr      rdx, 1
        and      eax, 01H
        or       rdx, rax
        pxor     xmm0, xmm0
        cvtsi2ss xmm0, rdx
        addss    xmm0, xmm0

handles the unsigned values that don't fit in the signed range (note the js jump: if the sign bit is set, this code executes; otherwise it is skipped). For a uint32_t, the sign bit is set when the value is greater than 0x7FFFFFFF.

So, the "magic" code does:

        mov      rdx, rax     ; copy the value from ptr1 into rdx
        shr      rdx, 1       ; divide by 2 - logical shift, not arithmetic, because the value is unsigned
        and      eax, 01H     ; save the least significant bit
        or       rdx, rax     ; fold this bit back into the halved value to keep rounding correct
        pxor     xmm0, xmm0
        cvtsi2ss xmm0, rdx    ; convert the halved value to float
        addss    xmm0, xmm0   ; add it to itself = multiply by 2
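In C, the whole sequence (the js test plus the ?_001 branch) amounts to roughly this — a sketch with my own function name, not the compiler's actual source:

    #include <stdint.h>

    /* Convert an unsigned 64-bit value to float using only signed conversions:
     * if the sign bit is set, halve the value so it fits in the signed range,
     * convert, then double the result. The OR with the low bit keeps
     * round-to-nearest behaving correctly. */
    float uint64_to_float(uint64_t x)
    {
        if ((int64_t)x >= 0)                   /* sign bit clear: convert directly */
            return (float)(int64_t)x;

        uint64_t halved = (x >> 1) | (x & 1);  /* the shr + and + or from the listing */
        float f = (float)(int64_t)halved;      /* now in signed range: cvtsi2ss is safe */
        return f + f;                          /* addss xmm0, xmm0 = multiply by 2 */
    }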

I'm not sure which compiler and which compilation options you used — with optimization enabled, GCC simply emits:

        cvtsi2ssq xmm0, rbx
        cvtsi2ssq xmm1, rax
        divss     xmm0, xmm1

Hope this helps.

+5

You should not be studying unoptimized code; that's a waste of time. When the optimizer is disabled, compilers generate a bunch of pointless code for various reasons: to compile faster, to make it easier to set breakpoints on source lines, to make errors easier to catch, and so on.

When you build optimized code targeting x86-64, all of this noise goes away. The code becomes much more efficient and, consequently, much easier to interpret and understand.

Here is a function that performs the required operation. I wrote it as a function so that I could pass the inputs as opaque parameters that the compiler cannot optimize away.

    float DivideAsFloat(uint32_t *ptr1, uint32_t *ptr2)
    {
        return (float)(*ptr1) / (float)(*ptr2);
    }

Here is the object code that all modern versions of GCC (going back to 4.9.0) generate for this function:

    DivideAsFloat(unsigned int*, unsigned int*):
        mov       eax, DWORD PTR [rdi]   ; retrieve value of 'ptr1' parameter
        pxor      xmm0, xmm0             ; zero-out xmm0 register
        pxor      xmm1, xmm1             ; zero-out xmm1 register
        cvtsi2ssq xmm0, rax              ; convert *ptr1 into a floating-point value in XMM0
        mov       eax, DWORD PTR [rsi]   ; retrieve value of 'ptr2' parameter
        cvtsi2ssq xmm1, rax              ; convert *ptr2 into a floating-point value in XMM1
        divss     xmm0, xmm1             ; divide the two floating-point values
        ret

This is almost exactly what you'd expect to see. The only "black magic" here is the PXOR instructions. Why is the compiler zeroing out the XMM registers before executing CVTSI2SS, which is just about to clobber them? Well, because CVTSI2SS only partially overwrites its destination register. Specifically, it writes only the lower bits, leaving the upper bits intact. This creates a false dependency on the upper bits, which can cause execution stalls. That dependency can be broken by pre-zeroing the register, which eliminates the possibility of a stall and speeds up execution. The PXOR instruction is a quick, efficient way to clear a register. (I recently discussed this exact same phenomenon here — see the last paragraph.)

In fact, older versions of GCC (prior to 4.9.0) did not perform this optimization, and therefore generated code that does not include the PXOR instructions. It looks more efficient, but it actually runs slower:

    DivideAsFloat(unsigned int*, unsigned int*):
        mov       eax, DWORD PTR [rdi]   ; retrieve value of 'ptr1' parameter
        cvtsi2ssq xmm0, rax              ; convert *ptr1 into a floating-point value in XMM0
        mov       eax, DWORD PTR [rsi]   ; retrieve value of 'ptr2' parameter
        cvtsi2ssq xmm1, rax              ; convert *ptr2 into a floating-point value in XMM1
        divss     xmm0, xmm1             ; divide the two floating-point values
        ret

Clang 3.9 emits the same code as these older versions of GCC; it doesn't know about this optimization either. MSVC does know about it (since VS 2010), as do modern versions of ICC (tested on ICC 16 and later; ICC 13 doesn't do it).

None of this is to say, however, that Anty's answer (and Mystical's comment) is wrong. CVTSI2SS is indeed designed to convert a signed integer to a scalar single-precision float, not an unsigned integer like you have here. So what gives? Well, a 64-bit processor has 64-bit registers available, so the unsigned 32-bit input values can be stored as signed 64-bit intermediate values, which allows CVTSI2SS to be used after all.
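In C terms, the 64-bit trick is nothing more than a widening cast before the conversion (a sketch, with my own function name):

    #include <stdint.h>

    /* Every uint32_t value fits in the non-negative range of int64_t, so the
     * signed 64-bit conversion (the cvtsi2ssq in the listings above) is exact. */
    float uint32_to_float(uint32_t x)
    {
        int64_t wide = (int64_t)x;   /* always non-negative */
        return (float)wide;
    }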

Compilers do this when optimization is enabled because it leads to more efficient code. If, on the other hand, you were targeting 32-bit x86 and had no 64-bit registers available, the compiler would have to deal with the signed-versus-unsigned problem another way. Here's how GCC 6.3 deals with it:

    DivideAsFloat(unsigned int*, unsigned int*):
        sub      esp, 4
        pxor     xmm0, xmm0
        mov      eax, DWORD PTR [esp+8]
        pxor     xmm1, xmm1
        movss    xmm3, 1199570944
        pxor     xmm2, xmm2
        mov      eax, DWORD PTR [eax]
        movzx    edx, ax
        shr      eax, 16
        cvtsi2ss xmm0, eax
        mov      eax, DWORD PTR [esp+12]
        cvtsi2ss xmm1, edx
        mov      eax, DWORD PTR [eax]
        movzx    edx, ax
        shr      eax, 16
        cvtsi2ss xmm2, edx
        mulss    xmm0, xmm3
        addss    xmm0, xmm1
        pxor     xmm1, xmm1
        cvtsi2ss xmm1, eax
        mulss    xmm1, xmm3
        addss    xmm1, xmm2
        divss    xmm0, xmm1
        movss    DWORD PTR [esp], xmm0
        fld      DWORD PTR [esp]
        add      esp, 4
        ret

This is a little hard to follow because of the way the optimizer rearranged and interleaved the instructions. So I've "de-optimized" it, reordering the instructions and breaking them into more logical groups, in the hope of making the flow of execution easier to track. (The only instructions I removed were the dependency-breaking PXORs — the rest of the code is the same, just rearranged.)

    DivideAsFloat(unsigned int*, unsigned int*):
        ;;; Initialization ;;;
        sub      esp, 4                    ; reserve 4 bytes on the stack
        pxor     xmm0, xmm0                ; zero-out XMM0
        pxor     xmm1, xmm1                ; zero-out XMM1
        pxor     xmm2, xmm2                ; zero-out XMM2
        movss    xmm3, 1199570944          ; load a constant into XMM3

        ;;; Deal with the first value ('ptr1') ;;;
        mov      eax, DWORD PTR [esp+8]    ; get the pointer specified in 'ptr1'
        mov      eax, DWORD PTR [eax]      ; dereference the pointer specified by 'ptr1'
        movzx    edx, ax                   ; put the lower 16 bits of *ptr1 in EDX
        shr      eax, 16                   ; move the upper 16 bits of *ptr1 down to the lower 16 bits in EAX
        cvtsi2ss xmm0, eax                 ; convert the upper 16 bits of *ptr1 to a float
        cvtsi2ss xmm1, edx                 ; convert the lower 16 bits of *ptr1 (now in EDX) to a float
        mulss    xmm0, xmm3                ; multiply FP-representation of upper 16 bits of *ptr1 by magic number
        addss    xmm0, xmm1                ; add the result to the FP-representation of *ptr1's lower 16 bits

        ;;; Deal with the second value ('ptr2') ;;;
        mov      eax, DWORD PTR [esp+12]   ; get the pointer specified in 'ptr2'
        mov      eax, DWORD PTR [eax]      ; dereference the pointer specified by 'ptr2'
        movzx    edx, ax                   ; put the lower 16 bits of *ptr2 in EDX
        shr      eax, 16                   ; move the upper 16 bits of *ptr2 down to the lower 16 bits in EAX
        cvtsi2ss xmm2, edx                 ; convert the lower 16 bits of *ptr2 (now in EDX) to a float
        cvtsi2ss xmm1, eax                 ; convert the upper 16 bits of *ptr2 to a float
        mulss    xmm1, xmm3                ; multiply FP-representation of upper 16 bits of *ptr2 by magic number
        addss    xmm1, xmm2                ; add the result to the FP-representation of *ptr2's lower 16 bits

        ;;; Do the division, and return the result ;;;
        divss    xmm0, xmm1                ; FINALLY, divide the FP-representation of *ptr1 by *ptr2
        movss    DWORD PTR [esp], xmm0     ; store this result onto the stack, in the memory we reserved
        fld      DWORD PTR [esp]           ; load this result onto the top of the x87 FPU
                                           ; (the 32-bit calling convention returns floating-point values this way)
        add      esp, 4                    ; clean up the space we allocated on the stack
        ret

Note that the strategy here is to split each 32-bit unsigned integer into its two 16-bit halves. The upper half is converted to a floating-point representation and multiplied by a magic number to move it back into position (the constant 1199570944 is 0x47800000, the bit pattern of the float 65536.0). Then the lower half is converted to a floating-point representation, and the two floating-point representations (one for each 16-bit half of the original 32-bit value) are added together. This is done twice, once for each 32-bit input value (note the two "groups" of instructions). Finally, the two resulting floating-point representations are divided, and the result is returned.
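In C, that strategy looks roughly like this (a sketch with my own function name; 65536.0f is the "magic number" from the listing):

    #include <stdint.h>

    /* Each 16-bit half fits comfortably in the signed range, so plain
     * cvtsi2ss works on both; multiplying the upper half by 2^16 puts it
     * back in position before the halves are recombined. */
    float uint32_to_float_halves(uint32_t x)
    {
        int32_t lo = (int32_t)(x & 0xFFFF);   /* movzx edx, ax */
        int32_t hi = (int32_t)(x >> 16);      /* shr   eax, 16 */
        return (float)hi * 65536.0f + (float)lo;
    }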

The logic is similar to what the unoptimized code was doing, but… well, more optimal. In particular, redundant instructions are removed, and the algorithm is generalized so that no branch on the sign bit is needed. That speeds things up, because mispredicted branches are slow.

Note that Clang uses a slightly different strategy here, and manages to generate even more optimal code than GCC:

    DivideAsFloat(unsigned int*, unsigned int*):
        push     eax                                 ; reserve 4 bytes on the stack
        mov      eax, DWORD PTR [esp+12]             ; get the pointer specified in 'ptr2'
        mov      ecx, DWORD PTR [esp+8]              ; get the pointer specified in 'ptr1'
        movsd    xmm1, QWORD PTR 4841369599423283200 ; load a constant into XMM1
        movd     xmm0, DWORD PTR [ecx]               ; dereference the pointer specified by 'ptr1',
                                                     ; and load the bits directly into XMM0
        movd     xmm2, DWORD PTR [eax]               ; dereference the pointer specified by 'ptr2',
                                                     ; and load the bits directly into XMM2
        orpd     xmm0, xmm1                          ; bitwise-OR *ptr1's raw bits with the magic number
        orpd     xmm2, xmm1                          ; bitwise-OR *ptr2's raw bits with the magic number
        subsd    xmm0, xmm1                          ; subtract the magic number from the result of the OR
        subsd    xmm2, xmm1                          ; subtract the magic number from the result of the OR
        cvtsd2ss xmm0, xmm0                          ; convert *ptr1 from double-precision to single-precision in place
        xorps    xmm1, xmm1                          ; zero the register to break dependencies
        cvtsd2ss xmm1, xmm2                          ; convert *ptr2 from double-precision to single-precision,
                                                     ; putting the result in XMM1
        divss    xmm0, xmm1                          ; FINALLY, do the division on the single-precision FP values
        movss    DWORD PTR [esp], xmm0               ; store this result onto the stack, in the memory we reserved
        fld      DWORD PTR [esp]                     ; load this result onto the top of the x87 FPU
                                                     ; (the 32-bit calling convention returns floating-point values this way)
        pop      eax                                 ; clean up the space we allocated on the stack
        ret

It doesn't even use the CVTSI2SS instruction! Instead, it loads the raw integer bits and does some bit-twiddling magic on them so that they can be treated as a double-precision floating-point value. (The magic constant 4841369599423283200 is 0x4330000000000000 — the bit pattern of the double 2^52. OR-ing a 32-bit integer into its mantissa and then subtracting 2^52 yields exactly that integer as a double.) After a bit more twiddling, it uses CVTSD2SS to convert each of these double-precision floating-point values to single precision. Finally, it divides the two single-precision floating-point values and arranges for the result to be returned.
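Here is a sketch of the same trick in C (the function name and the use of memcpy for the bit reinterpretation are mine):

    #include <stdint.h>
    #include <string.h>

    /* Plant the 32-bit integer in the mantissa of the double 2^52, then
     * subtract 2^52: for 0 <= x < 2^32 the result is exactly x as a double,
     * which one cvtsd2ss then narrows to float. */
    float uint32_to_float_bittrick(uint32_t x)
    {
        const uint64_t magic_bits = 0x4330000000000000ull; /* bit pattern of 2^52 */
        uint64_t bits = magic_bits | (uint64_t)x;          /* bits of (2^52 + x)  */

        double d, magic;
        memcpy(&d, &bits, sizeof d);                       /* reinterpret as double */
        memcpy(&magic, &magic_bits, sizeof magic);         /* magic == 2^52 */

        return (float)(d - magic);                         /* == (float)x */
    }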

So yes, when targeting 32-bit, compilers do have to deal with the difference between signed and unsigned integers — but different compilers deal with it using different strategies, some more optimal than others. And that is why studying optimized code is so much more enlightening, in addition to the fact that it is what will actually run on your users' machines.
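If you want to reproduce these listings yourself, a minimal harness might look like this (the file name and test values are mine); GCC's -S -masm=intel flags dump Intel-syntax assembly:

    /* div_as_float.c -- inspect the generated code with, e.g.:
     *   gcc -O2 -S -masm=intel div_as_float.c        (64-bit)
     *   gcc -m32 -O2 -S -masm=intel div_as_float.c   (32-bit, if multilib is installed)
     */
    #include <stdint.h>
    #include <stdio.h>

    float DivideAsFloat(uint32_t *ptr1, uint32_t *ptr2)
    {
        return (float)(*ptr1) / (float)(*ptr2);
    }

    int main(void)
    {
        uint32_t a = 3000000000u;   /* above 0x7FFFFFFF, so the signedness matters */
        uint32_t b = 2u;
        printf("%f\n", DivideAsFloat(&a, &b));   /* prints 1500000000.000000 */
        return 0;
    }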

+8
