Sign changes when converting from int to float and back

Consider the following code, which is an SSCCE of my actual problem:

    #include <iostream>

    int roundtrip(int x)
    {
        return int(float(x));
    }

    int main()
    {
        int a = 2147483583;
        int b = 2147483584;
        std::cout << a << " -> " << roundtrip(a) << '\n';
        std::cout << b << " -> " << roundtrip(b) << '\n';
    }

The output on my computer (Xubuntu 12.04.3 LTS):

    2147483583 -> 2147483520
    2147483584 -> -2147483648

Notice how the positive number b ends up negative after the round trip. Is this behavior well-defined? I would expect the int-to-float round trip to at least preserve the sign correctly ...

Hmm, on ideone the result is different:

    2147483583 -> 2147483520
    2147483584 -> 2147483647

Is one of the g++ versions buggy, or are both outputs valid?

+42
c++ floating-point type-conversion ieee-754 twos-complement
Dec 08 '13 at
2 answers

Your program invokes undefined behavior because of the overflow in the floating-point-to-integer conversion. What you see is merely the usual symptom on x86 processors.

The float value closest to 2147483584 is exactly 2^31 (the conversion from integer to floating point generally rounds to nearest, which can be up, and is up in this case. To be specific, the behavior when converting from integer to floating point is implementation-defined; most implementations define the rounding as "according to the FPU rounding mode", and the FPU's default rounding mode is round-to-nearest).

Then, when converting from the float representing 2^31 to int, an overflow occurs. This overflow is undefined behavior. Some processors raise an exception, others saturate. The IA-32 instruction cvttsd2si usually generated by compilers always returns INT_MIN on overflow, regardless of whether the float is positive or negative.

You should not rely on this behavior even if you know you are targeting an Intel processor: when targeting x86-64, compilers can emit, for the conversion from floating point to integer, sequences of instructions that take advantage of the undefined behavior to return results other than what you might otherwise expect for the destination integer type.
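One way to sidestep the undefined behavior is to range-check the float before converting. This is only a sketch of the idea, not part of the original program; the helper name and the NaN policy are my own choices:

```cpp
#include <cassert>
#include <climits>
#include <cmath>

// Hypothetical helper: convert float to int, saturating instead of
// invoking undefined behavior on overflow.
int float_to_int_saturating(float f)
{
    if (std::isnan(f)) return 0;             // pick some policy for NaN
    // 2147483648.0f is exactly 2^31; any float >= it overflows int.
    if (f >= 2147483648.0f) return INT_MAX;
    if (f < -2147483648.0f) return INT_MIN;  // -2^31 itself still fits
    return static_cast<int>(f);              // now in range, well-defined
}
```

With this guard, `float_to_int_saturating(float(2147483584))` returns 2147483647 on every platform, matching the ideone output rather than the x86 INT_MIN artifact.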

+69
Dec 08 '13 at

Pascal's answer is fine, but it lacks details that some users might be interested in ;-). If you want to know how it looks at a lower level (assuming the coprocessor, not software, handles the floating-point operations), read on.

A 32-bit float (IEEE 754) can store exactly every integer in the range [-2^24 ... 2^24]. Integers outside that range may also have an exact representation as a float, but not all of them do. The problem is that you only have 24 significant bits to play with in a float.
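You can see that boundary directly with the same round trip as in the question. 2^24 = 16777216 survives; 2^24 + 1 is the first integer that does not (it rounds back down to 2^24):

```cpp
#include <cassert>

// Same roundtrip as in the question: every int with magnitude up to 2^24
// survives int -> float -> int exactly; 16777217 = 2^24 + 1 does not.
int roundtrip(int x) { return static_cast<int>(static_cast<float>(x)); }
```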

This is how the int -> float conversion usually looks at the low level:

    fild dword ptr [your int]
    fstp dword ptr [your float]

This is just a sequence of two coprocessor instructions. The first loads the 32-bit int onto the coprocessor stack, converting it to an 80-bit wide float.

Intel® 64 and IA-32 Architectures Software Developer's Manual

(Programming with the x87 FPU):

When floating-point, integer, or packed BCD integer values are loaded from memory into any of the x87 FPU data registers, the values are automatically converted into double extended-precision floating-point format (if they are not already in that format).

Since FPU registers hold 80-bit wide floats, fild poses no problem: a 32-bit int fits perfectly into the 64-bit significand of that floating-point format.

So far so good.

The second part, fstp, is a bit more complicated and can be surprising. It is supposed to store the 80-bit floating-point value into a 32-bit float. Although we are dealing purely with integer values (in the question), the coprocessor may actually perform rounding. Huh? How do you round an integer value, even one stored in floating-point format? ;-)

I will explain that shortly - first let's see what rounding modes the x87 provides (they are the embodiment of the IEEE 754 rounding modes). The x87 FPU has 4 rounding modes, controlled by bits #10 and #11 of the FPU control word:

  • 00 - round to nearest even - the rounded result is the one closest to the infinitely precise result. If two values are equally close, the result is the even one (i.e., the one with a least significant bit of zero). The default.
  • 01 - round toward -Inf
  • 10 - round toward +Inf
  • 11 - round toward 0 (i.e., truncate)

You can play with the rounding mode using this simple code (it can be done differently, but the low-level way is shown here):

    enum ROUNDING_MODE
    {
        RM_TO_NEAREST  = 0x00,
        RM_TOWARD_MINF = 0x01,
        RM_TOWARD_PINF = 0x02,
        RM_TOWARD_ZERO = 0x03 // TRUNCATE
    };

    void set_round_mode(enum ROUNDING_MODE rm)
    {
        short csw;
        short tmp = rm;

        _asm
        {
            push ax
            fstcw [csw]
            mov ax, [csw]
            and ax, ~(3<<10)
            shl [tmp], 10
            or ax, tmp
            mov [csw], ax
            fldcw [csw]
            pop ax
        }
    }
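The inline assembly above is MSVC/x86-specific. A portable sketch of the same idea uses the standard &lt;cfenv&gt; header. Two caveats, so treat this as an assumption-laden illustration: whether an int -&gt; float conversion honors the dynamic rounding mode is implementation-defined, and the compiler may constant-fold the conversion, which is why the value goes through a volatile:

```cpp
#include <cfenv>

// Convert an int to float under a chosen rounding mode (FE_TONEAREST,
// FE_TOWARDZERO, FE_UPWARD, FE_DOWNWARD), then restore the old mode.
float convert_with_mode(int value, int mode)
{
    const int old = std::fegetround();
    std::fesetround(mode);
    volatile int v = value;              // keep the conversion at run time
    float result = static_cast<float>(v);
    std::fesetround(old);
    return result;
}
```

On a typical x86 compiler this reproduces the effects discussed below: `convert_with_mode(67108871, FE_TOWARDZERO)` yields 67108864.0f, while `FE_TONEAREST` yields 67108872.0f.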

OK, but still, how does this relate to integer values? Patience ... To understand why rounding modes can be involved in the int to float conversion, consider the most obvious way to convert an int to a float - truncation (not the default). It could work like this:

  • record the sign
  • negate your int if it is less than zero
  • find the position of the leading 1
  • shift your int right/left so that the 1 found above ends up at bit #23
  • record the number of shifts made during the process so you can compute the exponent

Code that mimics this behavior might look like this:

    float int2float(int value)
    {
        // handles all values from [-2^24...2^24]
        // outside this range only some integers may be represented exactly
        // this method will use truncation 'rounding mode' during conversion

        // we can safely reinterpret it as 0.0
        if (value == 0) return 0.0;

        if (value == (1U<<31)) // ie -2^31
        {
            // -(-2^31) = -2^31 so we'll not be able to handle it below - use const
            value = 0xCF000000;
            return *((float*)&value);
        }

        int sign = 0;

        // handle negative values
        if (value < 0)
        {
            sign = 1U << 31;
            value = -value;
        }

        // although right shift of signed is undefined - all compilers (that I
        // know) do arithmetic shift (copies sign into MSB), which is what I
        // prefer here - hence using unsigned abs_value_copy for the shifts
        unsigned int abs_value_copy = value;

        // find leading one
        int bit_num = 31;
        int shift_count = 0;

        for (; bit_num > 0; bit_num--)
        {
            if (abs_value_copy & (1U<<bit_num))
            {
                if (bit_num >= 23)
                {
                    // need to shift right
                    shift_count = bit_num - 23;
                    abs_value_copy >>= shift_count;
                }
                else
                {
                    // need to shift left
                    shift_count = 23 - bit_num;
                    abs_value_copy <<= shift_count;
                }
                break;
            }
        }

        // exponent is biased by 127
        int exp = bit_num + 127;

        // clear leading 1 (bit #23) (it will implicitly be there but not stored)
        int coeff = abs_value_copy & ~(1<<23);

        // move exp to the right place
        exp <<= 23;

        int ret = sign | exp | coeff;
        return *((float*)&ret);
    }

Now an example: this is how truncation converts 2147483583 to 2147483520.

 2147483583 = 01111111_11111111_11111111_10111111 

During the int-> float conversion, you must shift the left 1 to bit # 23. Now the leading 1 is at bit # 30. To put it in bit No. 23, you must shift right by 7 positions. During this, you lose (they will not fit into the 32-bit float format). 7 bits lsb on the right (you truncate / break). These were:

 01111111 = 63 

And 63 is exactly what the original number lost:

 2147483583 -> 2147483520 + 63 
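The shift arithmetic in this walk-through can be double-checked with plain integer operations (this snippet is just a sanity check, not part of the answer's conversion code):

```cpp
// Dropping the 7 least significant bits of 2147483583 is exactly what the
// truncating int -> float conversion does to its significand.
unsigned kept    = (2147483583u >> 7) << 7;  // the value that survives: 2147483520
unsigned dropped = 2147483583u & 0x7Fu;      // the 7 lost bits: 0111111 = 63
```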

Truncation is easy, but it may not necessarily be what you want and/or what is best in all cases. Consider the example below:

 67108871 = 00000100_00000000_00000000_00000111 

The value above cannot be represented exactly as a float, but watch what truncation does to it. As before, we need to move the leading 1 to bit #23. That requires shifting the value right by exactly 3 positions, losing the 3 LSBs (from now on I will write the numbers differently, showing where float's implicit 24th bit sits and bracketing the 23 explicitly stored significand bits):

 00000001.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out) 

Truncation cuts off the 3 trailing bits, leaving us with 67108864 (and indeed 67108864 + 7, the 3 cut-off bits, = 67108871 - remember that we compensate for the shifts with exponent manipulation, omitted here).

Is that good enough? Hey, 67108872 is perfectly representable in a 32-bit float and is surely a much better fit than 67108864, right? CORRECT, and this is where rounding enters the int to 32-bit-float conversion.

Now let's see how the default round-to-nearest-even mode works and what its consequences are in the OP's case. Consider the same example once more:

 67108871 = 00000100_00000000_00000000_00000111 

As we know, we need 3 right shifts to put the leading 1 at bit #23:

 00000000_1.[0000000_00000000_00000000] 111 * 2^26 (3 bits shifted out) 

The round-to-nearest-even procedure involves finding the two numbers that bracket the input value 67108871 from below and above as closely as possible. Keep in mind that we are still working inside the 80-bit FPU, so although I show bits being shifted out, they are still present in the FPU register; they are only dropped during the rounding that accompanies storing the output value.

The two values that closely bracket 00000000_1.[0000000_00000000_00000000] 111 * 2^26 are:

on top:

  00000000_1.[0000000_00000000_00000000] 111 * 2^26 +1 = 00000000_1.[0000000_00000000_00000001] * 2^26 = 67108872 

and below:

  00000000_1.[0000000_00000000_00000000] * 2^26 = 67108864 

Obviously 67108872 is much closer to 67108871 than 67108864 is, so converting the 32-bit int value 67108871 yields 67108872 (in round-to-nearest-even mode).
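You can confirm this result in one line; under the default round-to-nearest mode the conversion goes up to the closer neighbor:

```cpp
// 67108871 is not representable as a float; round-to-nearest picks 67108872,
// not the truncated 67108864.
float f = static_cast<float>(67108871);
```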

Now the OP's numbers (still round-to-nearest-even):

  2147483583 = 01111111_11111111_11111111_10111111 = 00000000_1.[1111111_11111111_11111111] 0111111 * 2^30 

The bracketing values:

top:

 00000000_1.[1111111_11111111_11111111] 0111111 * 2^30 +1 = 00000000_10.[0000000_00000000_00000000] * 2^30 = 00000000_1.[0000000_00000000_00000000] * 2^31 = 2147483648 

bottom:

 00000000_1.[1111111_11111111_11111111] * 2^30 = 2147483520 

Keep in mind that the word "even" in "round to nearest even" only has meaning when the input value lies exactly halfway between the bracketing values. Only then does "even" matter and "decide" which bracketing value to choose. In the case above it does not come into play, and we simply pick the closer value, which is 2147483520.

The OP's last case shows the situation where the word "even" does come into play:

  2147483584 = 01111111_11111111_11111111_11000000 = 00000000_1.[1111111_11111111_11111111] 1000000 * 2^30 

The bracketing values are the same as before:

top: 00000000_1.[0000000_00000000_00000000] * 2^31 = 2147483648

bottom: 00000000_1.[1111111_11111111_11111111] * 2^30 = 2147483520

Now there is no closer value (2147483648 - 2147483584 = 64 = 2147483584 - 2147483520), so we must rely on "even" and choose the upper, even value: 2147483648.
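Both halfway cases are easy to verify (a small check I added; the comparison goes through double so the float result can be inspected exactly):

```cpp
// 2147483584 ties between 2147483520 and 2147483648; the tie goes UP to the
// "even" candidate 2^31. Near 2^26 the float spacing is 8, so 67108868 ties
// between 67108864 (even significand) and 67108872 (odd) and rounds DOWN.
double up   = static_cast<float>(2147483584LL);  // expected 2147483648.0
double down = static_cast<float>(67108868);      // expected 67108864.0
```

So "ties to even" can round up or down; which way it goes depends only on which neighbor has the even significand.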

And here lies the OP's problem, the one Pascal described briefly. The FPU works only with signed values, and 2147483648 cannot be stored in a signed int, whose maximum value is 2147483647; hence the trouble.

A simple proof (without quoting documentation) that the FPU works only with signed values, i.e. treats every value as signed, is to step through this in a debugger:

    unsigned int test = (1u << 31);
    _asm
    {
        fild [test]
    }

Although it might seem that the test value should be treated as unsigned, it will be loaded as -2^31, since the FPU has no separate instructions for loading signed and unsigned values. Likewise, you will not find an instruction that stores an unsigned value from the FPU to memory. Everything is just a bit pattern treated as signed, no matter how you declared it in your program.

This was long, but I hope someone learns something from it.

+10
Dec 08 '13 at 15:22