The float type uses the same number of bits as int (32 bits), but it has to represent floating-point numbers over a much larger range than the one int covers with integers alone. This costs precision: not every int value can be represented exactly by a float. Only 24 bits are used for the fraction part of the number (23 fraction bits plus the sign bit), while the remaining 8 are used for the exponent.
If you assign the same int value to a double, there is no loss of precision, since double has 64 bits, and 52 of them (more than 32) are used for the fraction.
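The difference between the two target types can be checked with a short round trip. This is a minimal sketch, assuming the language is Java; the class and method names (IntToFloatDemo, viaFloat, viaDouble) are made up for illustration.

```java
final class IntToFloatDemo {
    // Converts an int to float and back, so any rounding becomes visible.
    static int viaFloat(int value) {
        return (int) (float) value;
    }

    // Same round trip through double, which has 52 fraction bits.
    static int viaDouble(int value) {
        return (int) (double) value;
    }

    public static void main(String[] args) {
        int i = 123456789;
        System.out.println(viaFloat(i));  // 123456792
        System.out.println(viaDouble(i)); // 123456789
    }
}
```

The float round trip comes back changed, while the double round trip returns the original value.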
Here is a more detailed explanation:
The binary representation of 123456789 as an int is:
00000111 01011011 11001101 00010101
A single-precision floating-point number is constructed from 32 bits using the following formula:
(-1)^sign * 1.b22 b21 ... b0 * 2^(e-127)
Here sign is the most significant bit (b31), bits b22 to b0 are the fraction bits, and bits b30 to b23 are the exponent e.
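To see the three fields of the formula on a concrete value, the bit pattern of a float can be pulled apart and put back together. This is a sketch, assuming Java: Float.floatToIntBits is the standard library call, while the class FloatFields and its helper methods are hypothetical names. The rebuild helper only covers normal numbers (exponent field strictly between 0 and 255).

```java
final class FloatFields {
    // Bit b31: the sign.
    static int sign(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // Bits b30 to b23: the biased exponent e.
    static int exponent(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    // Bits b22 to b0: the fraction.
    static int fraction(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    // Applies (-1)^sign * 1.fraction * 2^(e-127); normal numbers only.
    static double rebuild(float f) {
        double significand = 1.0 + fraction(f) / (double) (1 << 23);
        double value = significand * Math.pow(2, exponent(f) - 127);
        return sign(f) == 0 ? value : -value;
    }

    public static void main(String[] args) {
        float f = 123456789; // implicit int-to-float conversion
        System.out.println(sign(f));     // 0
        System.out.println(exponent(f)); // 153, so the scale is 2^26
        System.out.println(fraction(f)); // 7043491
        System.out.println(rebuild(f));  // 1.23456792E8
    }
}
```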
Therefore, when you convert the int 123456789 to float, you can only use the following 25 bits:
00000111 01011011 11001101 00010101
-    --- -------- -------- -----
We can safely drop the leading zeros (except for the sign bit) and any trailing zeros. That leaves the 3 least significant bits, which do not fit and have to be zeroed. We can either subtract 5 to get 123456784:
00000111 01011011 11001101 00010000
-    --- -------- -------- -----
or add 3 to get 123456792:
00000111 01011011 11001101 00011000
-    --- -------- -------- -----
Clearly, adding 3 gives the better approximation.
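The two candidates and their distances can be computed directly. This is a rough sketch, assuming Java and a positive value with 27 significant bits, so that exactly 3 low bits have to be cleared; the class NearestFloat and its methods are hypothetical helpers, not library functions.

```java
final class NearestFloat {
    // Clears the low `dropped` bits: the candidate at or below `value`.
    static int below(int value, int dropped) {
        return value & ~((1 << dropped) - 1);
    }

    // The next candidate above below(value, dropped), one step of 2^dropped higher.
    static int above(int value, int dropped) {
        return below(value, dropped) + (1 << dropped);
    }

    public static void main(String[] args) {
        int i = 123456789;
        int lower = below(i, 3); // 123456784
        int upper = above(i, 3); // 123456792
        System.out.println(i - lower); // 5
        System.out.println(upper - i); // 3
        // The int-to-float conversion rounds to the nearest candidate.
        System.out.println((int) (float) i == upper); // true
    }
}
```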