First, a paper you should read if you want to better understand the pitfalls of floating point: "What Every Computer Scientist Should Know About Floating-Point Arithmetic," http://www.validlab.com/goldberg/paper.pdf
And now for some meat.
The following code is bare-bones and attempts to create an IEEE-754 single-precision float from an unsigned int in the range 0 < value < 2^24. This is the format you are most likely to encounter on modern hardware, and it is the format you seem to be referring to in your original question.
IEEE-754 single-precision floats are divided into three fields: a single sign bit, 8 exponent bits, and 23 significand bits (sometimes called the mantissa). IEEE-754 uses a hidden 1 significand, meaning that the significand is actually 24 bits in total. The bits are packed left to right, with the sign bit in bit 31, the exponent in bits 30 .. 23, and the significand in bits 22 .. 0. The diagram in the Wikipedia article linked below illustrates the layout.

The exponent has a bias of 127, meaning that the actual exponent associated with the floating-point number is 127 less than the value stored in the exponent field. An exponent of 0 is therefore stored as 127.
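As a quick check of the bias, here is a minimal sketch in C (the helper name and the memcpy-based reinterpretation are my own choices, not from the question):

```c
#include <string.h>

// Encode exponent 0 with an all-zero significand field: the stored
// exponent is 0 + 127 = 127, giving the bit pattern 0x3F800000,
// which is 1.0 x 2^0 = 1.0.
static float one_from_bias(void) {
    unsigned int bits = (0u + 127u) << 23;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Calling one_from_bias() yields 1.0f.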
(Note: the full Wikipedia article may be of interest to you. Link: http://en.wikipedia.org/wiki/Single_precision_floating-point_format )
Therefore, IEEE-754 0x40000000 is interpreted as follows:
- Bit 31 = 0: Positive value
- Bits 30 .. 23 = 0x80: Exponent = 128 - 127 = 1 (i.e., 2^1)
- Bits 22 .. 0 = all 0: Significand = 1.00000000_00000000_0000000 (note: I restored the hidden 1)
So the value is 1.0 x 2^1 = 2.0.
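The breakdown above can be verified mechanically; here is a minimal sketch in C (the helper names are mine, not from the question):

```c
#include <string.h>

// Extract the three fields of an IEEE-754 single-precision pattern.
static unsigned int sign_of(unsigned int bits) { return bits >> 31; }
static int exponent_of(unsigned int bits) {
    return (int)((bits >> 23) & 0xFFu) - 127;  // remove the bias of 127
}
static unsigned int significand_of(unsigned int bits) {
    return bits & 0x7FFFFFu;  // hidden 1 not included
}

// Reinterpret the whole pattern as a float.
static float value_of(unsigned int bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For 0x40000000 these give sign 0, exponent 1, significand 0, and value 2.0f, matching the breakdown above.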
To convert an unsigned int in the restricted range given above into IEEE-754 format, you can use a function like the one below. It performs the following steps:
- Aligns the leading 1 of the integer with the hidden-1 position in the floating-point representation.
- While aligning the integer, records the total number of shifts performed.
- Masks off the hidden 1.
- Using the number of shifts performed, computes the exponent and ORs it into the bit pattern.
- Using reinterpret_cast, converts the resulting bit pattern into a float. This part is an ugly hack, because it type-puns through a pointer. You can also do it by abusing a union. Some platforms provide intrinsics (e.g. _itof) to make this reinterpretation less ugly.
There are much faster ways to do this; this one is meant to be pedagogically useful, if not terribly efficient:
float uint_to_float(unsigned int significand) {
    // Only support 0 < significand < 2^24.
    if (significand == 0 || significand >= (1u << 24)) return -1.0f;

    // Align the leading 1 of the significand with the hidden-1
    // position, counting the shifts required.
    int shifts = 0;
    while ((significand & (1u << 23)) == 0) {
        significand <<= 1;
        shifts++;
    }

    // The value 1 needs 23 shifts and has exponent 0; the value 2
    // needs 22 shifts and has exponent 1. So the exponent is 23
    // minus the number of shifts performed.
    unsigned int exponent = (unsigned int)(23 - shifts);

    // Mask off the hidden 1.
    significand &= ~(1u << 23);

    // Assemble the biased exponent and the significand, then
    // reinterpret the bit pattern as a float.
    unsigned int bits = ((exponent + 127) << 23) | significand;
    return *(float *)&bits;
}
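If you want to avoid the pointer cast, a strictly safer sketch of the reinterpretation step (using memcpy instead of the union or _itof approaches mentioned above; the helper name is my own) is:

```c
#include <string.h>

// Reinterpret a 32-bit pattern as a float without violating
// strict-aliasing rules; compilers turn this into a plain move.
static float bits_to_float(unsigned int bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

For example, bits_to_float(0x40000000u) yields 2.0f, as in the worked example above.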
You can make this process more efficient by using instructions that find the leading 1 in a word. (They go by names like clz, for "count leading zeros", or norm, for "normalize".)
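As a sketch of that faster approach, assuming GCC/Clang's __builtin_clz intrinsic (other compilers spell this differently):

```c
#include <string.h>

// Same conversion as above, but using a count-leading-zeros
// intrinsic instead of a shift loop. Assumes GCC or Clang.
static float uint_to_float_clz(unsigned int significand) {
    if (significand == 0 || significand >= (1u << 24)) return -1.0f;

    // __builtin_clz counts zeros above the leading 1 in a 32-bit
    // word, so the leading 1 sits at bit (31 - clz). That bit
    // position is also the unbiased exponent.
    int leading = 31 - __builtin_clz(significand);
    unsigned int exponent = (unsigned int)leading;
    unsigned int aligned = significand << (23 - leading);

    unsigned int bits = ((exponent + 127) << 23) | (aligned & 0x7FFFFFu);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

The shift loop from the pedagogical version collapses into a single instruction on most hardware.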
You can also extend this to signed integers by recording the sign, taking the absolute value of the integer, performing the steps above, and then placing the sign in bit 31 of the result.
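A sketch of that signed extension, with the conversion inlined so it stands alone (the bounds check and the function name are my own):

```c
#include <string.h>

// Convert a signed int with magnitude in (0, 2^24): record the
// sign, convert the magnitude exactly as before, then place the
// sign in bit 31 of the result.
static float int_to_float(int value) {
    if (value == 0) return 0.0f;

    unsigned int sign = (value < 0) ? (1u << 31) : 0;
    unsigned int magnitude = (value < 0) ? (unsigned int)(-value)
                                         : (unsigned int)value;
    if (magnitude >= (1u << 24)) return -1.0f;  // out of exact range

    // Align the leading 1 with the hidden-1 position.
    int shifts = 0;
    while ((magnitude & (1u << 23)) == 0) {
        magnitude <<= 1;
        shifts++;
    }
    unsigned int exponent = (unsigned int)(23 - shifts);

    unsigned int bits = sign
                      | ((exponent + 127) << 23)
                      | (magnitude & 0x7FFFFFu);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```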
For integers >= 2^24, the entire integer does not fit into the significand field of a 32-bit float. This is why rounding is needed: you lose LSBs to make the value fit. Thus, multiple integers end up mapped to the same floating-point pattern. The exact mapping depends on the rounding mode (round toward -Inf, round toward +Inf, round toward zero, round to nearest even). But the point is that you cannot shove 24 bits into fewer than 24 bits without some loss.
You can see this from the perspective of the code above. It works by aligning the leading 1 with the hidden-1 position. If a value were >= 2^24, the code would need to shift right instead of left, which necessarily shifts LSBs off the end. Rounding modes just tell you how to handle the bits that get shifted out.