Convert Int to Float / Float to Int Using Bitwise Operators

I was wondering if the process of converting an integer to a float or float to an integer can be explained. For my class, we should do this using only bitwise operators, but I think a solid understanding of type-to-type casting will help me more at this point.

From what I know so far, for int to float, you have to convert the integer to binary, normalize the value by finding the significand, exponent, and fraction, and then output the value as a float from there?

As for float to int, do you need to separate the value into the significand, exponent, and fraction, and then reverse the instructions above to get the int value?

I tried following the instructions from this question: Casting float to int (bitwise) in C

But I could not understand it.

Also, can anyone explain why rounding is necessary for values greater than 23 bits when converting int to float?

Thank you in advance

+6
3 answers

First off, a paper you should consider reading if you want to better understand floating point pitfalls: "What Every Computer Scientist Should Know About Floating-Point Arithmetic," http://www.validlab.com/goldberg/paper.pdf

And now for some meat.

The following code is bare bones and attempts to produce an IEEE-754 single precision float from an unsigned int in the range 0 < value < 2^24. That is the format you are most likely to encounter on modern hardware, and it is the format you seem to be referring to in your original question.

IEEE-754 single precision floats are divided into three fields: a single sign bit, 8 exponent bits, and 23 significand bits (sometimes called the mantissa). IEEE-754 uses an implied hidden 1 in the significand, meaning that the significand is actually 24 bits wide. The bits are packed from left to right, with the sign bit in bit 31, the exponent in bits 30 .. 23, and the significand in bits 22 .. 0. The following diagram from Wikipedia illustrates:

[Figure: IEEE-754 single precision floating point format]

The exponent has a bias of 127, which means that the actual exponent associated with the floating point number is 127 less than the value stored in the exponent field. An exponent of 0 is therefore encoded as 127.

(Note: the full Wikipedia article may be of interest to you. Link: http://en.wikipedia.org/wiki/Single_precision_floating-point_format )

Therefore, the IEEE-754 bit pattern 0x40000000 is interpreted as follows:

  • Bit 31 = 0: Positive value
  • Bits 30 .. 23 = 0x80: Exponent = 128 - 127 = 1 (i.e. 2^1)
  • Bits 22 .. 0 are all 0: Significand = 1.00000000_00000000_0000000. (Note: I restored the hidden 1.)

So the value is 1.0 x 2^1 = 2.0.
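The field breakdown above can be sketched with nothing but shifts and masks. This is only an illustration: the `FloatFields` struct and the function names are made up for this example, and the exponent helper is meant for normal numbers only (no zeros, denormals, infinities, or NaNs).

```cpp
#include <cstdint>

// Hypothetical helper type: the three raw fields of a single precision float.
struct FloatFields {
    std::uint32_t sign;         // bit 31
    std::uint32_t exponent;     // bits 30..23, still biased by 127
    std::uint32_t significand;  // bits 22..0, hidden 1 not included
};

// Split a 32-bit IEEE-754 bit pattern into its fields.
FloatFields split_fields(std::uint32_t bits) {
    FloatFields f;
    f.sign        = bits >> 31;
    f.exponent    = (bits >> 23) & 0xFFu;
    f.significand = bits & 0x7FFFFFu;
    return f;
}

// Remove the bias of 127 to get the actual power of two.
int unbiased_exponent(const FloatFields& f) {
    return static_cast<int>(f.exponent) - 127;
}
```

For 0x40000000 this gives sign 0, a stored exponent of 0x80 (so an actual exponent of 1), and an all-zero significand field, matching the breakdown above.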

To convert an unsigned int in the limited range given above to IEEE-754 format, you can use a function like the one below. It performs the following steps:

  • Aligns the leading 1 of the integer with the position of the hidden 1 in the floating point representation.
  • While aligning the integer, records the total number of shifts made.
  • Masks away the hidden 1.
  • Using the number of shifts made, computes the exponent and merges it into the number.
  • Using reinterpret_cast, converts the resulting bit pattern into a float. This part is an ugly hack because it uses a type-punned pointer. You could also do this by abusing a union. Some platforms provide an intrinsic operation (such as _itof) to make this reinterpretation less ugly.

There are much faster ways to do this; this one is meant to be pedagogically useful, if not super efficient:

    float uint_to_float(unsigned int significand)
    {
        // Only support 0 < significand < 1 << 24.
        if (significand == 0 || significand >= 1 << 24)
            return -1.0;  // or abort(); or whatever you'd like here.

        int shifts = 0;

        // Align the leading 1 of the significand to the hidden-1
        // position.  Count the number of shifts required.
        while ((significand & (1 << 23)) == 0)
        {
            significand <<= 1;
            shifts++;
        }

        // The number 1.0 has an exponent of 0, and would need to be
        // shifted left 23 times.  The number 2.0, however, has an
        // exponent of 1 and needs to be shifted left only 22 times.
        // Therefore, the exponent should be (23 - shifts).  IEEE-754
        // format requires a bias of 127, though, so the exponent field
        // is given by the following expression:
        unsigned int exponent = 127 + 23 - shifts;

        // Now merge significand and exponent.  Be sure to strip away
        // the hidden 1 in the significand.
        unsigned int merged = (exponent << 23) | (significand & 0x7FFFFF);

        // Reinterpret as a float and return.  This is an evil hack.
        return *reinterpret_cast< float* >( &merged );
    }

You can make this process more efficient by using functions that detect the leading 1 in the number. (These sometimes go by names like clz for "count leading zeros" or norm for "normalize".)
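As a rough sketch of that idea, the shift loop can be replaced by a leading-zero count. The helper below is a hand-written, portable binary search rather than any particular platform intrinsic, and both function names are invented for this example. It also copies the bits with memcpy instead of a pointer cast, which is a substitution on my part, not part of the code above.

```cpp
#include <cstdint>
#include <cstring>

// Portable count-leading-zeros for a nonzero 32-bit value, done as a
// binary search over the bit positions. (Hypothetical helper name.)
int count_leading_zeros(std::uint32_t x) {
    int n = 0;
    if ((x & 0xFFFF0000u) == 0) { n += 16; x <<= 16; }
    if ((x & 0xFF000000u) == 0) { n += 8;  x <<= 8;  }
    if ((x & 0xF0000000u) == 0) { n += 4;  x <<= 4;  }
    if ((x & 0xC0000000u) == 0) { n += 2;  x <<= 2;  }
    if ((x & 0x80000000u) == 0) { n += 1; }
    return n;
}

// Same contract as uint_to_float: only 0 < value < 2^24 is supported.
float uint_to_float_clz(std::uint32_t value) {
    if (value == 0 || value >= (1u << 24)) return -1.0f;

    int leading_one = 31 - count_leading_zeros(value);  // bit index 0..23
    int shifts = 23 - leading_one;          // distance to the hidden-1 position
    std::uint32_t significand = value << shifts;
    std::uint32_t exponent = 127u + static_cast<std::uint32_t>(leading_one);
    std::uint32_t merged = (exponent << 23) | (significand & 0x7FFFFFu);

    float result;
    std::memcpy(&result, &merged, sizeof result);  // bit copy without aliasing issues
    return result;
}
```

On compilers that offer a count-leading-zeros intrinsic, the helper could presumably be swapped for it without touching the rest.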

You can also extend this to signed numbers by recording the sign, taking the absolute value of the integer, performing the steps above, and then putting the sign into bit 31 of the number.

For integers >= 2^24, the entire integer does not fit into the significand field of the 32-bit float format. This is why you need to round: you lose LSBs in order to make the value fit. Thus, multiple integers end up mapping to the same floating point pattern. The exact mapping depends on the rounding mode (round toward -Inf, round toward +Inf, round toward zero, round to nearest even). But the fact of the matter is that you cannot squeeze more than 24 significant bits into a 24-bit significand without some loss.

You can see this in terms of the code above. It works by aligning the leading 1 to the hidden-1 position. If the value were >= 2^24, the code would need to shift right, not left, and that necessarily shifts LSBs away. Rounding modes just tell you how to handle the bits that are shifted away.
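To make the right-shift case concrete, here is one possible sketch that handles the full unsigned 32-bit range with round-to-nearest-even. The function name is hypothetical, and it again uses memcpy in place of the pointer cast. Other rounding modes would only change the "round up?" test.

```cpp
#include <cstdint>
#include <cstring>

// Convert a 32-bit unsigned value to float, rounding to nearest with
// ties going to the even significand. (Hypothetical function name.)
float uint_to_float_rne(std::uint32_t value) {
    if (value == 0) return 0.0f;

    // Find the index of the leading 1.
    int top = 31;
    while ((value & (1u << top)) == 0) top--;

    std::uint32_t exponent = 127u + static_cast<std::uint32_t>(top);
    std::uint32_t significand;

    if (top <= 23) {
        // Fits exactly: shift left, nothing is lost.
        significand = value << (23 - top);
    } else {
        int shift = top - 23;  // 1..8 low bits fall off
        std::uint32_t dropped = value & ((1u << shift) - 1u);
        std::uint32_t half = 1u << (shift - 1);
        significand = value >> shift;
        // Round up when above halfway, or exactly halfway with an odd result.
        if (dropped > half || (dropped == half && (significand & 1u))) {
            significand++;
            if (significand == (1u << 24)) {  // carried out of 24 bits
                significand >>= 1;
                exponent++;
            }
        }
    }

    std::uint32_t bits = (exponent << 23) | (significand & 0x7FFFFFu);
    float result;
    std::memcpy(&result, &bits, sizeof result);
    return result;
}
```

With this sketch, 16777217 (2^24 + 1) is an exact tie and lands on 16777216.0, while 16777219 is a tie that rounds up to 16777220.0, so several integers share one float pattern as described above.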

+12

Have you checked the IEEE 754 floating point representation?

In 32-bit normalized form, it has a sign bit (the sign of the mantissa), an 8-bit exponent (excess-127, I think), and a 23-bit mantissa written like a "decimal" fraction, except that the leading "1." is dropped (it is always implied in this form) and the radix is 2, not 10. That is: the MSB of the mantissa is worth 1/2, the next bit 1/4, and so on.

+2

Joe Z's answer is elegant, but the range of input values is very limited. A 32-bit float can store all integer values from the following range:

[-2^24 ... +2^24] = [-16777216 ... +16777216]

and some other values ​​outside this range.
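Which values outside that range survive is easy to check with bit operations: trailing zero bits only move the exponent, so an integer converts exactly when what remains after stripping them fits in the 24-bit significand. A small sketch of that test (the function name is made up for this example):

```cpp
#include <cstdint>

// True if `value` converts to a 32-bit float without any rounding.
bool is_exact_in_float(std::int32_t value) {
    // Magnitude as unsigned, so that -2^31 is handled without overflow.
    std::uint32_t magnitude = value < 0
        ? 0u - static_cast<std::uint32_t>(value)
        : static_cast<std::uint32_t>(value);
    if (magnitude == 0) return true;

    // Trailing zero bits only affect the exponent, so strip them.
    while ((magnitude & 1u) == 0) magnitude >>= 1;

    // What is left must fit in the 24-bit significand (hidden 1 included).
    return magnitude < (1u << 24);
}
```

So 16777218 (2^24 + 2) is still exact, while 16777217 (2^24 + 1) is not.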

The entire range will be covered as follows:

    #include <string.h>  // memcpy

    float int2float(int value)
    {
        // handles all values from [-2^24...2^24]
        // outside this range only some integers may be represented exactly
        // this method will use truncation 'rounding mode' during conversion

        // we can safely reinterpret it as 0.0
        if (value == 0) return 0.0;

        unsigned int ret;
        float result;

        if ((unsigned int)value == (1U << 31)) // ie -2^31
        {
            // -(-2^31) = -2^31 so we'll not be able to handle it below - use const
            ret = 0xCF000000;
            memcpy(&result, &ret, sizeof result);
            return result;
        }

        unsigned int sign = 0;

        // handle negative values
        if (value < 0)
        {
            sign = 1U << 31;
            value = -value;
        }

        // right shift of a negative signed value is implementation-defined
        // (all compilers that I know do an arithmetic shift, copying the sign
        // into the MSB), hence using unsigned abs_value_copy for the shifts
        unsigned int abs_value_copy = value;

        // find leading one; bit 0 must be tested too, otherwise value == 1
        // would leave the loop without being normalized
        int bit_num = 31;
        int shift_count = 0;

        for (; bit_num >= 0; bit_num--)
        {
            if (abs_value_copy & (1U << bit_num))
            {
                if (bit_num >= 23)
                {
                    // need to shift right
                    shift_count = bit_num - 23;
                    abs_value_copy >>= shift_count;
                }
                else
                {
                    // need to shift left
                    shift_count = 23 - bit_num;
                    abs_value_copy <<= shift_count;
                }
                break;
            }
        }

        // exponent is biased by 127
        unsigned int exp = bit_num + 127;

        // clear leading 1 (bit #23) (it will implicitly be there but not stored)
        unsigned int coeff = abs_value_copy & ~(1U << 23);

        // move exp to the right place
        exp <<= 23;

        // copy the bit pattern into a float (memcpy avoids the aliasing
        // problems of casting the pointer)
        ret = sign | exp | coeff;
        memcpy(&result, &ret, sizeof result);
        return result;
    }

Of course, there are other ways to find the absolute value of an int (branchless ones, for instance). Similarly, counting leading zeros can also be done without branching, so treat this example as just an example :-).
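One possible branch-free absolute value, as a sketch: it works entirely on the unsigned bit pattern, so it does not depend on how the compiler shifts negative values, and it returns an unsigned result so that -2^31 comes out as 2^31 instead of overflowing. The function name is invented for this example.

```cpp
#include <cstdint>

// Branch-free magnitude of a 32-bit signed integer.
std::uint32_t abs_branchless(std::int32_t value) {
    std::uint32_t u = static_cast<std::uint32_t>(value);
    // mask is all ones for negative inputs, all zeros otherwise.
    std::uint32_t mask = 0u - (u >> 31);
    // For negative inputs this computes (~u) + 1, the two's complement negation.
    return (u ^ mask) - mask;
}
```

Because the result is already unsigned, it could feed the shift logic of int2float directly, which would also make the special case for -2^31 unnecessary.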

+1

Source: https://habr.com/ru/post/959086/

