Overflow detection when converting integral values to floating types

The C standard (to which, as far as I know, C++ defers on these matters) contains the following paragraph:

When a value of integer type is converted to a real floating type, if the value being converted can be represented exactly in the new type, it is unchanged. If the value being converted is in the range of values that can be represented but cannot be represented exactly, the result is either the nearest higher or nearest lower representable value, chosen in an implementation-defined manner. If the value being converted is outside the range of values that can be represented, the behavior is undefined.

Is there any way to check for the last case? It seems to me that this last kind of undefined behavior is unavoidable. If I have an integral value i and naively check something like

 i <= FLT_MAX 

I will (in addition to other accuracy-related problems) already run into it, because the comparison first converts i to float (in this case, or to whatever floating type is involved in general), so if i is out of that type's range, we get undefined behavior right there.

Or is there some guarantee about the relative sizes of integral and floating types that would imply something like "float can always represent all int values (not necessarily exactly)", or at least "long double can always hold everything", so that the comparison could be done in that type? I could not find anything of the kind.

This is basically a theoretical exercise, so I'm not interested in answers along the lines of "on most architectures these conversions always work". Let's try to find a way to detect such an overflow without assuming anything beyond the C (and C++) standard! :)

+5
2 answers

Overflow detection when converting integral to floating types

FLT_MAX and DBL_MAX are at least 1E+37 per the C spec, so every integer whose magnitude needs 122 bits or fewer converts to float without overflow on all conforming platforms. The same holds for double.
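
As an illustration of why 122 bits is the magic number (not part of the original answer, just a sketch written here in C++; the helper name fits_float_always is mine): an integer that fits in 122 bits is below 2^123, and FLT_MAX >= 1E+37 > 2^122, so a conservative compile-time check against the total bit width of the integer type is sufficient.

    #include <climits>  // CHAR_BIT

    // Conservative compile-time guard: if the integer type occupies at most
    // 122 bits in total (sign bit included), its values stay below 2^122,
    // and FLT_MAX >= 1E+37 > 2^122, so conversion to float never overflows.
    template <typename Int>
    constexpr bool fits_float_always() {
        return sizeof(Int) * CHAR_BIT <= 122;
    }

    static_assert(fits_float_always<long long>(),
                  "long long always converts to float without overflow");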


To handle the general case of 128/256/etc.-bit integers, both FLT_MAX and some_big_integer_MAX need to be whittled down.

Perhaps by taking the logarithm of both (bit_count() is user code, TBD):

 if(bit_count(unsigned_big_integer_MAX) > logbf(FLT_MAX)) problem(); 
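
A possible sketch of the bit_count() helper left as user code above; this is my own guess at its intent, and it assumes the maximum value fits in uintmax_t:

    #include <cstdint>  // std::uintmax_t

    // Counts the significant bits of a maximum value such as
    // unsigned_big_integer_MAX, assuming it fits in uintmax_t.
    int bit_count(std::uintmax_t max_value) {
        int bits = 0;
        while (max_value > 0) {
            max_value >>= 1;
            ++bits;
        }
        return bits;
    }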

Or, if the integer type has no padding bits:

 if(sizeof(unsigned_big_integer_MAX)*CHAR_BIT > logbf(FLT_MAX)) problem(); 

Note: using an FP function like logbf() instead of exact integer math can introduce edge cases that make the comparison come out wrong.


Macro magic can use crude tests like the following, which take advantage of the fact that BIGINT_MAX is certainly a power of 2 minus 1 and that dividing FLT_MAX by a power of 2 is certainly exact (unless FLT_RADIX == 10).

This preprocessor code will complain if converting the large integer type to float is problematic for some values of that type.

    #define POW2_61 0x2000000000000000u

    #if BIGINT_MAX/POW2_61 > POW2_61
      // BIGINT is at least a 122 bit integer
      #define BIGINT_MAX_PLUS1_div_POW2_61 ((BIGINT_MAX/2 + 1)/(POW2_61/2))
      #if BIGINT_MAX_PLUS1_div_POW2_61 > POW2_61
        #warning TBD code for an integer wider than 183 bits
      #else
        _Static_assert(BIGINT_MAX_PLUS1_div_POW2_61 <= FLT_MAX/POW2_61,
            "bigint too big for float");
      #endif
    #endif

[Edit 2]

Is there any way to check for the last case?

This code will detect whether converting the selected large integer value to float will work.

Of course, the test must occur before the conversion attempt.

Given various rounding modes or the rare FLT_RADIX == 10, the best simple test is a slightly conservative one: when it returns true, the conversion will work; yet for a small range of large integers for which the test below reports a problem, the conversion would actually be OK.

There may be a more refined approach that I still need to mull over, but I hope this provides a coding idea for the test the OP is looking for.

    #include <float.h>    // FLT_MAX
    #include <stdbool.h>  // bool
    #include <stdint.h>   // intmax_t, INTMAX_MAX

    #define POW2_60 0x1000000000000000u
    #define POW2_62 0x4000000000000000u
    #define MAX_FLT_MIN 1e37
    #define MAX_FLT_MIN_LOG2 (122 /* 122.911.. */)

    bool intmax_to_float_OK(intmax_t x) {
      #if INTMAX_MAX/POW2_60 < POW2_62
        (void) x;
        return true;  // All big integer values work
      #elif INTMAX_MAX/POW2_60/POW2_60 < POW2_62
        return x/POW2_60 < (FLT_MAX/POW2_60);
      #elif INTMAX_MAX/POW2_60/POW2_60/POW2_60 < POW2_62
        return x/POW2_60/POW2_60 < (FLT_MAX/POW2_60/POW2_60);
      #else
        #error TBD code
      #endif
    }
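
A usage sketch of my own (not from the answer), showing the test being run before the conversion as recommended; convert_checked is a hypothetical caller:

    #include <stdint.h>
    #include <stdio.h>

    // Hypothetical caller: run intmax_to_float_OK() first and convert
    // only when the test passes.
    void convert_checked(intmax_t i) {
        if (intmax_to_float_OK(i)) {
            float f = (float) i;  // conservatively known to be in range
            printf("%jd -> %f\n", i, (double) f);
        } else {
            printf("%jd might overflow float; skipping the conversion\n", i);
        }
    }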
+4

Here is a C++ template function that returns the largest positive integer that fits into both of the given types.

    #include <climits>      // CHAR_BIT
    #include <cmath>        // std::ceil, std::log2
    #include <limits>       // std::numeric_limits
    #include <type_traits>  // std::is_signed

    template<typename float_type, typename int_type>
    int_type max_convertible() {
      // Value bits of int_type: subtract the sign bit for signed types.
      static const int int_bits =
          sizeof(int_type) * CHAR_BIT - (std::is_signed<int_type>() ? 1 : 0);
      if ((int)std::ceil(std::log2(std::numeric_limits<float_type>::max())) > int_bits)
        return std::numeric_limits<int_type>::max();
      return (int_type) std::numeric_limits<float_type>::max();
    }

If the number you are converting is greater than the value returned by this function, it cannot be converted. Unfortunately, I'm having trouble finding a combination of types to test this with; it is very hard to find an integer type that will not fit into the smallest floating-point type.
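
A small usage sketch of my own, assuming the template above is in scope; the concrete types and value are just an example:

    #include <iostream>

    int main() {
        // Largest long long value this test considers convertible to float.
        const long long limit = max_convertible<float, long long>();

        long long value = 123456789LL;  // arbitrary example
        if (value <= limit)
            std::cout << "safe to convert: " << static_cast<float>(value) << '\n';
        else
            std::cout << "conversion could overflow float\n";
    }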

+1

Source: https://habr.com/ru/post/1271307/

