Integer conversion in floating point arithmetic

I am currently facing the following dilemma:

1.0f * INT32_MAX != INT32_MAX

Evaluating 1.0f * INT32_MAX actually gives me INT32_MIN

I'm not completely surprised by this; I know that conversions between floating point and integer types are not always exact.

What is the best way to solve this problem?

The code I'm writing scales an array of rational numbers from -1.0f <= x <= 1.0f to INT32_MIN <= x <= INT32_MAX

Here is what the code looks like:

void convert(int32_t * dst, const float * src, size_t count){
    size_t i = 0;
    for (i = 0; i < count; i++){
        dst[i] = src[i] * INT32_MAX;
    }
}

Here is what I ended up with:

void convert(int32_t * dst, const float * src, size_t count){
    size_t i = 0;
    for (i = 0; i < count; i++){
        double tmp = src[i];
        if (src[i] > 0.0f){
            tmp *= INT32_MAX;
        } else {
            tmp *= INT32_MIN;
            tmp *= -1.0;
        }
        dst[i] = tmp;
    }
}
1 answer

In IEEE754, 2147483647 cannot be represented in a single-precision float. A quick test shows that the result of 1.0f * INT32_MAX is rounded to 2147483648.0f, which cannot be represented in an int.
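
A minimal sketch of such a quick test, assuming IEEE754 single precision for float. The helper names fits_in_float and int32_max_as_float are made up for this illustration and are not from the question.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if v survives a round trip through float.
 * The volatile store forces the value to be rounded to a real 32-bit
 * float instead of staying in a wider register. */
static bool fits_in_float(int32_t v)
{
    volatile float f = (float)v;
    return (double)f == (double)v;
}

/* Hypothetical helper: the float nearest to INT32_MAX, widened to
 * double so it can be inspected without any int conversion. */
static double int32_max_as_float(void)
{
    volatile float f = (float)INT32_MAX;
    return (double)f;
}
```

Under that assumption, int32_max_as_float() compares equal to 2147483648.0 and fits_in_float(INT32_MAX) is false, while fits_in_float(INT32_MIN) is true because -2147483648 is a power of two.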

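In other words, it is the conversion to int that causes the problem, not the float arithmetic itself, whose result is only off by 1!

The fix is to use double for the intermediate calculation. 2147483647.0 is exactly representable in double precision, so the problem does not occur.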
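
One way to do the intermediate calculation in double might look like the sketch below. It keeps the convert signature from the question; the helper scale_to_int32, the clamping of out-of-range inputs, and the mapping of NaN to 0 are assumptions added for this illustration. Note that it scales symmetrically, so -1.0f maps to -INT32_MAX rather than INT32_MIN.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: scale one sample in double precision. */
static int32_t scale_to_int32(float x)
{
    /* 2147483647.0 is exact in double, unlike in float. */
    double tmp = (double)x * 2147483647.0;

    if (tmp != tmp) {
        return 0;               /* NaN: assumed policy, map to 0 */
    }
    if (tmp >= 2147483647.0) {
        return INT32_MAX;       /* clamp inputs at or above 1.0f */
    }
    if (tmp <= -2147483648.0) {
        return INT32_MIN;       /* clamp inputs far below -1.0f */
    }
    return (int32_t)tmp;        /* in range: truncates toward zero */
}

void convert(int32_t *dst, const float *src, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        dst[i] = scale_to_int32(src[i]);
    }
}
```

The clamping is there because converting an out-of-range floating-point value to an integer type is undefined behavior in C, so the cast is only reached for values that fit.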


Source: https://habr.com/ru/post/1622522/

