Why are two variables of type float different?

I have two integer vectors of about 1000 elements each, and I want to check whether the sums of the squared elements of the two vectors are equal. So I wrote the following code:

    std::vector<int> array1;
    std::vector<int> array2;
    ... // initialize array1 and array2; in the experiment all elements
        // in the two vectors are the same, but their order may differ.
        // For example: array1={1001, 2002, 3003, ....}
        //              array2={2002, 3003, 1001, ....}
    assert(array1.size() == array2.size());
    float sum_array1 = 0;
    float sum_array2 = 0;
    for(int i=0; i<array1.size(); i++)
        sum_array1 +=array1[i]*array1[i];
    for(int i=0; i<array2.size(); i++)
        sum_array2 +=array2[i]*array2[i];

I expected sum_array1 to be equal to sum_array2, but in my application they were different: sum_array1 = 1.2868639e+009 and sum_array2 = 1.2868655e+009. Next, I changed the type of sum_array1 and sum_array2 to double, as in the following code:

    double sum_array1 = 0;
    double sum_array2 = 0;
    for(int i=0; i<array1.size(); i++)
        sum_array1 +=array1[i]*array1[i];
    for(int i=0; i<array2.size(); i++)
        sum_array2 +=array2[i]*array2[i];

This time sum_array1 is equal to sum_array2: sum_array1 = sum_array2 = 1286862225.0000000. My question is why this happens. Thank you.

4 answers

Floating-point values have a finite size and can therefore represent real numbers only with finite precision. This leads to rounding errors whenever you need more precision than they can store.

In particular, when you add a small number (such as one of your summands) to a much larger number (such as your accumulator), the precision lost can be large relative to the small number, which gives a significant error; and the error depends on the order of the additions.

Typically, a float has 24 bits of precision, which corresponds to about 7 decimal digits. Your accumulator needs 10 decimal digits (about 30 bits), so you lose precision. A double typically has 53 bits (about 16 decimal digits), so your result can be represented exactly.
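As a rough illustration of the 24-bit limit, the sketch below measures the gap between neighbouring float values and checks whether an integer survives a conversion to float. The helper names `float_spacing` and `fits_in_float` are made up for this example, and it assumes float is IEEE 754 single precision.

```cpp
#include <cmath>
#include <limits>

// Gap between x and the next representable float above it.
float float_spacing(float x)
{
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// True if the integer comes back unchanged after a round trip through float.
bool fits_in_float(long long v)
{
    return static_cast<long long>(static_cast<float>(v)) == v;
}
```

Under that assumption, `float_spacing(1286862225.0f)` is 128, so a float accumulator of that magnitude can only hold multiples of 128, and `fits_in_float(1286862225)` is false.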

A 64-bit integer might be the best option here, since all inputs are integers. Using an integer avoids the loss of precision, but carries a risk of overflow if the inputs are too numerous or too large.
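A minimal sketch of the integer approach, assuming the inputs are small enough that the total fits in 64 bits (the function name `sum_of_squares` is chosen for this example only):

```cpp
#include <cstdint>
#include <vector>

// Sum of squares with an exact 64-bit integer accumulator.
std::int64_t sum_of_squares(const std::vector<int>& values)
{
    std::int64_t sum = 0;
    for (int v : values) {
        // Widen before multiplying so the product is computed in 64 bits.
        sum += static_cast<std::int64_t>(v) * v;
    }
    return sum;
}
```

Because integer addition is exact as long as nothing overflows, the result does not depend on the order of the elements.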

To minimize the error when you cannot use a sufficiently wide accumulator, you can sort the input so that the smallest values are accumulated first; or you can use a more sophisticated method, such as Kahan summation.
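For reference, Kahan (compensated) summation could look roughly like the sketch below. The function name is hypothetical, and the code assumes strict IEEE 754 float arithmetic; aggressive options such as `-ffast-math` allow the compiler to reorder the operations, which can cancel the compensation.

```cpp
#include <vector>

// Compensated (Kahan) sum of squares using a float accumulator.
float kahan_sum_of_squares(const std::vector<int>& values)
{
    float sum = 0.0f;
    float compensation = 0.0f; // low-order part lost by earlier additions
    for (int v : values) {
        float term = static_cast<float>(v) * static_cast<float>(v);
        float y = term - compensation; // re-apply what was lost so far
        float t = sum + y;             // this addition may round
        compensation = (t - sum) - y;  // recover the rounding error of t
        sum = t;
    }
    return sum;
}
```

This reduces the error accumulated across additions, but it cannot recover precision that is already lost when a single square does not fit in a float.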


In the two loops you add the same numbers, but in a different order. Once the sums exceed the range of integers that a float can represent exactly, you start to lose precision, and the sums may differ slightly.
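The order dependence can be reproduced with three values. This is only a sketch: `float_sum` is a name invented here, and IEEE 754 single precision with round-to-nearest is assumed.

```cpp
#include <vector>

// Plain left-to-right summation in a float accumulator.
float float_sum(const std::vector<float>& values)
{
    float sum = 0.0f;
    for (float v : values) {
        sum += v;
    }
    return sum;
}
```

Summing {16777216, 1, 1} gives 16777216, because each added 1 is rounded away, while summing {1, 1, 16777216} gives 16777218, because the two small values are combined before they meet the large one.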

Experiment for you:

    float n = 0;
    while (n != n + 1)
        n = n + 1;
    // Will this terminate? If so, what is n now?

If you run this, you will find that the loop does terminate. This seems completely counter-intuitive, but it is the correct behavior as defined for IEEE single-precision floating-point arithmetic.

You can try the same experiment with float replaced by double. You will see the same strange behavior, but this time the loop stops when n is much larger, because IEEE double-precision floating-point numbers have much more precision.
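A sketch of the experiment as functions, assuming IEEE 754 arithmetic without excess intermediate precision (the names are made up for illustration). For double, counting up in steps of 1 would take far too long to finish in practice, so the second helper tests a candidate value directly instead of looping.

```cpp
// Counts up in steps of 1 and returns the first float where adding 1 has no effect.
float first_stuck_float()
{
    float n = 0.0f;
    for (;;) {
        float next = n + 1.0f; // stored as float, so the sum is rounded to float
        if (next == n) {
            return n;
        }
        n = next;
    }
}

// True if adding 1 to n leaves a double unchanged.
bool is_stuck(double n)
{
    double next = n + 1.0;
    return next == n;
}
```

Under those assumptions, `first_stuck_float()` returns 16777216 (2^24), and `is_stuck(9007199254740992.0)` (2^53) is true while `is_stuck(9007199254740991.0)` is false.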


A floating-point representation (usually IEEE 754) uses a fixed number of bits to store the significant digits, so floating-point operations can lose precision.

Contrary to intuition, a comparison like a == ((a+1)-1) can evaluate to false if a is a floating-point variable.
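Whether the comparison fails depends on the value of a. The sketch below spells out the steps; the function name is hypothetical, and IEEE 754 single precision with each intermediate result rounded to float is assumed.

```cpp
// Evaluates a == ((a + 1) - 1) with every step rounded to float.
bool survives_round_trip(float a)
{
    float plus_one = a + 1.0f;    // may drop low-order bits of a
    float back = plus_one - 1.0f; // cannot bring those bits back
    return back == a;
}
```

With these assumptions, `survives_round_trip(2.0f)` is true, while `survives_round_trip(0.1f)` is false because some low-order bits of 0.1f are rounded away when 1 is added.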

Solution:

To compare two floating-point values, you should use some kind of tolerance for the lost precision. That is, if two numbers differ by less than that tolerance, you treat them as equal:

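    #include <cmath>
    #include <limits>

    // operator== cannot be overloaded for built-in types, so use a named function.
    bool nearly_equal(float lhs, float rhs)
    {
        // epsilon() is the spacing of floats near 1.0, so scale it to the operands.
        float epsilon = std::numeric_limits<float>::epsilon();
        return std::abs(lhs - rhs) <= epsilon * std::fmax(std::abs(lhs), std::abs(rhs));
    }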

A double has more bits and therefore holds more information than a float. When you add values to a float, rounding happens at different points for sum_array1 and sum_array2.

Depending on the input values, you can run into the same problem with double as with float (if the values are large enough).

A web search for “everything you need to know about floating point numbers” will give you a good overview of the limitations and best solutions.


Source: https://habr.com/ru/post/1500960/

