Why are two variables of type float different?

Question


I have two integer vectors of about 1000 elements each, and I want to check whether the sum of the squared elements is the same for both vectors. So I wrote the following code:

    std::vector<int> array1;
    std::vector<int> array2;
    ... // initialize array1 and array2; in the experiment both vectors hold
        // the same elements, but possibly in a different order.
        // For example: array1 = {1001, 2002, 3003, ...}
        //              array2 = {2002, 3003, 1001, ...}
    assert(array1.size() == array2.size());
    float sum_array1 = 0;
    float sum_array2 = 0;
    for (int i = 0; i < array1.size(); i++)
        sum_array1 += array1[i] * array1[i];
    for (int i = 0; i < array2.size(); i++)
        sum_array2 += array2[i] * array2[i];

I expected sum_array1 to be equal to sum_array2, but in my application they turned out to be different: sum_array1 = 1.2868639e+009 and sum_array2 = 1.2868655e+009. Next, I changed the type of sum_array1 and sum_array2 to double, as the following code shows:

    double sum_array1 = 0;
    double sum_array2 = 0;
    for (int i = 0; i < array1.size(); i++)
        sum_array1 += array1[i] * array1[i];
    for (int i = 0; i < array2.size(); i++)
        sum_array2 += array2[i] * array2[i];

This time sum_array1 is equal to sum_array2: sum_array1 = sum_array2 = 1286862225.0000000. My question is why this happens. Thank you.

+4

c++

feelfree Sep 06 '13 at 15:46


4 answers

In the two loops, you add the same numbers but in different orders. Once the sums exceed the largest integer value that a float can represent exactly, you start to lose precision, and the sums may differ slightly.
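A minimal sketch of this order dependence, using made-up inputs rather than the asker's data. The helper names sum_squares_float and sum_squares_double are hypothetical, and the sketch assumes float arithmetic is evaluated in IEEE single precision (the usual case on x86-64 and ARM64 builds without fast-math options).

```cpp
#include <vector>

// Hypothetical helper: accumulates the squares in a float,
// like the first version of the loops in the question.
float sum_squares_float(const std::vector<int>& values) {
    float sum = 0.0f;
    for (int v : values) {
        sum += static_cast<float>(v * v);
    }
    return sum;
}

// Same loop with a double accumulator, like the second version.
double sum_squares_double(const std::vector<int>& values) {
    double sum = 0.0;
    for (int v : values) {
        sum += static_cast<double>(v * v);
    }
    return sum;
}
```

Under those assumptions, sum_squares_float({4096, 1, 1}) yields 16777216 because the accumulator reaches 2^24 first and each following + 1 is rounded away, while sum_squares_float({1, 1, 4096}) yields 16777218. The double version returns 16777218 for both orders.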

Experiment for you:

    float n = 0;
    while (n != n + 1)
        n = n + 1;
    // Will this terminate? If so, what is n now?

If you run this, you will find that the loop actually terminates, which seems completely counterintuitive but is the correct behavior defined for IEEE single-precision floating-point arithmetic.

You can try the same experiment with float replaced by double. You will see the same strange behavior, but this time the loop stops only when n is much larger (around 2^53, about 9 × 10^15, so expect a very long run), because IEEE double-precision floating-point numbers have much more precision.
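As a sketch of the experiment above, the following runs the float loop to completion and, for double, tests the threshold directly instead of counting up to it. The function names are hypothetical, and the results assume IEEE 754 arithmetic with no excess intermediate precision and no fast-math options.

```cpp
#include <cmath>

// Runs the float experiment and returns the value at which the loop stops.
float first_float_absorbing_one() {
    float n = 0.0f;
    while (n != n + 1.0f) {
        n = n + 1.0f;
    }
    return n;
}

// Returns true if adding 1 to n leaves n unchanged in double precision.
bool double_absorbs_one(double n) {
    return n + 1.0 == n;
}
```

Under those assumptions the float loop stops at 16777216 (2^24), since 16777217 is not representable and rounds back down. For double, double_absorbs_one(std::ldexp(1.0, 53)) is true, while double_absorbs_one(std::ldexp(1.0, 52)) is false.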

+4

Timothy shields Sep 06 '13 at 15:56


A floating-point representation (usually IEEE 754) uses a finite number of bits to represent a value, so floating-point operations can lose precision.

Contrary to common sense, comparisons like a == ((a+1)-1) can evaluate to false if a is a floating-point variable (for example, when a is much smaller or much larger in magnitude than 1).

Solution:

To compare two floating-point numbers, you should use some kind of tolerance for the lost precision. That is, if two numbers differ by less than that tolerance, you consider them equal:

    // operator== cannot actually be overloaded for built-in types such as
    // float, so this sketch uses a named function instead.
    #include <cmath>
    #include <limits>

    bool nearly_equal(float lhs, float rhs)
    {
        // Machine epsilon is a sensible absolute tolerance only for values near 1.
        const float epsilon = std::numeric_limits<float>::epsilon();
        return std::abs(lhs - rhs) < epsilon;
    }
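For sums as large as the ones in the question (around 1.29e9), neighboring float values are 128 apart, so an absolute tolerance equal to machine epsilon would never treat two distinct results as equal. One common alternative, shown here only as a sketch, scales the tolerance by the magnitude of the operands. The name nearly_equal_relative and the rel_tol parameter are hypothetical, and choosing a suitable tolerance is left to the caller.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: treats a and b as equal when they differ by at most
// rel_tol times the larger of their magnitudes.
bool nearly_equal_relative(float a, float b, float rel_tol) {
    const float scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= rel_tol * scale;
}
```

With the two sums reported in the question, nearly_equal_relative(1.2868639e9f, 1.2868655e9f, 1e-5f) is true, while a tolerance of 1e-7f is too tight and gives false.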

+3

Manu343726 Sep 06 '13 at 15:49


A double has more bits and therefore holds more information than a float. When you add values to a float, the rounding happens at different points for sum_array1 and sum_array2.

Depending on the input values, you may hit the same problem with double as with float (if the values are large enough).

A web search for “everything you need to know about floating point numbers” will give you a good overview of the limitations and best solutions.

+2

josh poley Sep 06 '13 at 15:52


Mike seymour · Accepted Answer · 2013-09-06T16:00:54+0000

Floating-point values are finite in size and therefore can only represent real values with finite precision. This leads to rounding errors when you need higher precision than they can store.

In particular, when adding a small number (such as one of your summands) to a much larger number (such as your accumulator), the loss of precision can be quite large relative to the small number, giving a significant error; and the errors will vary depending on the order.

Typically, a float has 24 bits of precision, which corresponds to about 7 decimal digits. Your accumulator needs 10 decimal digits (about 30 bits), so you will lose precision. Typically, a double has 53 bits (about 16 decimal digits), so your result can be represented exactly.
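These figures can be checked with std::numeric_limits. The sketch below also tests whether the exact sum from the question, 1286862225, survives a round trip through each type. The names float_bits, double_bits and round_trips are hypothetical, and the stated results assume IEEE 754 float and double.

```cpp
#include <limits>

// Bits of precision in the significand (24 and 53 for IEEE 754 types).
constexpr int float_bits = std::numeric_limits<float>::digits;
constexpr int double_bits = std::numeric_limits<double>::digits;

// Returns true if the integer v converts to type F and back unchanged,
// meaning F can represent v exactly. Only meaningful for values well
// inside the range of long long.
template <typename F>
bool round_trips(long long v) {
    return static_cast<long long>(static_cast<F>(v)) == v;
}
```

With IEEE 754 types, round_trips<float>(1286862225LL) is false because the nearest float is 1286862208, while round_trips<double>(1286862225LL) is true.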

A 64-bit integer might be the best option here, since all the inputs are integers. Using an integer avoids the loss of precision, but carries a danger of overflow if the inputs are too large or too numerous.
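A minimal sketch of the integer approach, assuming the squares and their total fit in 64 bits (the helper name sum_squares_i64 is hypothetical). Each element is widened before multiplying so that the square itself cannot overflow a 32-bit int.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: exact sum of squares in a 64-bit integer.
// No overflow check is performed; the caller must ensure the total fits.
std::int64_t sum_squares_i64(const std::vector<int>& values) {
    std::int64_t sum = 0;
    for (int v : values) {
        const std::int64_t wide = v;  // widen first, then square
        sum += wide * wide;
    }
    return sum;
}
```

Because integer addition is exact, the result does not depend on the order of the elements: {1001, 2002, 3003} and {2002, 3003, 1001} both give 14028014.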

To minimize the error when you cannot use a sufficiently large accumulator, you can sort the input so that the smallest values are accumulated first; or you can use more sophisticated methods, such as Kahan summation.
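A sketch of Kahan (compensated) summation with a float accumulator, next to a naive loop for comparison. The function names are hypothetical. The compensation only works if the compiler does not reassociate floating-point operations, so this assumes a build without fast-math style options and with float arithmetic evaluated in single precision.

```cpp
#include <vector>

// Plain left-to-right summation in float.
float naive_sum(const std::vector<float>& values) {
    float sum = 0.0f;
    for (float x : values) {
        sum += x;
    }
    return sum;
}

// Kahan summation: carries the rounding error of each addition
// forward in a separate compensation term.
float kahan_sum(const std::vector<float>& values) {
    float sum = 0.0f;
    float compensation = 0.0f;  // running estimate of the lost low-order part
    for (float x : values) {
        const float y = x - compensation;  // apply the correction to the next term
        const float t = sum + y;           // low-order bits of y may be lost here
        compensation = (t - sum) - y;      // recover what was lost (negated)
        sum = t;
    }
    return sum;
}
```

Under those assumptions, naive_sum({16777216.0f, 1.0f, 1.0f}) returns 16777216 because each + 1 is rounded away, while kahan_sum returns the exact total 16777218 for the same input.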
