Floating point representations seem to do integer arithmetic correctly - why?

I have been playing a bit with floating point numbers and, based on what I learned about them in the past, the fact that 0.1 + 0.2 results in something like 0.30000000000000004 does not surprise me.

What surprises me, however, is that integer arithmetic always works just fine and does not have any of these artifacts.

I first noticed this in JavaScript (Chrome V8 in node.js):

    0.1 + 0.2 == 0.3                                  // false, NOT surprising
    123456789012 + 18 == 123456789030                 // true
    22334455667788 + 998877665544 == 23333333333332   // true
    1048576 / 1024 == 1024                            // true

C++ (gcc on Mac OS X) seems to have the same properties.

The end result is that integers simply - for lack of a better word - work. It is only when I start using decimal fractions that things become awkward.

Is this a design feature, a mathematical artifact, or some kind of optimization performed by compilers and runtimes?

+4
8 answers

I am writing this under the assumption that JavaScript uses a double-precision floating-point representation for all numbers.

Some numbers have an exact representation in binary floating point format, in particular all integers x with |x| < 2^53. Some numbers do not, in particular fractions such as 0.1 or 0.2, which become infinitely repeating fractions in binary representation.

If all operands and the result of an operation have exact representations, then it is safe to compare the result with == .
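As a quick illustration, here is a minimal sketch in JavaScript (toPrecision just prints more digits of the double that is actually stored; the exact digits shown are what a typical engine produces):

    // Integers below 2^53 are stored exactly, so == behaves as expected.
    console.log(123456789012 + 18 === 123456789030);  // true

    // 0.1 has no finite binary expansion, so the stored double is only close to 0.1.
    console.log((0.1).toPrecision(20));                // "0.10000000000000000555"
    console.log(0.1 + 0.2 === 0.3);                    // false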

Related questions:

Which numbers can only be represented as an approximation in binary?

Why can't decimal numbers be represented exactly in binary?

+3

Is this a design feature, a mathematical artifact, or some kind of optimization performed by compilers and runtimes?

This is a feature of real numbers. A theorem from modern algebra (modern algebra, not high school algebra; math majors take a class in modern algebra after their basic calculus and linear algebra classes) says that, given an integer base b > 1, any positive real number r can be expressed as r = a * b^p, where a is in [1, b) and p is some integer. For example, 1024 = 1.024 * 10^3 in base 10. It is this theorem that justifies our use of scientific notation.

This number a can be classified as terminating (for example, 1.0), repeating (for example, 1/3 = 0.333...) or non-repeating (for example, the representation of pi). There is a slight problem with terminating numbers: any terminating number can also be represented as a repeating number. For example, 0.999... and 1 are the same number. This ambiguity in representation can be resolved by specifying that numbers which can be represented as terminating numbers are represented as such.

What you have discovered is a consequence of the fact that all integers have a terminating representation in any base.

There is a problem with how the reals are represented on a computer. Just as int and long long int do not represent all integers, float and double do not represent all reals. The scheme used on most computers represents a real number r as r = a * 2^p, but with the mantissa (or significand) a truncated to a certain number of bits and the exponent p limited to some finite range. This means that some integers cannot be represented exactly. For example, although a googol (10^100) is an integer, its floating point representation is not exact. The binary representation of a googol is a 333-bit number, and that 333-bit mantissa is truncated to 52 + 1 bits.

The consequence of this is that double precision arithmetic is no longer exact, even for integers, once the integers are greater than 2^53. Try the experiment with unsigned long long int values between 2^53 and 2^64: you will find that double precision arithmetic is no longer exact for these large integers.
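A rough sketch of that experiment in JavaScript (the question's own language) rather than C++, since every JavaScript Number is already a double:

    const limit = Math.pow(2, 53);         // 9007199254740992
    console.log(limit - 1 === limit - 2);  // false: below 2^53 every integer is a distinct double
    console.log(limit + 1 === limit);      // true: 2^53 + 1 has no double of its own and rounds to 2^53
    console.log(Number.MAX_SAFE_INTEGER);  // 9007199254740991, i.e. 2^53 - 1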

+4

Integers within the representable range are represented exactly by the machine; floats are not (well, most of them are not).

If by "basic integer math works" you mean that it behaves like a proper mathematical function, then yes, you can assume that a correctly implemented arithmetic does.

+2

The reason is that you can represent every integer (1, 2, 3, ...) exactly in binary (0001, 0010, 0011, ...).

This is why integer arithmetic is always correct: 0011 - 0001 is always 0010. The problem with floating point numbers is that the part after the decimal point often cannot be converted exactly to binary.
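A small sketch of both halves of that claim in a JavaScript console, using toString(2), which prints the binary expansion of the double that is actually stored:

    console.log((3).toString(2));    // "11"   - every integer has a finite binary form
    console.log((0.5).toString(2));  // "0.1"  - 1/2 is a power of two, so it is exact too
    console.log((0.1).toString(2));  // "0.000110011001100..." - the repeating pattern is cut off
                                     //          at 53 significant bits, so 0.1 is only approximated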

+2

All the cases that you say "work" are ones where the numbers involved can be represented exactly in floating point format. You will find that adding 0.25, 0.5 and 0.125 also works exactly, because those values can likewise be represented exactly as binary floating point numbers.

It is only with values that cannot be represented exactly, like 0.1, that you get what looks like an inaccurate result.
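A small sketch of that distinction in JavaScript (0.25, 0.5 and 0.125 are 1/4, 1/2 and 1/8, each a power of two):

    console.log(0.25 + 0.5 + 0.125 === 0.875);  // true: every value here is a finite binary fraction
    console.log(0.1 + 0.2 === 0.3);             // false: none of these has a finite binary expansion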

+1

Integers are exact because the inaccuracy arises mainly from the way we write decimal fractions, and secondly because many rational numbers simply do not have finite representations in any given base.

See fooobar.com/questions/1105739/... for a full explanation.

+1

This only stops working when you add a sufficiently small integer to a very large integer - and even then, it is because you cannot represent both integers (and their exact sum) in floating point format.
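For example, a small sketch in JavaScript of the small-plus-huge case (10^16 lies above 2^53, where adjacent doubles are 2 apart):

    const big = 1e16;              // 10^16 itself happens to be exactly representable
    console.log(big + 1 === big);  // true:  10^16 + 1 has no double of its own, so the 1 is lost
    console.log(big + 2 === big);  // false: 10^16 + 2 is representable, so this addition is exact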

0

Not all floating point numbers can be represented exactly; this is because of the way they are encoded. The wiki page explains it better than I can: http://en.wikipedia.org/wiki/IEEE_754-1985 . Therefore, when you compare floating point numbers, you should use a delta:

    Math.abs(myFloat - expectedFloat) < delta

You can use the smallest representable floating point number as delta.
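A minimal sketch of such a comparison in JavaScript, with a hypothetical helper nearlyEqual and Number.EPSILON (the gap between 1 and the next larger double) as the delta; for values far from 1 you would normally scale the tolerance, so treat the constant as an illustrative choice only:

    function nearlyEqual(a, b, delta = Number.EPSILON) {
      // Compare the absolute difference against a tolerance instead of using ==.
      return Math.abs(a - b) < delta;
    }

    console.log(0.1 + 0.2 === 0.3);            // false
    console.log(nearlyEqual(0.1 + 0.2, 0.3));  // true: the difference is about 5.5e-17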

-1

Source: https://habr.com/ru/post/1441996/

