Why does adding multiple nan to the python dictionary produce multiple entries?

Question

Why does adding multiple nan to the python dictionary produce multiple entries?

Example problem:

    import numpy as np

    dc = dict()
    # np.float was an alias for the built-in float; recent NumPy releases have removed it
    dc[np.float('nan')] = 100
    dc[np.float('nan')] = 200

It creates several entries for nan. For example, printing dc shows {nan: 100, nan: 200}, but I expected {nan: 200}.

durjoy Jul 25 '17 at 10:18

3 answers

randomir · Answer 1 · 2017-07-25T14:28:00+0000

The short answer to your question (why adding NaN keys to a Python dict creates multiple entries) is that floating-point NaNs are unordered, i.e. a NaN value is not equal to, greater than, or less than anything, including itself. This behavior is defined in the IEEE 754 standard for floating-point arithmetic. An explanation of why this is so is given by an IEEE 754 committee member in this answer.

For a longer, Python-specific answer, let's first look at how item insertion and key comparison work in CPython dictionaries.

When you write d[key] = val, PyDict_SetItem() is called for the dictionary d, which in turn calls the (internal) insertdict(), which either updates an existing dictionary item or inserts a new one (possibly resizing the hash table as a result).

The first step of an insert is to look up the key in the hash table of dictionary keys. The general-purpose lookup function that gets called in your case (non-string keys) is lookdict().

lookdict uses the key's hash value to find the key: it iterates over the candidate keys with the same hash value, comparing first by address and then by calling the keys' equality operator(s) (see the excellent comments in Objects/dictobject.c for more information on how Python's implementation of open addressing resolves hash collisions).
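The order of checks described above can be sketched in pure Python. This is only an illustration under stated assumptions, not CPython's actual code: keys_match is a hypothetical helper, and the real lookdict also handles probing, deleted slots, and resizing.

```python
def keys_match(stored_key, stored_hash, probe_key):
    """Hypothetical helper mirroring the order of checks described above."""
    if stored_hash != hash(probe_key):
        return False                  # different hash: cannot be the same key
    if stored_key is probe_key:
        return True                   # same object: match without calling ==
    return stored_key == probe_key    # otherwise fall back to the equality operator


nan_a = float('nan')
nan_b = float('nan')

same_object = keys_match(nan_a, hash(nan_a), nan_a)        # identity check succeeds
different_objects = keys_match(nan_a, hash(nan_a), nan_b)  # neither identity nor == succeeds
ordinary_floats = keys_match(1.0, hash(1.0), 1.0 + 0.0)    # equal values match through ==
```

The identity check is what lets a single NaN object find itself again, even though it never compares equal to anything.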

Each float('nan') has the same hash value, but each one is a different object (with a different "identity", i.e. memory address), and they do not compare equal by float value:

    >>> a, b = float('nan'), float('nan')
    >>> hash(a), hash(b)
    (0, 0)
    >>> id(a), id(b)
    (94753433907296, 94753433907272)
    >>> a == b
    False

So when you write:

    d = dict()
    d[float('nan')] = 1
    d[float('nan')] = 2

lookdict will look for the second NaN using its hash (0), then try to resolve the hash collision by iterating over the keys with the same hash and comparing them first by identity/address (they differ), then by calling the (expensive) PyObject_RichCompareBool / do_richcompare, which in turn calls float_richcompare, which compares floats the same way C does:

    /* Comparison is pretty much a nightmare.  When comparing float to float,
     * we do it as straightforwardly (and long-windedly) as conceivable, so
     * that, e.g., Python x == y delivers the same result as the platform
     * C x == y when x and/or y is a NaN.

which behaves according to the IEEE 754 standard (from the GNU C Library documentation):

20.5.2 Infinity and NaN
[...]
Basic operations and mathematical functions all accept infinity and NaN and produce sensible output. Infinity propagates through calculations as one would expect: for example, 2 + ∞ = ∞, 4/∞ = 0, atan(∞) = π/2. NaN, on the other hand, infects any calculation that involves it. Unless the calculation would produce the same result no matter what real value replaced NaN, the result is NaN.
In comparison operations, positive infinity is larger than all values except itself and NaN, and negative infinity is smaller than all values except itself and NaN. NaN is unordered: it is not equal to, greater than, or less than anything, including itself. x == x is false if x is NaN. You can use this to test whether a value is NaN or not, but the recommended way to test for NaN is with the isnan function (see Floating Point Classes). In addition, <, >, <=, and >= will raise an exception when applied to NaNs.

and which will return false for NaN == NaN .
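The comparison rules quoted above can be checked from Python with a minimal sketch like the following. One caveat, stated as my understanding rather than something the quoted text spells out: the "exception" it mentions is a floating-point exception at the C level, and ordered comparisons on a NaN in Python simply evaluate to False without raising anything.

```python
import math

x = float('nan')

equal_to_itself = (x == x)          # False: NaN is unordered
differs_from_itself = (x != x)      # True, for the same reason
less_than_one = (x < 1.0)           # False; no Python exception is raised
greater_than_one = (x > 1.0)        # False
detected_by_isnan = math.isnan(x)   # True: the recommended way to test for NaN
```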

Therefore, Python decides that the second NaN object deserves a new dictionary entry. It may have the same hash, but both its address and the equality test say it is different from every other NaN object.

Note, however, that if you always use the same NaN object (with the same address), you get the expected behavior, because the address is checked before the float equality comparison:

    >>> nan = float('nan')
    >>> d = dict()
    >>> d[nan] = 1
    >>> d[nan] = 2
    >>> d
    {nan: 2}
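Lookups follow the same rule as insertion, which the short sketch below illustrates (the variable names are only for illustration). Note that from Python 3.10 onward, the hash of a float NaN is derived from the object's identity rather than being 0, so the hash values shown earlier may differ on newer interpreters. The outcome is the same either way.

```python
d = {}
first = float('nan')
d[first] = 1
d[float('nan')] = 2                    # a distinct NaN object, so a new entry

entry_count = len(d)                   # two separate NaN keys
found_by_same_object = first in d      # the identity check succeeds
found_by_new_nan = float('nan') in d   # a fresh NaN matches no stored key
value_for_first = d[first]             # retrievable only through the same object
```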

perigon · Answer 2 · 2017-07-25T10:23:43+0000

For historical reasons explained here, np.float('nan') == np.float('nan') is False. The rule is only that you cannot have two dictionary keys that are equal to each other, so you can have two keys that are both np.float('nan').

Of course, this behavior is counterintuitive and surprising, so you should avoid using np.float('nan') as a key.

Ofer sadan · Answer 3 · 2017-07-25T10:23:18+0000

As mentioned in the comment on your question, nan is never "equal" to another nan, so your dict creates a new key for it. This is how nan values behave in most languages, not just Python.

I would suggest not using it as a key at all, or at least explaining what you are trying to achieve, so that we can find a better way to do it without falling into such traps.

In your case, you can check this behavior for yourself:

    a = list(dc.keys())
    print(a[0] == a[1])  # prints False

The output of the code above (False) means that the dictionary really does hold distinct keys that do not compare equal.
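If NaN keys cannot be avoided entirely, one possible workaround is to map every NaN to a single shared object before using it as a key. The sketch below is an assumption of mine, not something proposed in the answers: CANONICAL_NAN and normalize_key are hypothetical names, and the isinstance check only covers the built-in float and its subclasses.

```python
import math

CANONICAL_NAN = float('nan')   # one shared NaN object (hypothetical name)


def normalize_key(key):
    """Map every float NaN to the same object; leave other keys unchanged."""
    if isinstance(key, float) and math.isnan(key):
        return CANONICAL_NAN
    return key


dc = {}
dc[normalize_key(float('nan'))] = 100
dc[normalize_key(float('nan'))] = 200   # overwrites, since the key is the same object

entry_count = len(dc)
stored_value = dc[CANONICAL_NAN]
```

This works only because the dictionary checks identity before equality, so every lookup has to go through normalize_key as well.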
