Why does a union consume more memory if the argument is a set?

I am puzzled by this memory allocation behavior of set:

    >>> set(range(1000)).__sizeof__()
    32968
    >>> set(range(1000)).union(range(1000)).__sizeof__()        # expected, set doesn't change
    32968
    >>> set(range(1000)).union(list(range(1000))).__sizeof__()  # expected, set doesn't change
    32968
    >>> set(range(1000)).union(set(range(1000))).__sizeof__()   # not expected
    65736

Why does passing a set as the argument double the amount of memory used by the resulting set? The result is identical to the original set in every case:

    >>> set(range(1000)) == set(range(1000)).union(range(1000)) == set(range(1000)).union(set(range(1000)))
    True

Note that the doubling does not happen with a plain iterator:

    >>> set(range(1000)).union(iter(list(range(1000)))).__sizeof__()
    32968

The update method shows the same pattern:

    >>> a = set(range(1000))
    >>> a.update(range(1000))
    >>> a.__sizeof__()
    32968
    >>> a.update(set(range(1000)))
    >>> a.__sizeof__()
    65736

At first I thought this was because, when union is called, it sees that the size of the other set is 1000 and therefore decides to allocate enough memory to hold all the elements of both sets, but then uses only part of that memory; whereas in the iterator case it simply iterates over it and adds the elements one by one (which does not consume more memory, since all the elements are already in the set).

But range is also a sequence, as is the list in the first example.

    >>> len(range(1000))
    1000
    >>> range(1000)[100]
    100

So why doesn't this happen with range and list, but only with set? Is there a design decision behind this, or is it a bug?




Tested on Python 2.7.3 and Python 3.2.3 on 64-bit Linux.

+12
python memory-management set
Mar 04 '13 at 9:14
1 answer

In Python 2.7.3, set.union() delegates to a C function called set_update_internal(). The latter uses several different implementations depending on the Python type of its argument. This multiplicity of implementations explains the difference in behavior between your tests.
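As a rough sketch of that dispatch (written in Python rather than the actual C, with made-up helper names for illustration), it looks something like this:

    # Rough Python paraphrase of set_update_internal()'s dispatch; the real code
    # is C, and _update_from_set / _update_from_dict are hypothetical helpers.
    def set_update_internal(so, other):
        if isinstance(other, (set, frozenset)):
            _update_from_set(so, other)    # pre-resizes the hash table, then inserts
        elif isinstance(other, dict):
            _update_from_dict(so, other)   # similar fast path over the dict's keys
        else:
            for key in other:              # generic iterable/iterator path:
                so.add(key)                # insert one key at a time, resizing only
                                           # when the table actually fills up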

The implementation used when the argument is a set makes the following assumption, stated in a comment in the C source:

    /* Do one big resize at the start, rather than
     * incrementally resizing as we insert new keys.  Expect
     * that there will be no (or few) overlapping keys. */

Clearly, the assumption that there will be no (or few) overlapping keys is wrong in your particular case. That is what leads to the final set over-allocating memory.
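In outline, that branch behaves roughly like the following minimal sketch (Python, not the real C; _resize_table is a made-up name and the exact growth factor is a CPython implementation detail):

    # Minimal sketch of the set-argument branch; _resize_table is hypothetical.
    def _update_from_set(so, other):
        expected = len(so) + len(other)   # worst case: no key of `other` is already present
        _resize_table(so, 2 * expected)   # one big resize up front
        for key in other:
            so.add(key)                   # overlapping keys are simply skipped, but the
                                          # oversized table is never shrunk back down

With your numbers, the table is sized for roughly 2000 distinct keys even though only 1000 end up being stored, which is why the result is about twice the size of the original set.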

I am not sure I would call this a bug. The implementer of set chose what looks to me like a reasonable trade-off, and you simply ended up on the wrong side of it.

The upside of the trade-off is that in many cases the up-front allocation results in better performance:

    In [20]: rhs = list(range(1000))

    In [21]: %timeit set().union(rhs)
    10000 loops, best of 3: 30 us per loop

    In [22]: rhs = set(range(1000))

    In [23]: %timeit set().union(rhs)
    100000 loops, best of 3: 14 us per loop

Here, the set version is about twice as fast, presumably because it does not keep reallocating memory while adding the elements from rhs.

If the over-allocation is a deal-breaker for you, there are several ways to work around it, some of which you have already discovered.
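For example, feeding the set through an iterator (as in your own test) forces the generic one-key-at-a-time path instead of the pre-resizing one; the exact byte counts depend on the platform, so this just contrasts the two code paths:

    a = set(range(1000))
    b = set(range(1000))

    u1 = a.union(b)         # set argument: table pre-sized for len(a) + len(b) keys
    u2 = a.union(iter(b))   # iterator argument: grows incrementally, no up-front over-allocation

    print(u1 == u2)                            # True -- identical contents
    print(u1.__sizeof__(), u2.__sizeof__())    # u1 is typically the larger one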

+9
Mar 04 '13 at 9:24


