Python set intersection - which objects are returned

Question

I have a question about something that is not entirely clear in the Python documentation (https://docs.python.org/2/library/stdtypes.html#set.intersection).

When using set.intersection, does the resulting set contain the objects from the current set or from the other one, in the case where two objects have the same value but are different objects in memory?

I use this to compare a previous extract loaded from a file with a new one coming from the Internet. The two have some objects in common, and I want to update the old ones. Or maybe there is a simpler alternative to achieve this? It would be much easier for me if sets implemented __getitem__.

    oldApsExtract = set()
    if os.path.isfile("Apartments.json"):
        with open('Apartments.json', mode='r') as f:
            oldApsExtract = set(jsonpickle.decode(f.read()))
    newApsExtract = set(getNewExtract())
    updatedAps = oldApsExtract.intersection(newApsExtract)
    deletedAps = oldApsExtract.difference(newApsExtract)
    newAps = newApsExtract.difference(oldApsExtract)
    for ap in deletedAps:
        ap.mark_deleted()
    for ap in updatedAps:
        ap.update()
    saveAps = list(oldApsExtract) + list(newAps)
    with open('Apartments.json', mode='w') as f:
        f.write(jsonpickle.encode(saveAps))

python set

husvar Dec 30 '15 at 23:31

2 answers

Padraic cunningham · Answer 1 · 2015-12-30T23:50:30+0000

Which objects you get varies: if the sets are the same size, the intersecting elements come from b; if b has more elements, the objects from a are returned:

    i = "$foobar" * 100
    j = "$foob" * 100
    l = "$foobar" * 100
    k = "$foob" * 100
    print(id(i), id(j))
    print(id(l), id(k))
    a = {i, j}
    b = {k, l, 3}
    inter = a.intersection(b)
    for ele in inter:
        print(id(ele))

Output:

    35510304 35432016
    35459968 35454928
    35510304
    35432016

Now that they are the same size:

    i = "$foobar" * 100
    j = "$foob" * 100
    l = "$foobar" * 100
    k = "$foob" * 100
    print(id(i), id(j))
    print(id(l), id(k))
    a = {i, j}
    b = {k, l}
    inter = a.intersection(b)
    for ele in inter:
        print(id(ele))

Output:

    35910288 35859984
    35918160 35704816
    35704816
    35918160

The relevant part of the source is below. The key line is if (PySet_GET_SIZE(other) > PySet_GET_SIZE(so)): the result of that comparison determines which object is iterated over, and therefore which objects end up in the result.

    if (PySet_GET_SIZE(other) > PySet_GET_SIZE(so)) {
        tmp = (PyObject *)so;
        so = (PySetObject *)other;
        other = tmp;
    }

    while (set_next((PySetObject *)other, &pos, &entry)) {
        key = entry->key;
        hash = entry->hash;
        rv = set_contains_entry(so, key, hash);
        if (rv < 0) {
            Py_DECREF(result);
            return NULL;
        }
        if (rv) {
            if (set_add_entry(result, key, hash)) {
                Py_DECREF(result);
                return NULL;
            }

If you pass an object that is not a set, this does not apply and the length does not matter, because the objects from the iterable are always used:

    it = PyObject_GetIter(other);
    if (it == NULL) {
        Py_DECREF(result);
        return NULL;
    }

    while ((key = PyIter_Next(it)) != NULL) {
        hash = PyObject_Hash(key);
        if (hash == -1)
            goto error;
        rv = set_contains_entry(so, key, hash);
        if (rv < 0)
            goto error;
        if (rv) {
            if (set_add_entry(result, key, hash))
                goto error;
        }
        Py_DECREF(key);

When you pass an iterable, first, it could be an iterator, so its size cannot be checked without consuming it; and if you pass a list, each lookup in it would be O(n), so it makes sense to simply iterate over the iterable that was passed in. Conversely, if you have one set of 1,000,000 elements and one of 10, it makes sense to check whether each of the 10 is in the set of 1,000,000, rather than checking whether any of the 1,000,000 is in your set of 10. Since a lookup is O(1) on average, that means a linear pass over 10 elements instead of a linear pass over 1,000,000 elements.
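The two C branches quoted above can be modeled in a few lines of Python. This is only a sketch of the selection logic, not CPython's actual implementation; `intersection_model` and the `Tag` helper class are hypothetical names introduced for illustration, with `Tag.origin` recording which collection an object came from.

```python
class Tag:
    """Hypothetical helper: equal by value, but remembers where it came from."""

    def __init__(self, value, origin):
        self.value = value
        self.origin = origin

    def __eq__(self, other):
        return isinstance(other, Tag) and self.value == other.value

    def __hash__(self):
        return hash(self.value)


def intersection_model(so, other):
    """Sketch of the logic in the C excerpts; not the real implementation."""
    if isinstance(other, (set, frozenset)) and len(other) > len(so):
        # Set branch: swap so that the smaller set is the one iterated over.
        so, other = other, so
    result = set()
    # Whatever is iterated over supplies the objects stored in the result.
    for key in other:
        if key in so:
            result.add(key)
    return result


a = {Tag(1, "a"), Tag(2, "a")}
bigger = {Tag(1, "b"), Tag(2, "b"), Tag(3, "b")}
same_size = {Tag(1, "b"), Tag(2, "b")}
as_list = [Tag(1, "b"), Tag(2, "b"), Tag(3, "b"), Tag(4, "b")]

origins_bigger = {t.origin for t in intersection_model(a, bigger)}
origins_same = {t.origin for t in intersection_model(a, same_size)}
origins_list = {t.origin for t in intersection_model(a, as_list)}
print(origins_bigger, origins_same, origins_list)  # {'a'} {'b'} {'b'}
```

In this model, the objects come from a only when the other set is strictly larger; with equal sizes, or with a non-set iterable, they come from the argument.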

If you look at wiki.python.org/moin/TimeComplexity, this is backed up:

    Intersection s&t
    Average case: O(min(len(s), len(t)))
    Worst case: O(len(s) * len(t))
    Note: replace "min" with "max" if t is not a set

So when we pass an iterable that is not a set, we should always get the objects from b:

    i = "$foobar" * 100
    j = "$foob" * 100
    l = "$foobar" * 100
    k = "$foob" * 100
    print(id(i), id(j))
    print(id(l), id(k))
    a = {i, j}
    b = [k, l, 1, 2, 3]
    inter = a.intersection(b)
    for ele in inter:
        print(id(ele))

You get objects from b:

    20854128 20882896
    20941072 20728768
    20941072
    20728768

If you really want to control which objects you end up with, do the iteration and lookups yourself, keeping whichever object you want.
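The "iterate and search" advice can be sketched as a small helper that always keeps the objects from whichever collection is named first. `intersection_keeping` is a hypothetical name, and the `Apartment` class is only an assumed stand-in for the question's objects, which are taken here to compare equal by an identifier.

```python
class Apartment:
    """Assumed stand-in: equality and hashing use ap_id only."""

    def __init__(self, ap_id, price):
        self.ap_id = ap_id
        self.price = price

    def __eq__(self, other):
        return isinstance(other, Apartment) and self.ap_id == other.ap_id

    def __hash__(self):
        return hash(self.ap_id)


def intersection_keeping(preferred, other):
    """Return the common elements, always as the objects from `preferred`."""
    # Build a set once so that each membership test is O(1) on average.
    other_set = other if isinstance(other, (set, frozenset)) else set(other)
    return {item for item in preferred if item in other_set}


old = {Apartment(1, 100), Apartment(2, 200)}
new = {Apartment(2, 250), Apartment(3, 300)}

kept_old = intersection_keeping(old, new)
kept_new = intersection_keeping(new, old)
print([ap.price for ap in kept_old])  # [200]
print([ap.price for ap in kept_new])  # [250]
```

Because the comprehension iterates over `preferred`, the result does not depend on the relative sizes of the two collections, at the cost of a full pass over `preferred` even when it is the larger one.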

Untitled123 · Answer 2 · 2015-12-30T23:37:14+0000

One thing you can do is use Python dictionaries instead. Access is still O(1), the elements are easy to get at, and a simple loop like the following gives you the intersection:

    res = []
    for item in dict1:
        # dict.has_key() was removed in Python 3; `in` works in both versions
        if item in dict2:
            res.append(item)

The advantage here is that you have complete control over what is happening and you can customize it as you need. For example, you can also do things like:

    if item in dict1:
        dict1[item] = updatedValue
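For the update scenario in the question, one way to apply this idea is to key each dictionary by an identifier and keep the objects as values, so the old object is always reachable by lookup. This is a sketch under assumptions: the integer keys, the "price" field, and the "deleted" flag are made up for illustration. In Python 3, dictionary key views support set operations such as `&` and `-` directly.

```python
# Hypothetical data: apartment id -> record
old = {1: {"price": 100}, 2: {"price": 200}}
new = {2: {"price": 250}, 3: {"price": 300}}

# Key views behave like sets; each expression returns a new set.
common = old.keys() & new.keys()
deleted = old.keys() - new.keys()
added = new.keys() - old.keys()

for key in common:
    old[key].update(new[key])   # mutate the old object in place
for key in added:
    old[key] = new[key]         # adopt the new object
for key in deleted:
    old[key]["deleted"] = True  # mark instead of dropping

print(old)
```

Since the objects are looked up by key, there is no ambiguity about which side an object comes from.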
