Python string interning function replication attempt for non-strings

Question

Python string interning function replication attempt for non-strings

For an independent project, I wanted to do something like:

class Species(object): # immutable. def __init__(self, id): # ... (using id to obtain height and other data from file) def height(self): # ... class Animal(object): # mutable. def __init__(self, nickname, species_id): self.nickname = nickname self.species = Species(id) def height(self): return self.species.height()

As you can see, I really do not need more than one Species (id) instance for each identifier, but I will create it every time I create an Animal object with this id, and I would probably need several calls, for example, Animal(somename, 3) .

To solve this, I'm trying to make a class so that for two instances of it, say a and b, the following is always true:

 (a == b) == (a is b)

This is what Python does with string literals and is called internship . Example:

 a = "hello" b = "hello" print(a is b)

that print will give true (as long as the string is short enough if we use the python shell directly).

I can only guess how CPython does it (this is probably due to some C magic), so I am making my own version. So far I have:

 class MyClass(object): myHash = {} # This replicates the intern pool. def __new__(cls, n): # The default new method returns a new instance if n in MyClass.myHash: return MyClass.myHash[n] self = super(MyClass, cls).__new__(cls) self.__init(n) MyClass.myHash[n] = self return self # as pointed out on an answer, it better to avoid initializating the instance # with __init__, as that one called even when returning an old instance. def __init(self, n): self.n = n a = MyClass(2) b = MyClass(2) print a is b # <<< True

My questions:

a) Is my problem even worthy of a solution? Since my intended Speces object should be quite light and the maximum number of times that Animal can be called quite limited (imagine a Pokemon game: no more than 1000 copies, vertices)

b) If so, is this the right approach to solve my problem?

c) If this is not valid, could you talk about a simpler / cleaner / more pythonic way to solve this problem?

+5

python python-2.7

Dleep Oct 23 '15 at 0:33

source share

2 answers

To make this as general as possible, I will recommend a couple of things. One, inherits from namedtuple , if you want "true" immutability (usually people namedtuple this pretty well, but when you do interning, breaking the immutable invariant can cause much bigger problems). Secondly, use locks to ensure the safe behavior of threads.

Since this is rather complicated, I am going to provide a modified copy of the Species code with comments explaining this:

 import collections import operator import threading # Inheriting from a namedtuple is a convenient way to get immutability class Species(collections.namedtuple('SpeciesBase', 'species_id height ...')): __slots__ = () # Prevent creation of arbitrary values on instances; true immutability of declared values from namedtuple makes true immutable instances # Lock and cache, with underscore prefixes to indicate they're internal details _cache_lock = threading.Lock() _cache = {} def __new__(cls, species_id): # Switching to canonical name cls for class type # Do quick fail fast check that ID is in fact an int/long # If it int-like, this will force conversion to true int/long # and minimize risk of incompatible hash/equality checks in dict # lookup # I suspect that in CPython, this would actually remove the need # for the _cache_lock due to the GIL protecting you at the # critical stages (because no byte code is executing comparing # or hashing built-in int/long types), but the lock is a good idea # for correctness (avoiding reliance on implementation details) # and should cost little species_id = operator.index(species_id) # Lock when checking/mutating cache to make it thread safe try: with cls._cache_lock: return cls._cache[species_id] except KeyError: pass # Read in data here; not done under lock on assumption this might # be expensive and other Species (that already exist) might be # created/retrieved from cache during this time species_id = ... height = ... # Pass all the values read to the superclass (the namedtuple base) # constructor (which will set them and leave them immutable thereafter) self = super(Species, cls).__new__(cls, species_id, height, ...) with cls._cache_lock: # If someone tried to create the same species and raced # ahead of us, use their version, not ours to ensure uniqueness # If no one raced us, this will put our new object in the cache self = cls._cache.setdefault(species_id, self) return self

If you want to make interning for shared libraries (where users can be streamed, and you cannot trust them so as not to violate the invariance of immutability), then, as shown above, is the basic structure for work. This is quick, to minimize the opportunity for kiosks, even if the construction is heavy (in exchange for a possible reconstruction of the object more than once and the release of all but one copy, if many threads try to create it for the first time right away), etc.

Of course, if the construction is cheap, and the instances are small, then just write __eq__ (and maybe __hash__ if it is logically unchanged) and do with it,

0

Shadowranger Oct 23 '15 at 1:45

source share

Blckknght · Accepted Answer · 2015-10-23T01:05:50+0000

Yes, implementing the __new__ method, which returns a cached object, is an appropriate way to create a limited number of instances. If you do not expect to create a large number of instances, you can simply implement __eq__ and compare by value, and not with an identifier, but this will not hurt to do it this way.

Note that an immutable object should usually do all of its initialization in __new__ , not __init__ , since the latter is called after the object is created. In addition, __init__ will be called on any instance of the class that is returned from __new__ , so when cached, it will be called again every time a cached object is returned.

Also, the first argument to __new__ is a class object, not an instance, so you should probably call it cls , not self (you can use self instead of instance later in the method if you want!).

Python string interning function replication attempt for non-strings

More articles: