What is the most efficient way to assign a unique integer id to a string?

The program that I write processes a large number of objects, each with its own unique identifier, which itself is a string of complex structure (a dozen unique fields of an object connected by some kind of separator) and a long length.

Since I have to quickly process many of these objects, and I need to re-access them by id during processing, and I don’t have the ability to change their format (I extract them from the outside, over the network), I want to display their complex string id to to my own internal identifier of an integer and then use it for comparison, for further transferring them to other processes, etc.

What I'm going to do is use a simple dict with keys as the string identifier of the object and integer values ​​as my internal integer identifier.

My question is: is there a better way in Python to do this? Maybe there is a way to calculate some hash manually, independently? Maybe dictating is not the best solution?

As for numbers: in the system there are approximately 100K of such unique objects at a time, so the integer capacity is more than enough.

+4
source share
6 answers

For comparison, you can intern strings and then compare them with is instead of == , which makes simple pointer comparisons and should be as fast as (or faster) by comparing two integers:

 >>> 'foo' * 100 is 'foo' * 100 False >>> intern('foo' * 100) is intern('foo' * 100) True 

intern guarantees that id(intern(A)) == id(intern(B)) iff A == B Be sure to intern any string as soon as it is entered. Note that intern is called sys.intern in Python 3.x.

But when you need to pass these lines to other processes, your dict solution seems to be the best. What I usually do in situations like this

 str_to_id = {} for s in strings: str_to_id.setdefault(s, len(str_to_id)) 

therefore the integer capacity is more than enough

Python integers are bigint, so this should never be a problem.

+9
source

What about the hash function?

 In [130]: hash Out[130]: <function hash> In [131]: hash('foo') Out[131]: -740391237 

There is no need to store hashes (if you don’t want to): the point is that they are equal for objects that are equal in value (although the converse can be wrong - there is no doubt about unequal strings or other objects, the hash to a single value is as follows nature of hashing).

If you know the range of your keys (and probably you do), you can also use the perfect hash generator. This is apparently for python: http://ilan.schnell-web.net/prog/perfect-hash/

Ideal hashes ensure that keys in a specified range have a bijective relationship with their hash value.

+5
source

You can use one of the hashlib algorithms to create a cryptographically sound digest for a long message, and then use it as dictionary keys. Example of using SHA-256:

 import hashlib ... key = hashlib.sha256(longMessage).digest() 

The chance of collision is much less than when using hash (longMessage).

However, this can lead to potentially large overhead. If memory usage is not a big concern, I would just use the source strings as keys.

+4
source

For this purpose, I used the following:

 >>> from collections import defaultdict >>> d = defaultdict(lambda: len(d)) >>> d["cats"] 0 >>> d["cars"] 1 >>> d["cats"] 0 
+3
source

If they are stored in memory, and you compare each line as an object, and not as text, I would suggest using id(string) to get a unique integer. Also, if you store them in a dict, you can use defaultdict with a set of matches and hash them:

 >>> strings = 'a whole lot of strings which may share a hash'.split() >>> storage = defaultdict(set) >>> for s in strings: ... storage[hash(s)].add(s) >>> storage[hash('a')] {'a', 'a'} 

Exactly how you implement this depends on how you use them, but the basic idea should work. If you can post a concrete example of what you are trying to do, it may be easier to give a more detailed answer.

+1
source

dict is a great solution. If you have a way to generate a unique identifier based on a string identifier, you can have a double function as a hash function for a custom string class:

 class ID_String(str): cached_hash = None def __hash__(self): # custom hash code here return custom_hash def ID(self): if self.cached_hash is None: self.cached_hash = self.__hash__() return self.cached_hash 
+1
source

Source: https://habr.com/ru/post/910285/


All Articles