Size-efficient dictionary (associative array)

What algorithms are available for a size-efficient dictionary or associative array? For example, given this set of keys and values, how can I avoid duplicating "Alice" among the values?

{ "Pride and Prejudice": "Alice", "The Brothers Karamazov": "Pat", "Wuthering Heights": "Alice" } 

I looked at Python's dictionary implementation, but it seems to be optimized for speed (keeping lookups O(1)), not for size.

3 answers

As bennofs mentioned in the comments, you can use intern() so that identical strings are stored only once:

 class InternDict(dict):
     def __setitem__(self, key, value):
         if isinstance(value, str):
             # intern() is a builtin in Python 2; in Python 3 it is sys.intern()
             super(InternDict, self).__setitem__(key, intern(value))
         else:
             super(InternDict, self).__setitem__(key, value)

Here is an example of the effect this has:

 >>> d = {}
 >>> d["a"] = "This string is presumably too long to be auto-interned."
 >>> d["b"] = "This string is presumably too long to be auto-interned."
 >>> d["a"] is d["b"]
 False
 >>> di = InternDict()
 >>> di["a"] = "This string is presumably too long to be auto-interned."
 >>> di["b"] = "This string is presumably too long to be auto-interned."
 >>> di["a"] is di["b"]
 True
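In Python 3, intern() moved to sys.intern(), so the same idea can be sketched as follows (the long test strings are built at runtime so they start out as distinct objects):

```python
import sys

class InternDict(dict):
    """Dict that interns string values so equal strings share one object."""
    def __setitem__(self, key, value):
        if isinstance(value, str):
            value = sys.intern(value)
        super().__setitem__(key, value)

d = InternDict()
# join at runtime so the two strings are separate objects before interning
d["Pride and Prejudice"] = "".join(["Al", "ice"])
d["Wuthering Heights"] = "".join(["Al", "ice"])
print(d["Pride and Prejudice"] is d["Wuthering Heights"])  # True
```

After interning, both dictionary slots reference the same string object, so the duplicated value is stored only once.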

One way to improve space efficiency (in addition to sharing values, which, as bennofs points out in the comments, you can probably do efficiently with sys.intern) is to use hopscotch hashing, an open addressing scheme (a variant of linear probing) for resolving collisions. Closed addressing schemes use more space because you need to allocate a linked list for each bucket, whereas with an open addressing scheme you simply use an open neighboring slot in the table's array, without allocating any linked lists. Unlike other open addressing schemes (such as cuckoo hashing or vanilla linear probing), hopscotch hashing performs well at a high load factor (over 90%) while still guaranteeing constant lookup time.
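As a rough illustration of the scheme, here is a minimal Python sketch. The class name HopscotchTable, the fixed capacity, and the neighborhood size H are all invented for this example; a real implementation resizes on failure and keeps a per-bucket bitmap of neighborhood occupancy instead of re-hashing candidates as this sketch does:

```python
class HopscotchTable:
    H = 4  # neighborhood size: every key lives within H slots of its home

    def __init__(self, capacity=16):
        self.cap = capacity
        self.slots = [None] * capacity  # each slot holds (key, value) or None

    def _home(self, key):
        return hash(key) % self.cap

    def get(self, key):
        # constant-time lookup: only H slots ever need to be checked
        home = self._home(key)
        for i in range(self.H):
            slot = self.slots[(home + i) % self.cap]
            if slot is not None and slot[0] == key:
                return slot[1]
        raise KeyError(key)

    def put(self, key, value):
        home = self._home(key)
        # update in place if the key is already present
        for i in range(self.H):
            idx = (home + i) % self.cap
            if self.slots[idx] is not None and self.slots[idx][0] == key:
                self.slots[idx] = (key, value)
                return
        # linear-probe for the first free slot
        dist = 0
        while dist < self.cap and self.slots[(home + dist) % self.cap] is not None:
            dist += 1
        if dist == self.cap:
            raise RuntimeError("table full; a real implementation resizes here")
        # "hop" the free slot backwards until it lies in the key's neighborhood
        while dist >= self.H:
            free = (home + dist) % self.cap
            for d in range(self.H - 1, 0, -1):
                cand = (free - d) % self.cap
                ck, _ = self.slots[cand]
                # the candidate may move only if `free` stays in its own neighborhood
                if (free - self._home(ck)) % self.cap < self.H:
                    self.slots[free] = self.slots[cand]
                    self.slots[cand] = None
                    dist -= d
                    break
            else:
                raise RuntimeError("no displacement possible; a real implementation resizes")
        self.slots[(home + dist) % self.cap] = (key, value)
```

The memory win is that the table is one flat array of slots: there are no per-bucket linked-list nodes, yet lookups still touch at most H slots.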

  • If your dictionary fits in memory, you can use a simple hash table.

Try inserting each key-value pair into a hash table. If the key already exists before the insert, you have found a duplicate. There are hash table implementations in many languages.
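For the original question (sharing duplicated values rather than keys), the same hash-table idea gives a one-pass deduplication: keep a table of values already seen and reuse the stored object on a hit. A minimal sketch (dedupe_values is a hypothetical helper name; the strings are joined at runtime so the duplicates start as distinct objects):

```python
def dedupe_values(pairs):
    """Build a dict whose equal values share one underlying object."""
    seen = {}    # value -> first object stored for that value
    result = {}
    for key, value in pairs:
        # setdefault stores `value` on a miss and returns the previously
        # stored object on a hit, so duplicates collapse to one object
        result[key] = seen.setdefault(value, value)
    return result

books = dedupe_values([
    ("Pride and Prejudice", "".join(["Al", "ice"])),
    ("The Brothers Karamazov", "Pat"),
    ("Wuthering Heights", "".join(["Al", "ice"])),
])
print(books["Pride and Prejudice"] is books["Wuthering Heights"])  # True
```

This is essentially what sys.intern does for strings, but it works for any hashable value type.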

There are basically two types: array-based and tree-based.

  • Arrays focus on speed at a high memory cost. One main difference between hash table implementations is their behavior regarding uniqueness: some implementations enforce key uniqueness, others do not.

  • Trees focus on smart memory usage at the cost of O(log(n)) access. The g++ std::map relies on the full power of the red-black tree.

If size is the real problem, you should look into Huffman coding and/or Lempel-Ziv compression, though it takes a bit more effort to adapt them to a dictionary.
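As a sketch of that idea in Python: zlib's DEFLATE format combines LZ77 (a Lempel-Ziv variant) with Huffman coding, so compressing values on the way in applies both in one step. CompressedDict is an invented name, and this sketch only covers set and get (no deletion, iteration, or non-string values):

```python
import zlib

class CompressedDict(dict):
    """Sketch: store string values DEFLATE-compressed (LZ77 + Huffman)."""
    def __setitem__(self, key, value):
        super().__setitem__(key, zlib.compress(value.encode("utf-8")))
    def __getitem__(self, key):
        return zlib.decompress(super().__getitem__(key)).decode("utf-8")

d = CompressedDict()
d["novel"] = "a very repetitive text " * 200
stored = dict.__getitem__(d, "novel")         # the raw compressed bytes
print(len(stored) < len("a very repetitive text " * 200))  # True
print(d["novel"] == "a very repetitive text " * 200)       # True: round-trips
```

The trade-off is CPU time on every access, which is why compression only pays off when values are large or highly repetitive.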

  • If your dictionary cannot fit in memory

You should look at a database. The database counterpart of the red-black tree is (roughly) the B-tree: its large branching factor is an optimization for the high-latency hard-drive case.
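In Python, for example, the standard sqlite3 module gives a disk-backed dictionary almost for free, since SQLite stores each table and index as an on-disk B-tree. A minimal sketch (the kv table name is arbitrary; :memory: is used here so the example is self-contained, but a file path would give real on-disk storage):

```python
import sqlite3

# SQLite keeps tables and indexes in B-trees, so a two-column table
# with a primary key behaves like a persistent associative array.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO kv VALUES (?, ?)", [
    ("Pride and Prejudice", "Alice"),
    ("The Brothers Karamazov", "Pat"),
    ("Wuthering Heights", "Alice"),
])
row = conn.execute("SELECT value FROM kv WHERE key = ?",
                   ("Wuthering Heights",)).fetchone()
print(row[0])  # Alice
```

Lookups go through the primary-key B-tree index, so they stay O(log n) in disk pages rather than loading the whole dictionary into memory.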

I have added many Wikipedia links above, but if you liked this question, I also recommend:

Introduction to algorithms


Source: https://habr.com/ru/post/1490531/
