Size-efficient dictionary (associative array)

What algorithms are available for a size-efficient dictionary or associative array? For example, given this set of keys and values, how can I avoid duplicating "Alice" among the values?

{ "Pride and Prejudice": "Alice", "The Brothers Karamazov": "Pat", "Wuthering Heights": "Alice" } 

I looked at Python's dictionary implementation, but it seems to be optimized for speed (keeping lookups O(1)), not for size.

3 answers

As bennofs mentioned in the comments, you can use intern() so that identical strings are stored only once:

 class InternDict(dict):
     def __setitem__(self, key, value):
         if isinstance(value, str):
             # intern() is a builtin in Python 2; in Python 3 it is sys.intern()
             super(InternDict, self).__setitem__(key, intern(value))
         else:
             super(InternDict, self).__setitem__(key, value)

Here is an example of the effect this has:

 >>> d = {}
 >>> d["a"] = "This string is presumably too long to be auto-interned."
 >>> d["b"] = "This string is presumably too long to be auto-interned."
 >>> d["a"] is d["b"]
 False
 >>> di = InternDict()
 >>> di["a"] = "This string is presumably too long to be auto-interned."
 >>> di["b"] = "This string is presumably too long to be auto-interned."
 >>> di["a"] is di["b"]
 True
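In Python 3, intern() moved to sys.intern(), so the same idea can be sketched as follows (the long test strings are built at runtime so they start out as distinct objects):

```python
import sys

class InternDict(dict):
    """Dict that interns string values so equal strings share one object."""
    def __setitem__(self, key, value):
        if isinstance(value, str):
            value = sys.intern(value)
        super().__setitem__(key, value)

d = InternDict()
# join at runtime so the two strings are separate objects before interning
d["Pride and Prejudice"] = "".join(["Al", "ice"])
d["Wuthering Heights"] = "".join(["Al", "ice"])
print(d["Pride and Prejudice"] is d["Wuthering Heights"])  # True
```

After interning, both dictionary slots reference the same string object, so the duplicated value is stored only once.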

One way to improve space efficiency (in addition to sharing values, which, as bennofs points out in the comments, you can probably do efficiently with sys.intern) is to use hopscotch hashing, an open addressing scheme (a variant of linear probing) for resolving collisions. Closed addressing schemes use more space because you need to allocate a linked list for each bucket, whereas with an open addressing scheme you simply use an open neighboring slot in the table's array, without allocating any linked lists. Unlike other open addressing schemes (such as cuckoo hashing or vanilla linear probing), hopscotch hashing performs well at a high load factor (over 90%) while still guaranteeing constant lookup time.
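As a rough illustration of the scheme, here is a minimal Python sketch. The class name HopscotchTable, the fixed capacity, and the neighborhood size H are all invented for this example; a real implementation resizes on failure and keeps a per-bucket bitmap of neighborhood occupancy instead of re-hashing candidates as this sketch does:

```python
class HopscotchTable:
    H = 4  # neighborhood size: every key lives within H slots of its home

    def __init__(self, capacity=16):
        self.cap = capacity
        self.slots = [None] * capacity  # each slot holds (key, value) or None

    def _home(self, key):
        return hash(key) % self.cap

    def get(self, key):
        # constant-time lookup: only H slots ever need to be checked
        home = self._home(key)
        for i in range(self.H):
            slot = self.slots[(home + i) % self.cap]
            if slot is not None and slot[0] == key:
                return slot[1]
        raise KeyError(key)

    def put(self, key, value):
        home = self._home(key)
        # update in place if the key is already present
        for i in range(self.H):
            idx = (home + i) % self.cap
            if self.slots[idx] is not None and self.slots[idx][0] == key:
                self.slots[idx] = (key, value)
                return
        # linear-probe for the first free slot
        dist = 0
        while dist < self.cap and self.slots[(home + dist) % self.cap] is not None:
            dist += 1
        if dist == self.cap:
            raise RuntimeError("table full; a real implementation resizes here")
        # "hop" the free slot backwards until it lies in the key's neighborhood
        while dist >= self.H:
            free = (home + dist) % self.cap
            for d in range(self.H - 1, 0, -1):
                cand = (free - d) % self.cap
                ck, _ = self.slots[cand]
                # the candidate may move only if `free` stays in its own neighborhood
                if (free - self._home(ck)) % self.cap < self.H:
                    self.slots[free] = self.slots[cand]
                    self.slots[cand] = None
                    dist -= d
                    break
            else:
                raise RuntimeError("no displacement possible; a real implementation resizes")
        self.slots[(home + dist) % self.cap] = (key, value)
```

The memory win is that the table is one flat array of slots: there are no per-bucket linked-list nodes, yet lookups still touch at most H slots.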

  • If your dictionary fits in memory, you can use a simple hash table.

Try inserting each key-value pair into a hash table. If the key already exists before the insert, you have found a duplicate. There are hash table implementations in many languages.
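For the original question (sharing duplicated values rather than keys), the same hash-table idea gives a one-pass deduplication: keep a table of values already seen and reuse the stored object on a hit. A minimal sketch (dedupe_values is a hypothetical helper name; the strings are joined at runtime so the duplicates start as distinct objects):

```python
def dedupe_values(pairs):
    """Build a dict whose equal values share one underlying object."""
    seen = {}    # value -> first object stored for that value
    result = {}
    for key, value in pairs:
        # setdefault stores `value` on a miss and returns the previously
        # stored object on a hit, so duplicates collapse to one object
        result[key] = seen.setdefault(value, value)
    return result

books = dedupe_values([
    ("Pride and Prejudice", "".join(["Al", "ice"])),
    ("The Brothers Karamazov", "Pat"),
    ("Wuthering Heights", "".join(["Al", "ice"])),
])
print(books["Pride and Prejudice"] is books["Wuthering Heights"])  # True
```

This is essentially what sys.intern does for strings, but it works for any hashable value type.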

There are basically two types: array-based and tree-based.

  • Arrays focus on speed at a high memory cost. One main difference between hash table implementations is their behavior regarding uniqueness: some implementations enforce key uniqueness, others do not.

  • Trees focus on smart memory usage at the cost of O(log(n)) access. The g++ std::map relies on the full power of the red-black tree.

If size is the real problem, you should look into Huffman coding and/or Lempel-Ziv compression, though it takes a bit more effort to adapt them to a dictionary.
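As a sketch of that idea in Python: zlib's DEFLATE format combines LZ77 (a Lempel-Ziv variant) with Huffman coding, so compressing values on the way in applies both in one step. CompressedDict is an invented name, and this sketch only covers set and get (no deletion, iteration, or non-string values):

```python
import zlib

class CompressedDict(dict):
    """Sketch: store string values DEFLATE-compressed (LZ77 + Huffman)."""
    def __setitem__(self, key, value):
        super().__setitem__(key, zlib.compress(value.encode("utf-8")))
    def __getitem__(self, key):
        return zlib.decompress(super().__getitem__(key)).decode("utf-8")

d = CompressedDict()
d["novel"] = "a very repetitive text " * 200
stored = dict.__getitem__(d, "novel")         # the raw compressed bytes
print(len(stored) < len("a very repetitive text " * 200))  # True
print(d["novel"] == "a very repetitive text " * 200)       # True: round-trips
```

The trade-off is CPU time on every access, which is why compression only pays off when values are large or highly repetitive.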

  • If your dictionary cannot fit in memory

You should look at a database. The database counterpart of the red-black tree is (roughly) the B-tree: its large branching factor is an optimization for the high-latency hard-drive case.
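In Python, for example, the standard sqlite3 module gives a disk-backed dictionary almost for free, since SQLite stores each table and index as an on-disk B-tree. A minimal sketch (the kv table name is arbitrary; :memory: is used here so the example is self-contained, but a file path would give real on-disk storage):

```python
import sqlite3

# SQLite keeps tables and indexes in B-trees, so a two-column table
# with a primary key behaves like a persistent associative array.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO kv VALUES (?, ?)", [
    ("Pride and Prejudice", "Alice"),
    ("The Brothers Karamazov", "Pat"),
    ("Wuthering Heights", "Alice"),
])
row = conn.execute("SELECT value FROM kv WHERE key = ?",
                   ("Wuthering Heights",)).fetchone()
print(row[0])  # Alice
```

Lookups go through the primary-key B-tree index, so they stay O(log n) in disk pages rather than loading the whole dictionary into memory.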

I have added many Wikipedia links above, but if you liked this question, I also recommend:

Introduction to algorithms


Source: https://habr.com/ru/post/1490531/
