Hash table implementation

I started reading about how various data structures are implemented a few days ago, got to hash tables, and got stuck at a certain point.

My understanding of how a hash table is implemented: a key K is passed to a hash function H, which returns a hashed version of K, call it HK. HK should probably be at least a uint32_t for conflict resolution. We have an array of size X, and the element is stored at index HK of this array. But doesn't this require a pre-allocated array of length at least 2^32 (or whatever range H returns)? Assuming we don't store the data in this array itself but store pointers to the data instead, we would need an intptr_t array of that length. This seems pretty wasteful: on 64-bit, that would mean 2^32 * 8 = 34,359,738,368 bytes, or ~32 GB, just for the array of data pointers, which is obviously not how it is actually implemented in real life.

So what am I missing?

+4
4 answers

It depends on the implementation. This is done in three main ways:

1) Small hashes are used. So instead of a 32-bit hash, for example, an 8-bit hash is used.

2) Several hash levels are used. For example, a 12-bit hash can determine which bucket a record goes in, but a collision only occurs when the full 32-bit hashes match. Each bucket is stored as a linked list or similar structure (possibly one optimized for searching by the full 32-bit hash inside it); see the sketch after this list.

3) Sparse arrays are used. These are data structures that do not need to physically store the empty slots. (In practice the underlying structure may be something completely different, such as a tree, but it behaves like a sparse array with efficient lookup.)
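
For illustration, a minimal C++ sketch of approach 2, assuming 4096 buckets addressed by the low 12 bits of the hash, with chained nodes that keep the full 32-bit hash (all names here are made up for the example):

    #include <cstdint>
    #include <string>
    #include <vector>

    // Each chain node keeps the full 32-bit hash, so a "real" collision
    // only happens when all 32 bits (and the key itself) match.
    struct Node {
        uint32_t    full_hash;
        std::string key;
        std::string value;
        Node*       next;
    };

    class TwoLevelTable {
    public:
        TwoLevelTable() : buckets_(1u << 12, nullptr) {}   // 4096 buckets, not 2^32

        std::string* find(const std::string& key, uint32_t full_hash) {
            uint32_t index = full_hash & 0xFFFu;           // low 12 bits pick the bucket
            for (Node* n = buckets_[index]; n != nullptr; n = n->next)
                if (n->full_hash == full_hash && n->key == key)
                    return &n->value;
            return nullptr;
        }

    private:
        std::vector<Node*> buckets_;
    };

So instead of 2^32 slots, you only pre-allocate 2^12 pointers, and the full hash is still used to resolve conflicts inside each bucket.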

+4

You must design your hash table so that it can be expanded. There are several ways to do this; read this, it will be helpful. This example uses a linked list. You also need to expand your table when it runs out of empty slots. You will then face the following problem: after you expand your map, your H function may produce new HK values for the old K keys (the slot index derived from HK changes with the table size). So you should consider how to solve this. One way is to rehash all values when the table has been expanded. This is fine as long as you do not expand it often.
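
To make the rehashing idea concrete, here is a rough sketch (the chaining layout with std::list is my assumption, not taken from the linked example): after the table grows, every entry has to be reinserted, because the slot index is computed from the current table size.

    #include <cstddef>
    #include <functional>
    #include <list>
    #include <string>
    #include <utility>
    #include <vector>

    using Entry = std::pair<std::string, int>;
    using Table = std::vector<std::list<Entry>>;

    // Double the table and reinsert everything: key -> slot depends on
    // the table size, so old slot indices are no longer valid.
    void expand(Table& table) {
        Table bigger(table.size() * 2);
        for (auto& bucket : table) {
            for (auto& entry : bucket) {
                std::size_t slot =
                    std::hash<std::string>{}(entry.first) % bigger.size();
                bigger[slot].push_back(std::move(entry));
            }
        }
        table = std::move(bigger);
    }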

+2

Actually, you have an array of some smaller, fixed size, and on collisions you use either chaining (colliding entries end up in a linked list) or probing (simplest example: if the slot at hash(x) is occupied, try hash(x)+1). You take your uint32 hash and mod it by the number of buckets; that is the simplest case.

You can define a load factor: as soon as you fill N% of the array, say, you double the array's size and rehash everything into the new array. Somewhere between 50% and 75% occupancy is typical.

Isn't that expensive, you say? Well, not quite. Say you double the array's size every time. Then as you add N elements, the last one triggers a copy: N additions at O(1) each, plus one O(N) copy. But O(N)/N averages out to O(1), so the amortized cost of an add is still O(1), provided your load factor is chosen wisely.
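
A minimal sketch of this scheme, assuming open addressing with linear probing, integer keys used directly as the hash, a 70% load-factor threshold, and no deletion support:

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <utility>
    #include <vector>

    class ProbingTable {
    public:
        ProbingTable() : slots_(8) {}

        void insert(uint32_t key, int value) {
            if ((count_ + 1) * 10 > slots_.size() * 7)     // load would exceed 70%
                grow();                                    // double and rehash
            std::size_t i = key % slots_.size();
            while (slots_[i] && slots_[i]->first != key)   // slot taken: try the next
                i = (i + 1) % slots_.size();
            if (!slots_[i]) ++count_;
            slots_[i].emplace(key, value);
        }

    private:
        void grow() {
            auto old = std::move(slots_);
            slots_.assign(old.size() * 2, std::nullopt);
            count_ = 0;
            for (auto& s : old)
                if (s) insert(s->first, s->second);        // rehash into the bigger array
        }

        std::vector<std::optional<std::pair<uint32_t, int>>> slots_;
        std::size_t count_ = 0;
    };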

+2

A typical implementation of a hash table is an array of linked lists. The linked list can easily be swapped for another data structure, so we will call it Bucket from now on.

The idea is simple:

    #include <vector>

    struct Bucket;   // e.g. a linked list of entries, defined elsewhere

    class HashTable {
    public:
        // public interface (insert, find, ...) omitted
    private:
        std::vector<Bucket*> _array;
    };

Then you take HK and reduce it to fit within the array, usually with a modulo: HK % _array.size(), which gives the index of the bucket to use.
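
For example, a lookup under this scheme might start like this (Bucket's internals are not shown in the answer, so the search inside it is left abstract):

    // Reduce the 32-bit hash to a valid index, then search the bucket,
    // which stays short if the load factor is kept reasonable.
    Bucket* bucket = _array[HK % _array.size()];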

+1

Source: https://habr.com/ru/post/1391984/
