How do hash table indexes work?

I know about creating hash codes, collisions, the relationship between .GetHashCode and .Equals, etc.

What I don't quite understand is how a 32-bit hash code gets you ~O(1) lookup. If you had an array big enough to index every possible 32-bit value, you would get ~O(1), but that would be a huge waste of memory.

I assume that a small array is created inside the Hashtable class (say, 1K elements), the 32-bit hash is then reduced to a 3-digit index, and that index is used for the lookup. When the number of elements reaches a certain threshold (say 75%), the array is expanded to something like 10K elements and the internal indexes are recomputed as 4-digit numbers, still based on the 32-bit hash, of course.

By the way, I write ~O(1) here to account for possible collisions and their resolution.

Do I have the gist of this right, or am I completely off the mark?

+4
3 answers

I assume that a small array is created inside the Hashtable class (say, 1K elements), the 32-bit hash is then reduced to a 3-digit index, and that index is used for the lookup.

That is essentially what happens, except that the capacity (number of bins) of the table is usually chosen to be either a power of two or a prime number. The hash code is then taken modulo this number to find the bin into which to put the element. When the capacity is a power of two, the modulo operation reduces to a simple bitmask.
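For illustration, here is a minimal C# sketch of that mapping (my own example, not from the answer; the capacities 1031 and 1024 are arbitrary, and the sign-bit mask is there because GetHashCode can return a negative int):

    using System;

    class BinIndexDemo
    {
        static void Main()
        {
            object key = "example";
            int hash = key.GetHashCode();

            // Prime-sized table: plain modulo. Mask off the sign bit first so the
            // index is never negative.
            int primeCapacity = 1031;
            int binByModulo = (hash & 0x7FFFFFFF) % primeCapacity;

            // Power-of-two-sized table: the modulo collapses to a bitmask.
            int powerOfTwoCapacity = 1024;                    // 2^10
            int binByMask = hash & (powerOfTwoCapacity - 1);

            Console.WriteLine($"hash={hash}, mod bin={binByModulo}, mask bin={binByMask}");
        }
    }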

When the number of elements reaches a certain threshold (say 75%)

If you mean the Java Hashtable implementation, then yes. This is called the load factor. Other implementations may use 2/3 instead of 3/4.

it expands the array to something like 10K elements

In most implementations the capacity is not increased tenfold, but rather doubled (for power-of-two-sized hash tables) or multiplied by about 1.5 plus the distance to the next prime number.
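A toy C# sketch of those two growth policies (my own illustration, not any particular library's code; the starting capacities are arbitrary):

    using System;

    class GrowthDemo
    {
        // Power-of-two table: just double the capacity.
        static int NextPowerOfTwoCapacity(int capacity) => capacity * 2;

        // Prime-sized table: grow by ~1.5x, then walk up to the next prime.
        static int NextPrimeCapacity(int capacity)
        {
            int candidate = (int)(capacity * 1.5);
            while (!IsPrime(candidate)) candidate++;
            return candidate;
        }

        static bool IsPrime(int n)
        {
            if (n < 2) return false;
            for (int i = 2; i * i <= n; i++)
                if (n % i == 0) return false;
            return true;
        }

        static void Main()
        {
            Console.WriteLine(NextPowerOfTwoCapacity(1024)); // 2048
            Console.WriteLine(NextPrimeCapacity(1031));      // first prime >= 1546
        }
    }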

+7

A hash table has a number of bins that hold the items. The number of bins is quite small to begin with. Given a hash code, it simply uses hashcode modulo bincount to find the bin the item should be in. That is what makes lookup fast: finding the bin for an item is a single modulo operation.

Or in (pseudo)code:

    int hash = obj.GetHashCode();
    int binIndex = hash % binCount;
    // The item is in bin #binIndex. Go get the items there and find the one that matches.

Obviously, as you figured out, at some point the table has to grow. When it does, a new array of bins is created and the items in the table are redistributed into the new bins. That also means that growing the hash table can be slow. (Hence the ~O(1): inserts are O(1) in most cases, unless the insert triggers an internal resize. Lookup should always be ~O(1).)
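To make the grow-and-redistribute step concrete, here is a rough C# sketch (my own illustration, not the answer's code; the class name, bin count of 16 and load factor of 0.75 are made up). Every item has to be re-inserted because its bin index depends on the bin count:

    using System.Collections.Generic;

    class SimpleHashTable<TKey, TValue>
    {
        private List<KeyValuePair<TKey, TValue>>[] bins = new List<KeyValuePair<TKey, TValue>>[16];
        private int count;
        private const double LoadFactor = 0.75;

        private static int BinIndex(TKey key, int binCount) =>
            (key.GetHashCode() & 0x7FFFFFFF) % binCount;

        public void Add(TKey key, TValue value)
        {
            if (count >= bins.Length * LoadFactor)
                Grow();                                   // the occasionally-slow O(N) step
            int i = BinIndex(key, bins.Length);
            (bins[i] ??= new List<KeyValuePair<TKey, TValue>>())
                .Add(new KeyValuePair<TKey, TValue>(key, value));
            count++;
        }

        public bool TryGet(TKey key, out TValue value)
        {
            int i = BinIndex(key, bins.Length);
            if (bins[i] != null)
                foreach (var pair in bins[i])
                    if (EqualityComparer<TKey>.Default.Equals(pair.Key, key))
                    {
                        value = pair.Value;
                        return true;                      // ~O(1): one bin, few items in it
                    }
            value = default;
            return false;
        }

        private void Grow()
        {
            var old = bins;
            bins = new List<KeyValuePair<TKey, TValue>>[old.Length * 2];
            foreach (var bin in old)
                if (bin != null)
                    foreach (var pair in bin)
                    {
                        int i = BinIndex(pair.Key, bins.Length);
                        (bins[i] ??= new List<KeyValuePair<TKey, TValue>>()).Add(pair);
                    }
        }
    }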

+2

In general, there are a number of variations in how hash tables handle overflow.

Many (including Java, if memory serves) resize when the load factor (the percentage of bins in use) exceeds a certain threshold. The disadvantage of this is that the speed is not consistent: most inserts are O(1), but a few of them are O(N).

To mitigate that, some tables resize incrementally instead: when the load factor exceeds the magic number, they:

  • Create a second (larger) hash table.
  • Insert new items into the new hash table.
  • Move some items from the existing hash table to the new one.

Each subsequent insertion then moves another chunk of items from the old hash table to the new one. This preserves the average O(1) complexity, and it can be written so that the complexity of each individual insert is essentially constant: when the hash table becomes "full" (i.e., the load factor exceeds the trigger point), you double the size of the table. Then, on every insert, you add the new element and move one element from the old table to the new one. The old table will be completely drained at about the point the new table gets full again, so each insert involves exactly two operations: inserting one new element and moving one old one, which keeps the insert speed essentially constant.
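A simplified C# sketch of that policy (my own assumption about one way to code it, not a reference implementation; the two internal tables are plain .NET Dictionary instances just to keep the sketch short, and the threshold of 8 is arbitrary):

    using System.Collections.Generic;

    class IncrementalRehashTable<TKey, TValue>
    {
        private Dictionary<TKey, TValue> oldTable;            // table being drained, or null
        private Dictionary<TKey, TValue> current = new Dictionary<TKey, TValue>();
        private Queue<TKey> pendingMoves = new Queue<TKey>();
        private int threshold = 8;                            // toy "load factor exceeded" trigger

        public void Add(TKey key, TValue value)
        {
            if (oldTable == null && current.Count >= threshold)
            {
                // "Full": start draining the old table instead of rehashing everything at once.
                oldTable = current;
                current = new Dictionary<TKey, TValue>(oldTable.Count * 2);
                foreach (var k in oldTable.Keys) pendingMoves.Enqueue(k);
                threshold *= 2;
            }

            current[key] = value;                 // operation 1: insert the new item

            if (oldTable != null)                 // operation 2: move exactly one old item
            {
                var k = pendingMoves.Dequeue();
                if (!current.ContainsKey(k))      // don't resurrect a value that was overwritten
                    current[k] = oldTable[k];
                oldTable.Remove(k);
                if (pendingMoves.Count == 0) oldTable = null;
            }
        }

        public bool TryGet(TKey key, out TValue value) =>
            current.TryGetValue(key, out value) ||
            (oldTable != null && oldTable.TryGetValue(key, out value));
    }

With a doubled capacity and one migration per insert, the old table runs dry at roughly the same time the new one reaches its own trigger point, so the work per insert stays bounded.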

There are other strategies as well. One I particularly like is to make the hash table a table of balanced trees. With this, you usually ignore overflow entirely: as the hash table fills up, you just end up with more items in each tree. In theory this means the complexity is O(log N), but for any practical size it is proportional to log(N/M), where M is the number of buckets. For practical size ranges (say, up to a few billion elements) that is essentially constant (log N grows very slowly), and it is often a little faster for the largest table you can fit in memory, while losing out in speed at smaller sizes. The disadvantage is that it is only really worthwhile when the stored objects are reasonably large: if you stored (for example) one character per node, the overhead of two pointers (plus, usually, balance information) per node would be extremely high.
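A minimal C# sketch of the idea (hypothetical, my own names; SortedDictionary is a red-black tree in .NET, so each bucket is a balanced tree and the bucket count M stays fixed). It assumes the keys are comparable:

    using System;
    using System.Collections.Generic;

    class TreeBucketHashTable<TKey, TValue> where TKey : IComparable<TKey>
    {
        private readonly SortedDictionary<TKey, TValue>[] buckets;

        public TreeBucketHashTable(int bucketCount = 1024)
        {
            buckets = new SortedDictionary<TKey, TValue>[bucketCount];
            for (int i = 0; i < bucketCount; i++)
                buckets[i] = new SortedDictionary<TKey, TValue>();
        }

        private SortedDictionary<TKey, TValue> BucketFor(TKey key)
        {
            // Mask off the sign bit so the index is never negative.
            int index = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
            return buckets[index];
        }

        // No resizing ever happens; a fuller table just means slightly deeper trees.
        public void Add(TKey key, TValue value) => BucketFor(key)[key] = value;

        public bool TryGetValue(TKey key, out TValue value) =>
            BucketFor(key).TryGetValue(key, out value);
    }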

+1

Source: https://habr.com/ru/post/1382337/

