Java hashing - structure and access time

I am looking for a check on two different but related arguments: (A), above the first comment line in this question, and (B), below it.

(A) HashMap structuring method:

At its core, a HashMap is a simple table (an array). Indexing into it is effectively direct memory access (DMA), i.e. constant time.

The whole idea of HashMap (or hashing in general) in the first place is to use this constant-time access to memory for

a.) accessing records by their own data content (<K, V>), and not by their location in the DMA table (the table index);

b.) managing a variable number of records - a number of records that is not specified in advance and may or may not stay constant while the structure is in use.

So, the general structure of Java's HashMap is:

a table: table // using the identifier from the HashMap source

each cell of this table is a bucket.

Every bucket is a linked list of type Entry - that is, each node of this linked list (not a java.util.LinkedList, but a hand-rolled linked data structure) is of type Entry, which in turn holds a <K, V> pair.

When a new <K, V> pair is put into the hash, a hashCode is calculated for its key. This hashCode is the key to the index of this <K, V> in the table - it determines which bucket this <K, V> will go into. Note: the hashCode is "normalized" by the hash() function (in HashMap, for one) to spread it better over the current table length. indexFor() is then used to determine which bucket, that is, which table cell, the <K, V> will go into.

When the bucket is determined, the <K, V> is added to the head of the linked list in that bucket - as a result, it is now the first <K, V> in the bucket, and the entry that was previously first in the list becomes the "next" entry pointed to by the newly added one.
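
To make the structure in (A) concrete, here is a minimal sketch of a put path along those lines. It is not the real OpenJDK code: the names table, hash(), indexFor(), and Entry mirror the identifiers used in the HashMap source, but the bodies below are simplified assumptions (duplicate keys, null keys, and resizing are ignored).

    // Minimal sketch only; a simplified model of the structure described above,
    // not the actual java.util.HashMap implementation.
    public class SimpleHashSketch<K, V> {

        // one node of the hand-rolled linked list that forms a bucket
        static class Entry<K, V> {
            final K key;
            V value;
            Entry<K, V> next;   // the entry that was previously first in the bucket

            Entry(K key, V value, Entry<K, V> next) {
                this.key = key;
                this.value = value;
                this.next = next;
            }
        }

        // the table: each cell is a bucket, i.e. the head of a linked list of Entry
        @SuppressWarnings("unchecked")
        Entry<K, V>[] table = (Entry<K, V>[]) new Entry[16];

        // "normalize" the raw hashCode so it spreads better over the table length
        // (spreading function assumed here for illustration)
        static int hash(int h) {
            h ^= (h >>> 20) ^ (h >>> 12);
            return h ^ (h >>> 7) ^ (h >>> 4);
        }

        // map the normalized hash to a bucket index for the current table length
        static int indexFor(int h, int length) {
            return h & (length - 1);    // valid because length is a power of two
        }

        public void put(K key, V value) {
            int h = hash(key.hashCode());
            int bucketIndex = indexFor(h, table.length);
            // insert at the head of the bucket's list; the previous head becomes "next"
            table[bucketIndex] = new Entry<>(key, value, table[bucketIndex]);
            // (a real implementation would first check whether the key already exists,
            //  handle null keys, and decide whether to resize)
        }
    }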

// ==================================================================

(B) From what I see in HashMap, resizing of the table - i.e. of the whole hash - is triggered only by a decision based on the hash's size and capacity, i.e. the current and maximum number of entries in the entire hash.

There is no restructuring or resizing based on individual bucket sizes - for example, no "resize() when the maximum number of entries in a bucket exceeds such-and-such".

It is therefore possible that a significant number of entries pile up in a single bucket while the rest of the hash is fairly empty.

If this is the case, that is, if there is no upper limit on the size of each bucket, then the hash has linear rather than constant access time, at least in theory: it takes $O(n)$ time to get an entry, where $n$ is the total number of entries. But that should not be so.
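
For reference, the size-based decision described here boils down to something like the following. This is a loose, self-contained sketch assuming the documented default load factor of 0.75; it is not the actual HashMap source, and the names are chosen for illustration.

    // Loose sketch of the growth rule described in (B); illustrative only.
    class SizeBasedResizeSketch {
        Object[] table = new Object[16];               // the buckets
        final float loadFactor = 0.75f;                // HashMap's documented default
        int threshold = (int) (table.length * loadFactor);
        int size;                                      // total entries in the whole map

        void afterInsert() {
            size++;
            // The only growth trigger: total entry count vs. capacity * load factor.
            // No individual bucket is ever inspected here.
            if (size >= threshold) {
                table = new Object[table.length * 2];  // double (and, in reality, rehash)
                threshold = (int) (table.length * loadFactor);
            }
        }
    }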

// ==================================================================

I don’t think I missed anything in part (A) above.

I'm not quite sure about part (B). If it holds, it is a serious problem, and I want to find out how accurate this argument is.

I am looking for verification on both parts.

Thanks in advance.

// ==================================================================

EDIT:

If the maximum bucket size were fixed - i.e. if the hash were restructured every time the number of entries in any bucket reached that maximum - then access time would be constant both in theory and in use.

This is not a well-structured solution, just a quick fix, but it would work fine to guarantee constant access.
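
As a thought experiment, the quick fix could look roughly like the sketch below. This is hypothetical: java.util.HashMap does not do this, and the names and limits are invented for illustration.

    // Hypothetical only: grow the table as soon as ANY single bucket exceeds a
    // fixed maximum length. This is NOT what java.util.HashMap does.
    class BucketCappedTableSketch {
        static final int MAX_BUCKET_SIZE = 8;          // invented per-bucket limit

        int[] bucketLengths = new int[16];              // stand-in for the real buckets

        void afterInsertInto(int bucketIndex) {
            bucketLengths[bucketIndex]++;
            if (bucketLengths[bucketIndex] > MAX_BUCKET_SIZE) {
                // double the table and redistribute; with a capped bucket length the
                // per-lookup scan stays bounded by a constant.
                bucketLengths = new int[bucketLengths.length * 2];
            }
        }
    }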

In practice, hash codes are likely to be evenly distributed across the buckets, so it is unlikely that any bucket would hit such a per-bucket maximum before the threshold on the total hash size is reached. The current HashMap implementation also relies on this assumption.

This is also based on Peter Lawrey's discussion in the answers below.

2 answers

HashMap collisions are only a problem in pathological cases such as denial of service attacks.

In Java 7, you can change the hashing strategy so that outsiders cannot predict your hashing algorithm.

AFAIK, in Java 8 HashMap will use a tree instead of a linked list for collisions on String keys. This means a worst case of O(log n) instead of O(n) access time.


You want to increase the size of the table when too much ends up in the same bucket - and note that the bucket mapping changes when the table is resized.

Your idea sounds good. It is completely valid and basically what HashMap already does when the table size is smaller than desired / the average number of elements per bucket grows too large. It just does not do this by looking at each bucket and checking whether there are too many entries in it, because that is easy to determine from the total entry count alone.

The implementation of HashMap.get() in OpenJDK looks like this:

    public V get(Object key) {
        if (key == null)
            return getForNullKey();
        int hash = hash(key.hashCode());
        for (Entry<K,V> e = table[indexFor(hash, table.length)];
             e != null;
             e = e.next) {
            Object k;
            if (e.hash == hash &&
                ((k = e.key) == key || key.equals(k)))
                return e.value;
        }
        return null;
    }

This shows pretty well how the HashMap finds elements, but it is written in a rather confusing way. After a little renaming, commenting, and rewriting, it could look something like this:

    public V get(Object key) {
        if (key == null)
            return getForNullKey();

        // get the key's hash & try to fix the distribution.
        // -> this can turn every 42 that goes in into a 9,
        //    but it cannot turn it into a 9 one time and an 8 another time
        int hash = hash(key.hashCode());

        // calculate the bucket index; the same hash must result in the same index,
        // since the table length is fixed at this point.
        int bucketIndex = indexFor(hash, table.length);

        // we have just found the right bucket. O(1) so far.
        // and this is the whole point of hash-based lookup:
        // instantly knowing the nearly exact position where to find the element.

        // next, see if the key is found in the bucket -> get the list in the bucket
        LinkedList<Entry> bucketContentList = table[bucketIndex];

        // check each element, worst case O(n) time if everything is in this bucket.
        for (Entry entry : bucketContentList) {
            if (entry.key.equals(key))
                return entry.value;
        }
        return null;
    }

We see here that the bucket really depends on both the .hashCode() returned by each key object and the current table size. The bucket will usually change when the table is resized - but only for keys whose .hashCode()s are different; entries with identical hash codes always map to the same bucket as each other.

If you had a huge table with 2^32 slots, you could just say bucketIndex = key.hashCode(), and that would be as perfect as it could be. Unfortunately, there is not enough memory for that, so you have to use fewer buckets and map the 2^32 possible hashes onto just a few buckets. That is essentially what indexFor does: mapping a large number space onto a small one.
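
For what it's worth, in the OpenJDK sources of that era this mapping is essentially a bitwise AND against the table length; the version below is quoted from memory, so treat it as approximate:

    // indexFor maps the large hash space (2^32 values) onto the small table:
    // because the table length is always a power of two, ANDing with (length - 1)
    // keeps only the low bits of the hash and always yields a valid index.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    // e.g. indexFor(50, 16) == 2, but indexFor(50, 32) == 18:
    // the bucket an entry lands in can change when the table is resized.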

This is perfectly fine in the typical case where (almost) no object has the same .hashCode() as any other. But one thing you should not do with HashMaps is add only items with exactly the same hash.

If every hash is the same, the hash-based lookup always ends in the same bucket, and your whole HashMap has effectively become a LinkedList (or whatever data structure holds a bucket's elements). And now you have the worst-case O(n) access time, because you need to iterate over all n elements.
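
To see this degenerate case in action, one can construct keys with a constant hashCode(). A small, self-contained sketch (the class name and sizes are made up for illustration):

    import java.util.HashMap;
    import java.util.Map;

    // Every BadKey reports the same hash, so every entry lands in the same bucket:
    // a lookup then has to scan that one bucket, i.e. O(n) on a pre-Java-8 HashMap
    // (Java 8+ mitigates this by turning large buckets into trees).
    public class CollidingKeyDemo {

        static final class BadKey {
            final int id;
            BadKey(int id) { this.id = id; }

            @Override public int hashCode() { return 42; }   // constant on purpose
            @Override public boolean equals(Object o) {
                return o instanceof BadKey && ((BadKey) o).id == id;
            }
        }

        public static void main(String[] args) {
            Map<BadKey, Integer> map = new HashMap<>();
            for (int i = 0; i < 10_000; i++) {
                map.put(new BadKey(i), i);
            }
            long start = System.nanoTime();
            map.get(new BadKey(9_999));          // has to work through the collisions
            System.out.println("lookup took ~" + (System.nanoTime() - start) + " ns");
        }
    }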


Source: https://habr.com/ru/post/1494825/

