Why is it important to set the number of buckets and distribute objects as evenly as possible?
A HashSet must be able to determine membership in O(1) time on average. From the documentation:
This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets.
To do this, the HashSet obtains an object's hash code and uses it to locate the corresponding bucket. It then walks through the elements in that bucket until it finds one that is equal. If a bucket contains more than O(1) elements, the lookup takes more than O(1) time.
In the worst case, when every element hashes to the same bucket, it takes O(n) time to determine whether an object is in the set.
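To make that worst case concrete, here is a minimal sketch. The BadKey class and its constant hashCode are invented for illustration: because every key reports the same hash code, all elements land in a single bucket and contains() degenerates into a linear scan.

```java
import java.util.HashSet;
import java.util.Set;

public class BucketDemo {
    // Hypothetical key whose hashCode sends every instance to the same bucket.
    static final class BadKey {
        final int id;
        BadKey(int id) { this.id = id; }

        @Override
        public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }

        @Override
        public int hashCode() {
            return 42; // every key collides: one bucket holds everything
        }
    }

    public static void main(String[] args) {
        Set<BadKey> set = new HashSet<>();
        for (int i = 0; i < 100_000; i++) {
            set.add(new BadKey(i));
        }
        // This lookup walks one long bucket instead of finishing in O(1).
        System.out.println(set.contains(new BadKey(99_999)));
    }
}
```

With a well-distributed hashCode (for example, returning id itself), the same lookup would touch only a handful of elements in its bucket.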
What is the ideal ratio of objects to buckets?
There is a trade-off between space and time. Increasing the number of buckets reduces the chance of collisions, but it also increases memory use. HashSet has two parameters, initialCapacity and loadFactor, that let you configure how many buckets the HashSet creates. The default load factor of 0.75 is fine for most purposes, but if you have special requirements you can choose a different value.
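As a small illustration of those two parameters, the sketch below constructs a HashSet with an explicit initial capacity and load factor; the values 64 and 0.5f are arbitrary examples, not recommendations.

```java
import java.util.HashSet;
import java.util.Set;

public class CapacityDemo {
    public static void main(String[] args) {
        // Default: 16 buckets, load factor 0.75.
        Set<String> defaults = new HashSet<>();

        // Tuned: start with 64 buckets and resize once the set is 50% full,
        // trading extra memory for fewer collisions.
        Set<String> tuned = new HashSet<>(64, 0.5f);

        defaults.add("a");
        tuned.add("a");
        System.out.println(defaults.size() + " " + tuned.size());
    }
}
```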
More information on these parameters can be found in the documentation for HashMap:
This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets. Iteration over collection views requires time proportional to the "capacity" of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings). Thus, it's very important not to set the initial capacity too high (or the load factor too low) if iteration performance is important.
An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, its internal data structures are rebuilt) so that it has approximately twice the number of buckets.
As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.
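As a rough sketch of the sizing rule in that last paragraph, the following pre-sizes a HashMap so that the expected number of entries should fit without triggering a rehash. The names expectedEntries and the +1 margin are illustrative assumptions, not part of the API.

```java
import java.util.HashMap;
import java.util.Map;

public class PresizeDemo {
    public static void main(String[] args) {
        int expectedEntries = 10_000;
        float loadFactor = 0.75f;

        // Per the documentation above: if the initial capacity is greater than
        // expectedEntries / loadFactor, the map never needs to rehash while
        // holding expectedEntries mappings.
        int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor) + 1;

        Map<Integer, String> map = new HashMap<>(initialCapacity, loadFactor);
        for (int i = 0; i < expectedEntries; i++) {
            map.put(i, "value-" + i);
        }
        System.out.println(map.size());
    }
}
```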