Why is it important to set the number of buckets and distribute objects as evenly as possible?
A HashSet must be able to determine membership in O(1) time on average. From the documentation:
This class offers constant time performance for the basic operations (add, remove, contains and size), assuming the hash function disperses the elements properly among the buckets.
To do this, the HashSet obtains an object's hash code and uses it to locate the corresponding bucket. It then walks through the elements in that bucket until it finds one that is equal. If a bucket contains more than O(1) elements, the lookup takes more than O(1) time.
In the worst case, when every element hashes to the same bucket, it takes O(n) time to determine whether an object is in the set.
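To make that worst case concrete, here is a minimal sketch. The BadKey class and its constant hashCode are invented for illustration: because every key reports the same hash code, all elements land in a single bucket and contains() degenerates into a linear scan.

```java
import java.util.HashSet;
import java.util.Set;

public class BucketDemo {
    // Hypothetical key whose hashCode sends every instance to the same bucket.
    static final class BadKey {
        final int id;
        BadKey(int id) { this.id = id; }

        @Override
        public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }

        @Override
        public int hashCode() {
            return 42; // every key collides: one bucket holds everything
        }
    }

    public static void main(String[] args) {
        Set<BadKey> set = new HashSet<>();
        for (int i = 0; i < 100_000; i++) {
            set.add(new BadKey(i));
        }
        // This lookup walks one long bucket instead of finishing in O(1).
        System.out.println(set.contains(new BadKey(99_999)));
    }
}
```

With a well-distributed hashCode (for example, returning id itself), the same lookup would touch only a handful of elements in its bucket.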
What is the ideal ratio of objects to buckets?
There is a trade-off between space and time. Increasing the number of buckets reduces the chance of collisions, but it also increases memory use. HashSet has two parameters, initialCapacity and loadFactor, that let you configure how many buckets the HashSet creates. The default load factor of 0.75 is fine for most purposes, but if you have special requirements you can choose a different value.
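As a small illustration of those two parameters, the sketch below constructs a HashSet with an explicit initial capacity and load factor; the values 64 and 0.5f are arbitrary examples, not recommendations.

```java
import java.util.HashSet;
import java.util.Set;

public class CapacityDemo {
    public static void main(String[] args) {
        // Default: 16 buckets, load factor 0.75.
        Set<String> defaults = new HashSet<>();

        // Tuned: start with 64 buckets and resize once the set is 50% full,
        // trading extra memory for fewer collisions.
        Set<String> tuned = new HashSet<>(64, 0.5f);

        defaults.add("a");
        tuned.add("a");
        System.out.println(defaults.size() + " " + tuned.size());
    }
}
```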
More information on these parameters can be found in the documentation for HashMap:
This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets. Iteration over collection views requires time proportional to the "capacity" of the HashMap instance (the number of buckets) plus its size (the number of key-value mappings). Thus, it's very important not to set the initial capacity too high (or the load factor too low) if iteration performance is important.
An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, its internal data structures are rebuilt) so that it has approximately twice the number of buckets.
As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.
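As a rough sketch of the sizing rule in that last paragraph, the following pre-sizes a HashMap so that the expected number of entries should fit without triggering a rehash. The names expectedEntries and the +1 margin are illustrative assumptions, not part of the API.

```java
import java.util.HashMap;
import java.util.Map;

public class PresizeDemo {
    public static void main(String[] args) {
        int expectedEntries = 10_000;
        float loadFactor = 0.75f;

        // Per the documentation above: if the initial capacity is greater than
        // expectedEntries / loadFactor, the map never needs to rehash while
        // holding expectedEntries mappings.
        int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor) + 1;

        Map<Integer, String> map = new HashMap<>(initialCapacity, loadFactor);
        for (int i = 0; i < expectedEntries; i++) {
            map.put(i, "value-" + i);
        }
        System.out.println(map.size());
    }
}
```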