Why am I getting so many collisions in my simple closed-hashing set?

I have my own hash-set / open-addressing class (i.e., no linked lists). It is very specific to my needs: it is not general-purpose (it only stores positive long numbers), it requires the number of inserted records to be known in advance, and it does not support deletion — but in return it should be as small as possible.

Since it has so little functionality, it is really a small and simple class. However, for some reason, when I insert a lot of records, the number of collisions grows far too fast.

Some code (Java):

    public class MyHashSet {
        private long[] _entries;

        public MyHashSet(int numOfEntries) {
            int neededSize = (int)(numOfEntries / 0.65D);
            _entries = new long[neededSize];
        }

        public void add(long num) {
            int cell = ((Long) (num % _entries.length)).intValue();
            while (_entries[cell] != 0) {
                if (++cell >= _entries.length)
                    cell = 0;
            }
            _entries[cell] = num;
        }
        ...

I have a main method that initializes a MyHashSet object with 10 million as the parameter and then calls add() 10 million times, each time with a freshly generated random (but positive) long number. While with a regular Java HashSet this insertion takes about a second overall, with MyHashSet it takes about 13 seconds to complete. I added a counter for collisions, and indeed, at 3-6 billion collisions the count is far higher than expected (I would expect roughly 30-40 million).

Am I doing something wrong? Is there something wrong with the hashing? Why are there so many collisions, and what can I do about it?

Thanks!

PS: The 0.65 in the code means the table will only ever be filled to 65%, which, as far as I know, should work well with closed hashing. That said, even if I set it to 20%, the insertion still takes >10 seconds.

- EDIT -

This is rather embarrassing, but my test code was recreating the Random object (with System.currentTimeMillis() as the seed) on every iteration of the loop, instead of reusing the same one for the entire run.

After the fix, the insertion takes about 2-3 seconds. That still seems too slow in comparison — why does the standard Java HashSet need only a second to insert when it is more complex than MyHashSet? Now I get only about 9 million collisions. I also tried disabling the logging code to see if it helps, but it doesn't change the picture. I would appreciate any ideas, and apologize again for the confusion.
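For reference, the fixed test loop looks roughly like this (a minimal sketch of my benchmark; the class name MyHashSet is from the code above, and the timing and random-number details are simplified):

    import java.util.Random;

    public class MyHashSetBenchmark {
        public static void main(String[] args) {
            int n = 10_000_000;
            MyHashSet set = new MyHashSet(n);

            // One Random for the whole run -- the bug was creating a new
            // Random(System.currentTimeMillis()) inside the loop, which
            // reuses the same seed across many consecutive iterations.
            Random rnd = new Random(System.currentTimeMillis());

            long start = System.currentTimeMillis();
            for (int i = 0; i < n; i++) {
                long num = rnd.nextLong() & Long.MAX_VALUE; // force a non-negative value
                set.add(num);
            }
            System.out.println("Insert took " + (System.currentTimeMillis() - start) + " ms");
        }
    }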

2 answers

The first thing I notice is the needless boxing on the line

 int cell = ((Long) (num % _entries.length)).intValue(); 

which is much slower than

 int cell = (int) (num % _entries.length); 

(Note that num % _entries.length will always fit in an int, since _entries.length is itself an int.)

Admittedly, Java's HashSet suffers from that sort of overhead anyway, but this is at least one obvious thing to fix.

Also, it is probably in your best interest to make the table size a prime number. The easiest way to do that is BigInteger.valueOf((int)(numOfEntries / 0.65)).nextProbablePrime().intValue(), and since it is a one-time cost, it should not noticeably affect overall performance.
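Applied to the constructor from the question, the prime-sized table would look something like this (an untested sketch):

    import java.math.BigInteger;

    public MyHashSet(int numOfEntries) {
        int neededSize = (int)(numOfEntries / 0.65D);
        // One-time cost at construction: round the size up to a prime,
        // which spreads keys sharing common factors across the whole table.
        _entries = new long[BigInteger.valueOf(neededSize)
                                      .nextProbablePrime()
                                      .intValue()];
    }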

Alternatively, Java's HashSet uses power-of-two table sizes so that it can use a mask (value & (_entries.length - 1), essentially) instead of %, which is often more expensive.
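A sketch of that power-of-two variant (assuming the rounded-up size still fits in an int):

    public MyHashSet(int numOfEntries) {
        int neededSize = (int)(numOfEntries / 0.65D);
        // Round up to the next power of two so that (length - 1) is an all-ones mask.
        _entries = new long[Integer.highestOneBit(Math.max(1, neededSize - 1)) << 1];
    }

    public void add(long num) {
        int mask = _entries.length - 1;
        int cell = (int) (num & mask);  // mask instead of the more expensive %
        while (_entries[cell] != 0) {
            cell = (cell + 1) & mask;   // wrap around with the same mask
        }
        _entries[cell] = num;
    }

Note that masking only looks at the low bits of the key, which is why Java's HashMap additionally smears the hash bits before indexing; with raw long keys that have low-bit patterns, a power-of-two table can degrade badly.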


First off: fix your modulo. Otherwise you can get ArrayIndexOutOfBoundsExceptions, and it is easy to fix at no real cost (just an AND). Also, if you are up for it, do what Louis suggests and get rid of the useless boxing to Long.
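That "just an AND" fix could look like this (a sketch; clearing the sign bit keeps the index non-negative even if a negative long ever slips in):

    // num & Long.MAX_VALUE clears the sign bit, so the modulo result
    // (and hence the array index) can never be negative.
    int cell = (int) ((num & Long.MAX_VALUE) % _entries.length);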

Anyway, the real problem is that you are using a terrible probing function when a cell is already taken. Linear probing is usually a bad idea, and you make it even worse by probing in only one direction. If your numbers are not evenly distributed, you will run into lots of collisions. Double hashing works quite well in practice, but you could also fix your linear probing and test whether that helps.
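A double-hashing sketch for this class (assuming a prime table size, which guarantees that any step in [1, length - 1] is co-prime to the length and therefore eventually visits every cell):

    public void add(long num) {
        long key = num & Long.MAX_VALUE;  // defensively non-negative
        int len = _entries.length;
        int cell = (int) (key % len);
        // The second hash picks the probe step; it must never be 0, and with
        // a prime len every such step is co-prime to it, so no cell is skipped.
        int step = 1 + (int) (key % (len - 2));
        while (_entries[cell] != 0) {
            cell = (cell + step) % len;
        }
        _entries[cell] = num;
    }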

Then you should either use a prime number for the table size, as Louis suggests, which has some (theoretically provable) advantages but is slower, or use the next power of 2. At the moment you are combining the disadvantages of both approaches.


Source: https://habr.com/ru/post/1403473/