Order-independent hashing of a set of integers

I want to hash a set of integers so that the order of the integers does not affect the computed hash value, i.e. H([32224,12232,564423]) == H([564423,32224,12232]).

The number of unique sets will be in the range of a few million. Speed is very important, but I also need to know the upper bound on collisions for the chosen approach.

Wikipedia has a good section on hashing vectors, but I don't understand the math behind it well enough to implement it confidently in code. I would appreciate it if someone could explain the math alongside some code. Ideally, I would like the final hash to be 32 bits. If it matters, I will be implementing this in Java.

Update: I am trying to avoid sorting the integers in a set for performance reasons (I am working with a lot of such sets).

+6
5 answers

A simple approach is to XOR or add the hashes of the individual integers. XOR and addition are both commutative and associative, so the result does not depend on the order.

Like this:

    int hc = 0;
    for (int i = 0; i < n; i++) {
        hc += a[i];
    }
    return hc;

or

    int hc = 0;
    for (int i = 0; i < n; i++) {
        hc ^= a[i];
    }
    return hc;

since the hash of an int is its value.

In fact, this is exactly what HashSet<Integer>.hashCode() does (it uses addition). If your integers are already boxed, or you can tolerate the cost of boxing them, this is a built-in solution.
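A minimal sketch of the built-in route, assuming boxing is acceptable. The class name `SetHashDemo` and the helper `hashOf` are made up for illustration; only `HashSet`, `Set.hashCode()` and `Integer.hashCode()` come from the standard library.

```java
import java.util.HashSet;
import java.util.Set;

class SetHashDemo {
    // Hypothetical helper: boxes the ints into a HashSet and returns the set's hash code.
    // Duplicate values collapse, because the container is a set.
    static int hashOf(int... values) {
        Set<Integer> set = new HashSet<>();
        for (int v : values) {
            set.add(v);
        }
        return set.hashCode();
    }

    public static void main(String[] args) {
        int h1 = hashOf(32224, 12232, 564423);
        int h2 = hashOf(564423, 32224, 12232);
        // Same elements inserted in a different order.
        System.out.println(h1 == h2);
        // The set's hash is the (wrapping) sum of the element values.
        System.out.println(h1 == 32224 + 12232 + 564423);
    }
}
```

Both comparisons should hold, since the set's hash code is defined as the sum of the element hash codes and an `Integer` hashes to its own value.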

+5

Assuming you need speed without the overhead of *Set classes, you can write H like this:

    /**
     * Hashes a set of integers.
     *
     * @param list to hash
     * @return hash code
     */
    public static int H(int[] list) {
        // XOR all the integers together.
        int hashcode = 0;
        for (int val : list) {
            hashcode ^= val;
        }
        return hashcode;
    }

The result is the same regardless of order, and it is relatively efficient.

For instance:

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(H(new int[]{0xabcd, 0x1234, 0x1111})));
        System.out.println(Integer.toHexString(H(new int[]{0x1234, 0x1111, 0xabcd})));
    }

Output:

    a8e8
    a8e8

This can be generalized beyond int by doing the following:

    /**
     * Hashes a set of objects.
     *
     * @param list to hash
     * @return hash code
     */
    public static int H(Object[] list) {
        // XOR all the hashes together.
        int hashcode = 0;
        for (Object val : list) {
            hashcode ^= val.hashCode();
        }
        return hashcode;
    }

The main program would then have to use Integer arrays instead of arrays of the primitive int.

Adding numbers should be just as fast and can give you a better distribution over the 32-bit range. If the elements of the set are already evenly distributed over the range, then xor might be better.

However, with both methods it is easy to construct colliding sets of integers. For example, using the add method:

    {1000, 1001, 1002}
    {0, 1, 3002}

Both of these arrays have the same H().

Using the XOR method:

    {0x1010, 0x0101}
    {0x1111, 0x0000}

Both of these have the same H() as well.
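The two collision examples can be checked with a short sketch. The class name `CollisionDemo` and the method names `sumHash` and `xorHash` are illustrative only; they simply restate the add and XOR variants discussed above.

```java
class CollisionDemo {
    // Order-independent hash using wrapping 32-bit addition.
    static int sumHash(int... values) {
        int h = 0;
        for (int v : values) {
            h += v;
        }
        return h;
    }

    // Order-independent hash using XOR.
    static int xorHash(int... values) {
        int h = 0;
        for (int v : values) {
            h ^= v;
        }
        return h;
    }

    public static void main(String[] args) {
        // Different sets, same sum (3003).
        System.out.println(sumHash(1000, 1001, 1002) == sumHash(0, 1, 3002));
        // Different sets, same XOR (0x1111).
        System.out.println(xorHash(0x1010, 0x0101) == xorHash(0x1111, 0x0000));
    }
}
```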

Similarly, the element 0 is problematic, since a list has the same hash with or without it. You can mitigate this by adding a constant value at each iteration. For instance:

    ...
    hashcode += val.hashCode() + CONSTANT;
    ...

Or by using the number of elements as the initial hash code:

    ...
    // XOR all the hashes together.
    int hashcode = list.length;
    ...
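
As a rough illustration of the length-seeded variant, the sketch below compares a plain XOR with one that starts from the element count. `SeededXorHash`, `plainXor` and `seededXor` are hypothetical names chosen for this example.

```java
class SeededXorHash {
    // Plain XOR: an element equal to 0 never changes the result.
    static int plainXor(int... values) {
        int h = 0;
        for (int v : values) {
            h ^= v;
        }
        return h;
    }

    // XOR seeded with the element count, so adding a 0 element changes the seed.
    static int seededXor(int... values) {
        int h = values.length;
        for (int v : values) {
            h ^= v;
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(plainXor(5, 7) == plainXor(0, 5, 7));   // true: 0 is invisible
        System.out.println(seededXor(5, 7) == seededXor(0, 5, 7)); // false: lengths differ
    }
}
```

This only distinguishes sets of different sizes; sets of equal size that XOR to the same value still collide.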
+2

You can put all integers in a Java HashSet and use its hashCode.

The documentation of java.util.Set specifies the following:

Returns the hash code value for this set. The hash code of a set is defined to be the sum of the hash codes of the elements in the set, where the hash code of a null element is defined to be zero. This ensures that s1.equals(s2) implies that s1.hashCode()==s2.hashCode() for any two sets s1 and s2, as required by the general contract of Object.hashCode().

And Integer.hashCode() returns

a hash code value for this object, equal to the primitive int value represented by this Integer object.

Thus, the hashCode of a set of integers i1, i2, ... i_n in the Java standard library is i1 + i2 + ... + i_n.

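If the numbers are quite small, you can also multiply each element by a prime of suitable size. Knuth used 2654435761, which is too big for a Java int, but you can use a two's-complement equivalent. (Strictly, 2654435761 wraps to -1640531535; the value -1640531527 used here is 0x9E3779B9, the closely related golden-ratio constant, and it serves the same purpose as a multiplier.) So take C = -1640531527 and then your code is C*i1 + C*i2 + ... + C*i_n. Note that this equals C*(i1 + i2 + ... + i_n), so it spreads small sums across the 32-bit range but collides exactly when the plain sums collide.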

    private static final int C = -1640531527;

    public static int calculateHash(int[] set) {
        int code = 0;
        for (int e : set) {
            code += C * e;
        }
        return code;
    }

However, there is one obvious flaw in this line of thinking. A hash code alone cannot prove that two sets are really equal: when two hash codes match, you still have to compare the sets themselves, and the easiest way to do that is to sort the elements. Of course, if there are significantly fewer than millions of sets, there will also be few collisions.
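One way to keep sorting off the hot path is to use the hash as a cheap first filter and sort only when two hashes match. The sketch below assumes each array holds distinct values (a set); `SetEquality`, `hash` and `sameSet` are hypothetical names, and the plain sum stands in for whichever order-independent hash is chosen.

```java
import java.util.Arrays;

class SetEquality {
    // Order-independent hash: wrapping sum of the elements.
    static int hash(int[] set) {
        int code = 0;
        for (int e : set) {
            code += e;
        }
        return code;
    }

    // Rejects on size or hash mismatch first; sorts copies only when both match.
    static boolean sameSet(int[] a, int[] b) {
        if (a.length != b.length || hash(a) != hash(b)) {
            return false;
        }
        int[] x = a.clone();
        int[] y = b.clone();
        Arrays.sort(x);
        Arrays.sort(y);
        return Arrays.equals(x, y);
    }

    public static void main(String[] args) {
        System.out.println(sameSet(new int[]{3, 1, 2}, new int[]{2, 3, 1}));
        // Hashes collide (both sum to 3003), so the sorted comparison decides.
        System.out.println(sameSet(new int[]{1000, 1001, 1002}, new int[]{0, 1, 3002}));
    }
}
```

With this arrangement the cost of sorting is paid only for equal sets and for genuine hash collisions.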

+1

I would prefer summing over XORing, because 1) a sum is what the implementation of Set.hashCode() uses, 2) a sum-based approach to hashing an array is recommended in Effective Java, and 3) it is less prone to collisions. I suggest you take a look at the OpenJDK AbstractSet implementation: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7-b147/java/util/AbstractSet.java?av=f

    public int hashCode() {
        int h = 0;
        Iterator<E> i = iterator();
        while (i.hasNext()) {
            E obj = i.next();
            if (obj != null)
                h += obj.hashCode();
        }
        return h;
    }

I also recommend making h a long and folding the high 32 bits into the low 32 bits when returning, for example (int) ((h & 0xffffffffL) ^ (h >>> 32)).

+1

This is by no means trivial programming, but you can take inspiration from the S-boxes of the DES algorithm: with them you can build a good dispersion function that maps nearby integers to very dissimilar ones. XORing these dissimilar integers together should then no longer pose a collision threat.
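A sketch of the same idea with a simpler stand-in for S-boxes: each element goes through a bit mixer before the results are XORed. The mixer used here is the 32-bit finalizer from MurmurHash3, swapped in as an assumption; it is not DES. The names `MixedSetHash`, `mix` and `hash` are illustrative.

```java
class MixedSetHash {
    // MurmurHash3 32-bit finalizer: a bijective mixer that scatters nearby inputs.
    // Used here as an assumed substitute for an S-box based dispersion function.
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // XOR of the mixed elements; XOR is commutative, so order does not matter.
    static int hash(int... values) {
        int code = 0;
        for (int v : values) {
            code ^= mix(v);
        }
        return code;
    }

    public static void main(String[] args) {
        System.out.println(hash(32224, 12232, 564423) == hash(564423, 32224, 12232));
        System.out.println(Integer.toHexString(hash(1000, 1001, 1002)));
        System.out.println(Integer.toHexString(hash(0, 1, 3002)));
    }
}
```

Hand-built collisions such as the examples earlier in the thread are very unlikely to survive the mixing step, although no 32-bit hash can rule collisions out. Note that this mixer maps 0 to 0, so the zero-element caveat still applies unless the hash is also seeded with the element count.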

0

Source: https://habr.com/ru/post/950915/

