Sometimes the best way to find out is to just run some brute force tests on your ranges. Ultimately, you can always write a hash function and go back and fix it later if you get poor performance. Premature optimization is evil. However, hashing is easy to verify.
I ran this program and got 0 collisions:
import java.util.HashMap; import java.util.Map; import java.util.Map.Entry; public class Testing { public static void main(String[] args) { int minX = 0; int minY = 100000; int maxX = 20; int maxY = 2000000; Map<Integer, Integer> hashToCounts = new HashMap<Integer, Integer>(); for (int x = minX; x < maxX; x++) { for (int y = minY; y < maxY; y++) { int hash = hash(x, y); Integer count = hashToCounts.get(hash); if (count == null) count = 0; hashToCounts.put(hash, ++count); } } int totalCollisions = 0; for (Entry<Integer, Integer> hashCountEntry : hashToCounts.entrySet()) if (hashCountEntry.getValue() > 1) totalCollisions += hashCountEntry.getValue() - 1; System.out.println("Total collisions: " + totalCollisions); } private static int hash(int x, int y) { return 7 + y * 31 + x * 23; } }
And the conclusion:
Total Collisions: 0
Please note that my function was 7 + y * 31 + x * 23 .
Of course, don't take my word for it. Stir ranges to tune it to your dataset and try to calculate it yourself.
Using (y * 31) ^ x gave me:
Total Collisions: 475,000
And using only x * y :
Total Collisions: 20439,039
Be warned that this program can use a pretty good chunk of memory and processing power. I ran it on a fairly powerful server. I do not know how this will work on the local machine.
Some good rules for hashing:
- Mix your operators. By mixing your operators, you can make the results change more. Using just
x * y in this test, I had a very large number of collisions. - Use prime numbers for multiplication. Prime numbers have interesting binary properties that lead to more unstable multiplication.
- Avoid using shift operators (unless you really know what you are doing). They insert many zeros or ones into the binary number, reducing the volatility of other operations and potentially even reducing the number of possible outputs.
source share