What property of the bit pattern causes a collision?

I read about the Java approach for randomizing hash keys here.
Apparently, the idea is to make sure the low-order bits are "random" to help with bucket allocation, but I'm trying to understand this in more detail.
So, if we have a table of size 10, then the numbers 0, 10, 20, 30, 40, etc. all fall into bucket 0, the numbers 1, 11, 21, 31, etc. fall into bucket 1, and so on (using modulo 10). Therefore, by changing the bit patterns, you can force the keys into different buckets instead of all piling up in bucket 0.
But what I don't quite understand is which property of the low-order bits we need to randomize. So we have:

 0000 0000 (0)
 0000 1010 (10)
 0001 0100 (20)
 0001 1110 (30)
 0010 1000 (40)
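
For concreteness, here is a small sketch (my own illustration, not from the original post) that prints those values in binary together with the bucket they land in, both for a size-10 table (plain remainder) and for a hypothetical power-of-two table of size 16 (bit masking):

 public class BucketDemo {
     public static void main(String[] args) {
         for (int v = 0; v <= 40; v += 10) {
             int modBucket  = v % 10;        // table of size 10: remainder
             int maskBucket = v & (16 - 1);  // power-of-two table: low 4 bits
             System.out.printf("%9s (%2d)  %%10 -> bucket %d   &15 -> bucket %2d%n",
                     Integer.toBinaryString(v), v, modBucket, maskBucket);
         }
     }
 }

(For these multiples of 10, the modulo-10 table puts everything in bucket 0, while the power-of-two mask happens to spread them; the problem described below is the reverse situation, where the hash codes all share their low-order bits.)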

What is the pattern in the low-order bits that makes them fall into the same slot?
Or maybe I'm confused about the following: I understand that some pattern in the low-order bits is causing collisions, and that we are trying to randomize those bits to compensate.

+6
2 answers

Some hash functions do a poor job of randomizing low bits.

One classic case is using memory addresses as the hash for object references ("pointers" in C terms), which is otherwise a reasonable way to cheaply get a unique number identifying an object. This works fine if the hash table's bucket count is a prime number, but for hash implementations where the number of buckets is always a power of two, the fact that aligned addresses are all divisible by 8 means the low three bits are always zero, so seven out of every eight buckets will stay empty.
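
As a rough illustration of that scenario (my sketch, not the answerer's code, using made-up 8-byte-aligned addresses): with a power-of-two bucket count only every eighth bucket can ever be hit, while a prime bucket count spreads the same addresses across all buckets.

 import java.util.Set;
 import java.util.TreeSet;

 public class AlignedAddressDemo {
     public static void main(String[] args) {
         Set<Integer> powerOfTwo = new TreeSet<>();  // 16 buckets, index = low 4 bits
         Set<Integer> prime = new TreeSet<>();       // 17 buckets, index = remainder
         for (long addr = 0x1000; addr < 0x1000 + 8 * 100; addr += 8) {
             powerOfTwo.add((int) (addr & 15));      // addresses divisible by 8: only 0 or 8
             prime.add((int) (addr % 17));
         }
         System.out.println("power-of-two table uses buckets " + powerOfTwo); // [0, 8]
         System.out.println("prime-sized table uses buckets   " + prime);     // all of 0..16
     }
 }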

This is an extreme case, but whenever the data being hashed is not evenly distributed and the hash function tends to preserve the low bits, you will see some bias in the bucket assignments.

+2

Java's HashMap uses hash tables whose size is a power of two. If you use the remainder (modulo) operation as the compression function, as is common, you end up taking the low-order bits of the hash code as the bucket index. If the hash codes are all multiples of a power of two, some of the least significant bits are always zero, and you end up using only a fraction of the available buckets.

A specific example: suppose you have 32 buckets and the hash codes are multiples of 8. The table uses only the 5 least significant bits of the code, and 3 of them are always 0. Therefore only 2 bits determine the bucket, and you use only 4 of the 32 buckets:

 XXXXXX00000 -> bucket 0
 XXXXXX01000 -> bucket 8
 XXXXXX10000 -> bucket 16
 XXXXXX11000 -> bucket 24
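
A quick way to check this (my sketch, not part of the original answer) is to mask a run of multiples of 8 with 31 and collect the resulting bucket indices:

 import java.util.Set;
 import java.util.TreeSet;

 public class LowBitsDemo {
     public static void main(String[] args) {
         Set<Integer> used = new TreeSet<>();
         for (int h = 0; h < 1000; h += 8) {   // hash codes that are multiples of 8
             used.add(h & 31);                 // bucket index = low 5 bits (32 buckets)
         }
         System.out.println(used);             // prints [0, 8, 16, 24]
     }
 }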

Fortunately, this is not the case in Java, because HashMap does not use the least significant bits of the hash code directly: it scrambles the bits, so it is not so easy to run into this bad scenario by accident. Here is an excerpt from the OpenJDK HashMap implementation:

 /**
  * Applies a supplemental hash function to a given hashCode, which
  * defends against poor quality hash functions. This is critical
  * because HashMap uses power-of-two length hash tables, that
  * otherwise encounter collisions for hashCodes that do not differ
  * in lower bits. Note: Null keys always map to hash 0, thus index 0.
  */
 static int hash(int h) {
     // This function ensures that hashCodes that differ only by
     // constant multiples at each bit position have a bounded
     // number of collisions (approximately 8 at default load factor).
     h ^= (h >>> 20) ^ (h >>> 12);
     return h ^ (h >>> 7) ^ (h >>> 4);
 }
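
For comparison, here is a small experiment of mine (not part of the answer) that feeds the same multiples of 8 through the hash(int) method above before masking; the supplemental hash spreads them over far more than the 4 buckets they would otherwise occupy:

 import java.util.Set;
 import java.util.TreeSet;

 public class SpreadDemo {
     // The supplemental hash from the OpenJDK excerpt above.
     static int hash(int h) {
         h ^= (h >>> 20) ^ (h >>> 12);
         return h ^ (h >>> 7) ^ (h >>> 4);
     }

     public static void main(String[] args) {
         Set<Integer> raw = new TreeSet<>();
         Set<Integer> scrambled = new TreeSet<>();
         for (int h = 0; h < 10_000; h += 8) {
             raw.add(h & 31);              // low bits only: 4 distinct buckets
             scrambled.add(hash(h) & 31);  // after scrambling: many more buckets
         }
         System.out.println("without supplemental hash: " + raw.size() + " buckets used");
         System.out.println("with supplemental hash:    " + scrambled.size() + " buckets used");
     }
 }
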
+2

Source: https://habr.com/ru/post/988527/

