String to Integer Hashing Function with precision

I want to use the hash char array for int or long. The resulting value must adhere to the specified accuracy value. The function I used is below:

int GetHash(const char* zKey, int iPrecision /*= 6*/) { /////FROM : http://courses.cs.vt.edu/~cs2604/spring02/Projects/4/elfhash.cpp unsigned long h = 0; long M = pow(10, iPrecision); while(*zKey) { h = (h << 4) + *zKey++; unsigned long g = h & 0xF0000000L; if (g) h ^= g >> 24; h &= ~g; } return (int) (h % M); } 

The string to be hashed is similar to "SAEUI1210.00000010_1".

However, in some cases, this leads to duplication of values. Are there any good alternatives that won't duplicate the same hash for different string values.

+4
source share
4 answers

The definition of a hash itself is that it creates duplicate values ​​for some values ​​due to the fact that the range of hash values ​​is smaller than the hashed data space.

In theory, a 32-bit hash has enough range to hash all ~ 6 character strings (only AZ, az, 0-9) without causing a collision. In practice, hashes are not an ideal rearrangement of input data. Given a 32-bit hash, you can expect to get hash collisions after hashing ~ 16 bits of random inputs due to a paradoxical birthday .

Given a static set of data values, it is always possible to construct a hash function created specifically for them that never collides with itself (of course, its output size will be at least log(|data set|) . You should know all possible data values ​​before time.This is called perfect hashing .

As they say, here are a few alternatives that should run you (they are designed to minimize conflicts)

+13
source

Each hash will have collisions. Period. This is called a happy birthday problem .

You might want to check for cryptographic features like MD5 (relatively fast, and you don't care if it's unsafe), but they will also have conflicts.

+2
source

Hashes generate the same value for different inputs - this is what they do. All you can do is create a hash function with enough distribution or bit depth (or both) to minimize these collisions. Since you have this additional accuracy limit (0-5?), You will be more likely to encounter collisions.

+2
source

MD5 or SHA , There are many open implementations, and the result is unlikely to lead to duplication of results.

+1
source

Source: https://habr.com/ru/post/1286394/


All Articles