Signed to a positive almost perfect hash

Question

I have an integer type, say long, whose values lie between Long.MIN_VALUE = 0x80...0 (-2^63) and Long.MAX_VALUE = 0x7f...f (2^63 - 1). I want to hash it, with ~50% collisions, to a positive integer of the same type (i.e. between 1 and Long.MAX_VALUE) in a clean and efficient way.

My first attempts were something like this:

Math.abs(x) + 1
(x & Long.MAX_VALUE) + 1

but these and similar approaches always have problems with certain values, i.e. when x is 0 / Long.MIN_VALUE / Long.MAX_VALUE . Of course, the naive solution is to use 2 if statements, but I'm looking for something cleaner / shorter / faster. Any ideas?

Note: assume I am working in Java, where there is no implicit conversion to boolean and shift semantics are well defined.

java long-integer bit-manipulation hash perfect-hash

eold Jul 11 '12 at 5:59

9 answers

Evgeny Kluev · Answer 1 · 2012-07-22T11:57:51+0000

The easiest approach is to zero out the sign bit and then map zero to some other value:

 long y = x & Long.MAX_VALUE;
 return (y == 0) ? 42 : y;

It is simple, uses only one if / ternary operator, and gives an average collision rate of ~50%. There is one drawback: it maps 4 different values (0, 42, MIN_VALUE, MIN_VALUE + 42) to a single value (42). So this value has 75% collisions, while all other values have exactly 50%.

It may be preferable to distribute collisions more evenly:

 return (x == 0) ? 42 : (x == Long.MIN_VALUE) ? 142 : x & Long.MAX_VALUE;

This code gives 67% collisions for 2 values and 50% for the others. You cannot distribute the collisions more evenly, but you can choose which 2 values collide the most. The downside is that this code uses two ifs / ternary operators.

You can avoid 75% collisions with a single value by using only one if / ternary statement:

 long y = x & Long.MAX_VALUE;
 return (y == 0) ? 42 - (x >> 7) : y;

This code gives 67% collisions for 2 values and 50% collisions for the others. There is less freedom in choosing the most-colliding values: 0 maps to 42 (and you can choose almost any value instead); MIN_VALUE maps to 42 - (MIN_VALUE >> 7) (and you can shift MIN_VALUE by any amount from 1 to 63, just make sure that A - (MIN_VALUE >> B) does not overflow).

You can get the same result (67% collisions for 2 values and 50% collisions for other values) without conditional statements (but with more complex code):

 long y = x - 1 - ((x >> 63) << 1);
 long z = y + 1 + (y >> 63);
 return z & Long.MAX_VALUE;

This gives 67% collisions for the values "1" and "MAX_VALUE". If it’s more convenient to get the majority of collisions for some other values, just apply this algorithm to x + A , where “A” is any number.

An improved version of this solution:

 long y = x + 1 + ((x >> 63) << 1);
 long z = y - (y >> 63);
 return z & Long.MAX_VALUE;
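One way to sanity-check the two branch-free formulas is to run them exhaustively on a smaller type. The sketch below is not part of the answer: it assumes an 8-bit analogue (byte, with shifts by 7 instead of 63), and the class and method names are hypothetical. It counts how many inputs land on each target.

```java
import java.util.Arrays;

// Hypothetical helper, not from the answer: checks both branch-free
// formulas exhaustively on an 8-bit analogue (shift 7 instead of 63).
class BranchFreeCheck {

    // 8-bit analogue of: y = x - 1 - ((x >> 63) << 1); z = y + 1 + (y >> 63)
    static int first(byte x) {
        byte y = (byte) (x - 1 - ((x >> 7) << 1));
        byte z = (byte) (y + 1 + (y >> 7));
        return z & Byte.MAX_VALUE;
    }

    // 8-bit analogue of: y = x + 1 + ((x >> 63) << 1); z = y - (y >> 63)
    static int improved(byte x) {
        byte y = (byte) (x + 1 + ((x >> 7) << 1));
        byte z = (byte) (y - (y >> 7));
        return z & Byte.MAX_VALUE;
    }

    // counts[t] = number of 8-bit inputs that map to target t
    static int[] sourcesPerTarget(boolean useImproved) {
        int[] counts = new int[128];
        for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
            byte x = (byte) i;
            counts[useImproved ? improved(x) : first(x)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sourcesPerTarget(false)));
        System.out.println(Arrays.toString(sourcesPerTarget(true)));
    }
}
```

In this 8-bit model both variants never produce 0, send three inputs to each of the targets 1 and 127, and two inputs to every other target, which matches the 67% / 50% figures quoted above.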

mikera · Answer 2 · 2012-07-19T04:38:44+0000

Assuming you want to collapse all values into the positive space, why not just mask off the sign bit?

You can do this with a single bitwise operation, taking advantage of the fact that MAX_VALUE is just a zero sign bit followed by ones, e.g.

 int positive = value & Integer.MAX_VALUE;

Or for longs:

 long positive = value & Long.MAX_VALUE;

If you need a “better” hash with pseudo-random qualities, you probably want to pass the value through another hash function first. My favorite fast hashes are George Marsaglia's XORshift family. They have the nice property that they map the entire int / long number space onto itself, so you still get exactly 50% collisions after zeroing the sign bit.

Here is a quick implementation of XORshift in Java:

 public static final long xorShift64(long a) {
     a ^= (a << 21);
     a ^= (a >>> 35);
     a ^= (a << 4);
     return a;
 }

 public static final int xorShift32(int a) {
     a ^= (a << 13);
     a ^= (a >>> 17);
     a ^= (a << 5);
     return a;
 }
```
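The "maps the number space onto itself" property can be illustrated by undoing the three steps in reverse order. This is only a sketch: the inverse helpers and the class name are hypothetical and not part of the answer. Each step of the form a ^= a << k is undone by XOR-ing together all multiples of the shift that still fit in 64 bits.

```java
// Hypothetical check, not part of the answer: inverts xorShift64 step by step,
// which is only possible because each step is a bijection on 64-bit values.
class XorShiftRoundTrip {

    static long xorShift64(long a) {
        a ^= (a << 21);
        a ^= (a >>> 35);
        a ^= (a << 4);
        return a;
    }

    // Undoes "a ^= a << k": a = b ^ (b << k) ^ (b << 2k) ^ ...
    static long undoLeft(long b, int k) {
        long a = b;
        for (int s = k; s < 64; s += k) {
            a ^= b << s;
        }
        return a;
    }

    // Undoes "a ^= a >>> k": a = b ^ (b >>> k) ^ (b >>> 2k) ^ ...
    static long undoRight(long b, int k) {
        long a = b;
        for (int s = k; s < 64; s += k) {
            a ^= b >>> s;
        }
        return a;
    }

    // Applies the inverse steps in reverse order.
    static long invert(long h) {
        h = undoLeft(h, 4);
        h = undoRight(h, 35);
        h = undoLeft(h, 21);
        return h;
    }
}
```

Note that xorShift64(0) is 0, so zero still maps to zero after masking; the zero case from the question would still need separate handling.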

Gene · Answer 3 · 2012-07-22T19:58:03+0000

From an information-theoretic point of view, you have 2^64 values to map into 2^63-1 values.

Thus, the mapping is trivial with a modulus operation that always gives a non-negative result (in Java that is Math.floorMod; the % operator takes the sign of the dividend):

 y = 1 + Math.floorMod(x, 0x7fffffffffffffffL); // the constant is 2^63-1

It can be quite expensive, so what else is possible?

The simple arithmetic 2^64 = 2 * (2^63 - 1) + 2 says that we will have two source values for each target value, except for two special cases where three sources go to one target. Think of these as two special 64-bit values, call them x1 and x2, each of which shares its target with two other source values. In the mod expression above (with a non-negative modulus), this happens by "wrapping": the target values y=1 and y=2^63-1 each have three sources, and all the others have two. Since we are free to use something more complex than mod anyway, let's look for a way to map the special values wherever we like, at low cost.

For illustration, let's work with mapping a 4-bit signed int x in [-8..7] to y in [1..7], rather than with the 64-bit space.

An easy approach is to map the values of x in [1..7] to themselves; the problem then reduces to mapping x in [-8..0] to y in [1..7]. Note that there are 9 source values and only 7 targets, as discussed above.

There are many strategies. At this point you can probably see a gazillion of them. I will describe only one that is especially simple.

Let y = 1 - x for all the remaining values, except for the special cases x1 == -8 and x2 == -7. The whole hash function thus becomes

 y = x <= -7 ? S(x) : x <= 0 ? 1 - x : x;

Here S(x) is a simple function that says where x1 and x2 are mapped. Choose S based on what you know about the data. For example, if you think that high target values are unlikely, map them to 6 and 7 with S(x) = -1 - x.

Final mapping:

 -8: 7
 -7: 6
 -6: 7
 -5: 6
 -4: 5
 -3: 4
 -2: 3
 -1: 2
  0: 1
  1: 1
  2: 2
  3: 3
  4: 4
  5: 5
  6: 6
  7: 7

Carrying this logic over to the 64-bit space, you get

 y = (x <= Long.MIN_VALUE + 1) ? -1 - x : x <= 0 ? 1 - x : x;
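As a rough check of this construction, the sketch below wraps the 4-bit and 64-bit expressions in methods and counts the sources per target in the 4-bit case. The class and method names are hypothetical, and S(x) = -1 - x is the choice made above.

```java
// Hypothetical wrapper around the expressions above, with S(x) = -1 - x.
class PiecewiseMap {

    // 4-bit illustration: x in [-8..7] -> y in [1..7]
    static int map4(int x) {
        return x <= -7 ? -1 - x : x <= 0 ? 1 - x : x;
    }

    // 64-bit version: x in [Long.MIN_VALUE..Long.MAX_VALUE] -> y in [1..Long.MAX_VALUE]
    static long map64(long x) {
        return (x <= Long.MIN_VALUE + 1) ? -1 - x : x <= 0 ? 1 - x : x;
    }

    // counts[t] = number of 4-bit inputs that map to target t
    static int[] sourcesPerTarget4() {
        int[] counts = new int[8];
        for (int x = -8; x <= 7; x++) {
            counts[map4(x)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        for (int x = -8; x <= 7; x++) {
            System.out.println(x + ": " + map4(x));
        }
    }
}
```

In the 4-bit case the targets 6 and 7 receive three sources each and the targets 1 to 5 receive two, in line with the counting argument.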

Many other variations are possible within this framework.

Durandal · Answer 4 · 2012-07-23T15:59:24+0000

I would go for the simplest version that does not waste too much time:

 public static long positiveHash(final long hash) {
     final long result = hash & Long.MAX_VALUE;
     return (result != 0) ? result : (hash == 0 ? 1 : 2);
 }

This implementation pays for just one conditional for all but two of the possible inputs: 0 and MIN_VALUE. Those two are mapped to different values by the second conditional. I doubt you will get a better combination of (code) simplicity and (computational) complexity.

Of course, if you can live with a worse distribution, it gets much simpler. Restricting the space to 1/4 instead of 1/2 - 1, you can get:

 public static long badDistribution(final long hash) {
     // the unsigned shift drops the sign bit, so the result lies in [1, 2^62]
     return (hash >>> 2) + 1;
 }

aRestless · Answer 5 · 2012-07-24T09:39:21+0000

If the value is non-negative, it can probably be used directly; otherwise flip the sign bit:

 hash = (x >= 0) ? x : x ^ Long.MIN_VALUE;

However, you should scramble this value a bit further if the x values are correlated (meaning: similar objects produce similar values of x), possibly with

 hash = Math.floorMod(a * (hash + b), Long.MAX_VALUE) + 1

for some positive constants a and b, where a should be fairly large and b prevents 0 from always mapping to 1. This also maps everything into [1, Long.MAX_VALUE] instead of [0, Long.MAX_VALUE]. By varying a and b, you can also implement more complex hashing schemes, such as cuckoo hashing, which needs two different hash functions.

Such a solution should definitely be preferred over one that produces a "strange collision distribution" for the same values every time it is used.
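One caveat with the affine step is that a * (hash + b) can overflow a long. A minimal sketch that sidesteps this by doing the arithmetic in BigInteger is shown below; the class name, the method name and any particular values of a and b are assumptions for illustration, not part of the answer, and BigInteger is of course much slower than plain long arithmetic.

```java
import java.math.BigInteger;

// Hypothetical overflow-safe variant of the affine scramble described above.
class AffineScramble {

    private static final BigInteger M = BigInteger.valueOf(Long.MAX_VALUE);

    // Returns a value in [1, Long.MAX_VALUE] for any x; a and b are caller-chosen constants.
    static long scramble(long x, long a, long b) {
        // First step from the answer: fold negative values into the non-negative range.
        long hash = x >= 0 ? x : x ^ Long.MIN_VALUE;
        // a * (hash + b) computed without overflow.
        BigInteger product = BigInteger.valueOf(a)
                .multiply(BigInteger.valueOf(hash).add(BigInteger.valueOf(b)));
        // BigInteger.mod always yields a result in [0, M - 1].
        return product.mod(M).longValueExact() + 1;
    }
}
```

For example, scramble(0, 3, 5) computes 3 * (0 + 5) = 15, reduces it modulo 2^63 - 1 and adds 1.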

Harry johnston · Answer 6 · 2012-07-25T04:14:00+0000

You can do this without any conditionals and in a single expression by using the unsigned shift operator:

 public static int makePositive(int x) {
     return (x >>> 1) + (~x >>> 31);
 }
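Since the question is about long, the same expression can presumably be carried over by shifting by 63 instead of 31. The sketch below is an adaptation, not part of the answer, and the class name is hypothetical.

```java
// Hypothetical long adaptation of the int expression above.
class MakePositiveLong {

    static long makePositive(long x) {
        // x >>> 1 lies in [0, Long.MAX_VALUE]; ~x >>> 63 is 1 for non-negative x
        // and 0 for negative x, so the sum always lies in [1, Long.MAX_VALUE].
        return (x >>> 1) + (~x >>> 63);
    }
}
```

With this variant, 0 and 1 both map to 1, -1 maps to Long.MAX_VALUE, and Long.MIN_VALUE and Long.MAX_VALUE both map to 2^62.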

Hounshell · Answer 7 · 2012-07-11T06:38:47+0000

Just to make sure: you have a long and want to hash it to an int?

You can do...

 (int) x                    // This results in a meaningless number, but it works
 (int) (x & 0xffffffffL)    // This will give you just the low order bits
 (int) (x >> 32)            // This will give you just the high order bits
 ((Long) x).hashCode()      // This is the high and low order bits XORed together

If you want to keep it a long, you can do ...

 x & 0x7fffffffffffffffL    // This will just ignore the sign, Long.MIN_VALUE -> 0
 x & Long.MAX_VALUE         // Should be the same I think

If getting 0 is no good ...

 (x & 0x7ffffffffffffffeL) + 1 // This has a 75% collision rate.

Just thinking out loud ...

 ((x & Long.MAX_VALUE) << 1) + 1 // I think this is also 75% (but note the shift can move bit 62 into the sign bit)

I think you will need to either be okay with 75% or get a little ugly:

 (x > 0) ? x : (x < 0) ? x & Long.MAX_VALUE : 7

ErikE · Answer 8 · 2012-07-25T23:59:28+0000

This seems like the simplest of all:

 Math.floorMod(x, Long.MAX_VALUE) + 1 // in Java, plain % returns negative results for negative x

I would be interested to see a speed comparison of all the methods above.

Roger Johnson · Answer 9 · 2012-07-26T04:18:53+0000

Simple: AND your input value with Long.MAX_VALUE and OR it with 1. Nothing more is needed.

Example:

 long hash = (input & Long.MAX_VALUE) | 1;
