Hashing sequences of integers

Question

Hashing sequences of integers

I have to deal with sequences of numbers, where a sequence has the following properties:

Elements are integers,
the lengths of the sequences are variable and not fixed,
integers have an upper bound,
multiple occurrences of elements are allowed,
the order of the elements does not matter.

Given a sequence, I would like to know if this sequence has occurred, i.e. I want a hash sequence. For instance,

[2, 3, 6, 2, 13]

and

 [6, 3, 2, 13, 2]

must have the same hash value.

The programming language used is C.

I know that I can sort the sequences first and then store them in trie, which is definitely an option. However, what would be a suitable hash function for this purpose?

+4

hash

Guybrush May 22, '13 at 15:02

source share

2 answers

Ilmari karonen · Answer 1 · 2013-05-22T15:38:16+0000

Requirement that

the order of the elements does not matter

makes me immediately think of something like Zobrist hashing . That is, you will have a function f matching integers with random bistro strings, and your hash will be just XOR bit strings corresponding to the numbers in your sequence.

Of course, the basic Zobrist hashing described above does not satisfy your other requirements, which

multiple occurrences of elements are allowed

since the operation XOR is proper inverse (i.e., a XOR a = 0 for any a ). However, simply replacing XOR with some other ring operation without this property (which is really considered desirable in normal Zobrist hashing), for example n-bit add, should generate the hash as you want:

 unsigned int hash_multiset (int *seq, int n) { unsigned int h = 0; while (n--) h += f( *seq++ ); return h; }

(A small detail to mark this function is that if you want to truncate its output, then it is slightly better to use the upper than the lower bit. This is because if the k least significant bits of the hashes of the sequences [a] and [b] collide, then the least significant bits k [a, a] , [b, b] , [a, b] , etc. For k high bits this is not true, since the lower bits can transfer to the higher bits, creating a more "random" output.)

There are various ways to implement the function f . For a limited range of input integers, you can simply use a fixed random bit string lookup table. Alternatively, if you do not know the range of your inputs in advance, you can use a different (regular) hash table to match integers with random bitrates and simply create it on the fly.

Finally, it is also possible to implement f without a lookup table, simply using a fixed function that "looks random enough." One good option for such a feature would be to use a simple and quick block encipher , such as TEA or (in systems with hardware support) AES , with the result truncated to your preferred length.

Stig brautaset · Answer 2 · 2013-05-22T15:15:28+0000

How about multiplying all the numbers and the length of the sequence modulo some sufficiently large number? Here is some Scala code that shows the calculation:

 val l = List(6, 3, 2, 13, 2) (l.reduce(_ * _) * l.length) % 10000

The result is: 4680.

Obviously, this does not guarantee that if the hashes match, the sequences are unique. (This may not even be a very good approximation!) However, if the hashes do not match, this ensures that the sequences do not match.

Hashing sequences of integers

More articles: