Algorithm for a hash / CRC of an unordered multiset

Question

Let's say I would like to create an unordered set of unordered multisets of unsigned int. To do this, I need a hash function that computes the hash of an unordered multiset. Ideally, it should also be usable as a CRC.

One obvious solution is to put the elements in a vector, sort them, and return a hash of the result. This seems to work, but it is expensive.

Another approach is to XOR the values, but obviously, if an element is present twice or not at all, the result will be the same, which is not good.
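To make the XOR problem concrete, here is a minimal sketch. The helper name `xor_hash` and the use of a `std::vector` as input are assumptions made only for illustration.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: combines the elements with XOR only.
std::uint32_t xor_hash(const std::vector<std::uint32_t>& elems) {
    std::uint32_t h = 0;
    for (std::uint32_t x : elems) {
        h ^= x;  // x ^ x == 0, so every pair of equal values vanishes
    }
    return h;
}
```

With this helper, `xor_hash({7, 7})` and `xor_hash({})` both return 0, and `xor_hash({1, 2, 2})` equals `xor_hash({1})`, which is exactly the collision described above.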

Any ideas on how I can implement this more cheaply? I have an application that will do this for thousands of relatively large sets.

c++ algorithm crc hash

gsf Apr 9 '16 at 17:24

3 answers

Mark Adler · Answer 1 · 2016-04-10T03:46:27+0000

Since this is a multiset, you want the hash value to be the same for identical multisets, whose representation may have the same elements presented, added, or deleted in a different order. You would therefore like the hash to be commutative, easy to update, and to change with every change in the elements. You would also like two changes not to cancel each other's effect on the hash too easily.

One operation that satisfies all but the last requirement is addition: just sum the elements. To keep the sum bounded, take it modulo the size of your hash value (for example, modulo 2⁶⁴ for a 64-bit hash). To make sure that inserting or deleting zero values changes the hash, add one to each value first.

The disadvantage of the sum is that two changes can easily cancel each other out, for instance replacing 1, 3 with 2, 2. To address this, you can use the same approach but sum a polynomial of the entries, which still preserves commutativity. For instance, instead of summing x + 1, you can sum x² + x + 1. Now it is harder to contrive sets of changes that produce the same sum.
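The approach described above can be sketched as follows. This is only an illustration under assumptions: the names `term` and `poly_sum_hash` are made up, the elements are taken to be 32-bit unsigned values, and the modulo 2⁶⁴ reduction comes from ordinary unsigned wraparound.

```cpp
#include <cstdint>
#include <vector>

// Per-element term x^2 + x + 1 (hypothetical helper name).
// For a 32-bit x the result always fits in 64 bits.
std::uint64_t term(std::uint32_t x) {
    std::uint64_t v = x;
    return v * v + v + 1;
}

// Order-independent hash: the sum of the per-element terms,
// reduced modulo 2^64 by unsigned overflow.
std::uint64_t poly_sum_hash(const std::vector<std::uint32_t>& elems) {
    std::uint64_t h = 0;
    for (std::uint32_t x : elems) {
        h += term(x);
    }
    return h;
}
```

Here `poly_sum_hash({1, 3})` is 16 while `poly_sum_hash({2, 2})` is 14, so the swap that defeats the plain sum no longer collides. Because the hash is a sum, inserting x can be handled by adding `term(x)` to a stored hash and removing x by subtracting it, without rescanning the multiset.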

Alex Reinking · Answer 2 · 2016-04-10T04:28:07+0000

Here's a reasonable hash function for std::unordered_multiset<int>. It would be better if the calculations were done modulo a large prime, but the idea stands.

    #include <iostream>
    #include <unordered_set>

    namespace std {
    template<>
    struct hash<unordered_multiset<int>> {
        typedef unordered_multiset<int> argument_type;
        typedef std::size_t result_type;

        const result_type BASE = static_cast<result_type>(0xA67);

        // Computes BASE^ex by repeated squaring; overflow wraps around.
        result_type log_pow(result_type ex) const {
            result_type res = 1;
            result_type base = BASE;
            while (ex > 0) {
                if (ex % 2) {
                    res = res * base;
                }
                base *= base;
                ex /= 2;
            }
            return res;
        }

        // Sums BASE^el over all elements, so element order does not matter.
        result_type operator()(argument_type const & val) const {
            result_type h = 0;
            for (const int& el : val) {
                h += log_pow(el);
            }
            return h;
        }
    };
    }

    int main() {
        std::unordered_set<std::unordered_multiset<int>> mySet;
        std::unordered_multiset<int> set1{1,2,3,4};
        std::unordered_multiset<int> set2{1,1,2,2,3,3,4,4};
        std::cout << "Hash 1: " << std::hash<std::unordered_multiset<int>>()(set1) << std::endl;
        std::cout << "Hash 2: " << std::hash<std::unordered_multiset<int>>()(set2) << std::endl;
        return 0;
    }

Output (these values correspond to a 32-bit std::size_t; a 64-bit build prints different numbers):

    Hash 1: 2290886192
    Hash 2: 286805088

When the modulus is a prime p, the number of collisions is proportional to 1 / p. I do not know what the analysis would be for powers of two. You can make hash updates efficient by adding / subtracting BASE ^ x when you insert / remove the integer x.
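The incremental update mentioned above might look like the sketch below. The names `pow_base` and `RunningHash` are hypothetical, and the arithmetic wraps modulo the width of `std::size_t` rather than modulo a prime, as in the answer's code.

```cpp
#include <cstddef>

// Same base value as in the answer's code.
constexpr std::size_t BASE = 0xA67;

// BASE^ex by repeated squaring; overflow wraps around.
std::size_t pow_base(std::size_t ex) {
    std::size_t res = 1;
    std::size_t base = BASE;
    while (ex > 0) {
        if (ex & 1) {
            res *= base;
        }
        base *= base;
        ex >>= 1;
    }
    return res;
}

// Hypothetical wrapper that keeps a running hash next to a multiset.
struct RunningHash {
    std::size_t value = 0;

    // Call when x is inserted into the multiset.
    void insert(int x) { value += pow_base(static_cast<std::size_t>(x)); }

    // Call when one copy of x is removed; assumes x was present.
    void remove(int x) { value -= pow_base(static_cast<std::size_t>(x)); }
};
```

Each insertion or removal then costs a single exponentiation, independent of how many elements the multiset holds, and the stored value is the same regardless of the order in which the updates were applied.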

Stefan Haustein · Answer 3 · 2016-04-09T22:11:00+0000

Implement the inner multiset as a value -> count hash map.

This lets you avoid the problem of an even number of equal elements cancelling out under XOR, as follows: instead of XOR-ing each element, you build a new number from the count and the value (for example, by multiplying them), and then you build the complete hash from those numbers using XOR.
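A minimal sketch of this idea, under stated assumptions: the alias `CountMap` and the function `counted_xor_hash` are made-up names, and the per-entry number is taken to be (value + 1) * count. The answer only suggests multiplying; the + 1 is an added assumption so that the value 0 still contributes.

```cpp
#include <cstdint>
#include <unordered_map>

// Multiset stored as value -> count.
using CountMap = std::unordered_map<std::uint32_t, std::uint32_t>;

// Hypothetical hash: XOR of one number per distinct value.
std::uint64_t counted_xor_hash(const CountMap& counts) {
    std::uint64_t h = 0;
    for (const auto& entry : counts) {
        std::uint64_t v = entry.first;
        // Combine value and count first; XOR keeps the result order-independent.
        h ^= (v + 1) * entry.second;
    }
    return h;
}
```

With this combination, one, two, and three copies of the value 5 hash to 6, 12, and 18 respectively, so duplicates no longer cancel. Plain multiplication is a weak mix, though: the value 1 once and the value 0 twice both give 2, so a stronger per-entry mixing function would likely be needed in practice.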
