Efficient hash function for low entropy alphanumeric strings

I am trying to write a (perfect) hash table to compress the mapping from unicode code names to their code point number (mapping the second column to the first column). As you can see, the possible inputs are very limited, in fact there are exactly 38 characters in the alphabet: AB...YZ , 0...9 , - and a space. In addition, there are many repetitions (substrings), DIGIT ZERO , DIGIT ONE , ..., LATIN CAPITAL LETTER A , LATIN CAPITAL LETTER B , etc.

The ideal hash table is computed by selecting the seed S , and then trying to build the perfect section of the hash table (in some way) by a hash on S If the table cannot be executed, it retries with a new seed. Having a large number of collisions usually requires more attempts, since it is becoming increasingly difficult for an algorithm to do everything.

The result of this is my entry domain with low entropy, and it takes a lot of effort to create a table with simple hash functions like DJB2; better hashes like FNV work pretty well, but more complex and slower features like SipHash tend to require even less repetition.

Since this is completely static and precalculated, I don’t care too much about quality for quality (i.e., safety and the probability distribution for arbitrary input at runtime doesn’t matter), but higher quality functions reduce the pre-compression time required for the other way around. allows you to achieve higher compression at a certain fixed time.

Question : Are there efficient published hash functions configured for input with such domain restrictions? That is, are there hash functions that use an additional structure to perform fewer operations, but still achieve a reasonable result?

I was looking for things like an “alphanumeric hash function”, but the results are not related (basically, only generating an alphanumeric string as the output of the hash function); even some guidance on the correct jargon so that I can search for documents would be helpful.

(This question is motivated by the fact that it is a little interesting to solve, and not really necessary.)

+5
source share
1 answer

I am trying to write a (perfect) hash table ...

If you want a perfect hash function, I would generate it with something like CMPH. This may be part of the static lookup table behind the scenes.

Alternatively, you can use a non-hash based approach, for example with DAWG or some similar Trie structure (and some Aho-Corasick on top?).

DAWG provides fairly compact storage and a quick search for strings for numbers. My guess is that she is likely to beat a hash table for your problem.

See http://www.wutka.com/dawg.html for some introduction. Versions are available in several languages.

0
source

Source: https://habr.com/ru/post/1201228/


All Articles