Data structure for finding a string of fixed length

I have a set of strings that serve as keys. Something like:

AAAA ABBA ACEA ALFG ... ... ZURF [AAA _JFS aKDJ 

Each key is a unique combination of 4 characters, so they all have the same length. There are hundreds of thousands of them. I want to look up a key and get the value associated with it.

Currently I have implemented this as a hash table, but the main problem is collisions (I have tried all the collision-resolution strategies from the Wikipedia article).

I am planning to implement this as a prefix tree (trie). Given the constraints, though (unique keys, fixed length), I wonder whether there is a ready-made data structure I am not thinking of that would be better suited for this...

EDIT: Also, the structure is populated once from a data file; after that, only lookups are performed.
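For illustration, a minimal sketch of the fixed-depth prefix tree mentioned above, assuming 8-bit characters and an int value type; all names here are placeholders, not from the question:

    #include <stdlib.h>

    /* Sketch of a fixed-depth trie for 4-character keys.  Each level consumes
       one character; the node reached after the 4th character stores the value. */
    typedef struct trie_node {
        struct trie_node *child[256];   /* indexed by the next character */
        int has_value;
        int value;
    } trie_node;

    static trie_node *trie_new(void) {
        return calloc(1, sizeof(trie_node));
    }

    static void trie_insert(trie_node *root, const char key[4], int value) {
        trie_node *n = root;
        for (int i = 0; i < 4; i++) {
            unsigned char c = (unsigned char)key[i];
            if (!n->child[c])
                n->child[c] = trie_new();
            n = n->child[c];
        }
        n->has_value = 1;
        n->value = value;
    }

    /* Returns 1 and writes *value if the key is present, 0 otherwise. */
    static int trie_lookup(const trie_node *root, const char key[4], int *value) {
        const trie_node *n = root;
        for (int i = 0; i < 4; i++) {
            n = n->child[(unsigned char)key[i]];
            if (!n)
                return 0;
        }
        if (!n->has_value)
            return 0;
        *value = n->value;
        return 1;
    }

A lookup is four pointer chases with no collisions, at the cost of memory-hungry nodes (256 pointers each).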

+6
3 answers

Since you know all the strings ahead of time, you can use gperf to generate a perfect hash function that has no collisions. For example, given the four input strings AAAA ABBA ACEA ALFG, it generated the following hash function (using the command line gperf -L ANSI-C input.txt):

    static unsigned int
    hash (register const char *str, register unsigned int len)
    {
      static unsigned char asso_values[] =
        {
          12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
          12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
          12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
          12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
          12,  7,  2,  5, 12, 12, 12, 12, 12, 12, 12, 12,  0, 12, 12, 12,
          12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
          12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
          12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
          12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
          12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
          12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
          12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
          12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
          12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
          12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
          12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12
        };
      return len + asso_values[(unsigned char)str[1]];
    }

    const char *
    in_word_set (register const char *str, register unsigned int len)
    {
      static const char * wordlist[] =
        {
          "", "", "", "",
          "ALFG",
          "",
          "ABBA",
          "", "",
          "ACEA",
          "",
          "AAAA"
        };

      if (len <= MAX_WORD_LENGTH && len >= MIN_WORD_LENGTH)
        {
          register int key = hash (str, len);

          if (key <= MAX_HASH_VALUE && key >= 0)
            {
              register const char *s = wordlist[key];

              if (*str == *s && !strcmp (str + 1, s + 1))
                return s;
            }
        }
      return 0;
    }

A lookup requires a single table access, a length comparison and one string comparison. If you know for sure that the word you are hashing is one of your original words, you can skip the string comparison.
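For illustration, assuming the snippet below is appended to the file gperf generated (so hash, in_word_set and MAX_HASH_VALUE are in scope), a value lookup could be sketched like this; the parallel values array is an assumption, not something gperf emits:

    /* Hypothetical usage sketch.  Because the hash is perfect, its result can
       also index a parallel array holding the value associated with each key. */
    static int values[MAX_HASH_VALUE + 1];   /* filled once when loading the data file */

    int lookup_value (const char *key, int *out)
    {
      /* The in_word_set() call can be skipped if the key is known to be valid. */
      if (in_word_set (key, 4) == NULL)
        return 0;
      *out = values[hash (key, 4)];
      return 1;
    }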

Expanding the input from 4 to 10,000 randomly generated strings only grows the hash function to 4 table lookups, plus the length and string comparisons. But since the string comparison has to keep every original string around, the generated word list becomes a very large table in the compiled object file (1.4 MB). If you do not need the string comparison, you can omit that table.

+5

A hash table, even with some collisions, will outperform everything else, and you can tune it to reduce the collisions.
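As one illustration of such tuning (a sketch, not the answerer's code): since every key is exactly 4 bytes, it can be packed into a 32-bit word and mixed with a multiplicative hash, which tends to spread short keys well:

    #include <stdint.h>
    #include <string.h>

    /* Sketch: hash specialized for exactly-4-character keys.  The key is packed
       into a 32-bit word and mixed with Knuth's multiplicative constant;
       table_bits is log2 of the table size and must be between 1 and 32. */
    static uint32_t hash4 (const char key[4], unsigned table_bits)
    {
        uint32_t x;
        memcpy(&x, key, 4);                 /* pack the 4 bytes into one word */
        x *= 2654435761u;                   /* multiplicative (Fibonacci) hashing */
        return x >> (32 - table_bits);      /* take the top bits as the bucket */
    }

Keeping the table size a power of two and well above the number of keys keeps the load factor, and therefore the collision rate, low.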

+1

First, encode each string as an integer. If your alphabet contains 64 characters (for example), you only need 4 * 6 = 24 bits per key.

Now, if you use a large enough share of the possible key space (you say there are hundreds of thousands of keys), perhaps the simplest solution is: just create an array and access it by index (the integer mapped from the string).

If possible, implement this with a single memory allocation. It can even save memory compared to the overhead wasted on 100,000 small allocations.
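A sketch of this approach; the 64-character alphabet below is a guess for illustration (the real one would come from the data file), and the int value type is a placeholder:

    #include <stdint.h>
    #include <stdlib.h>

    /* Sketch of the direct-indexed array: each 4-character key is mapped to a
       24-bit integer and used as an index into one big table. */
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789[_";
    static uint8_t code[256];               /* character -> 6-bit value */

    static uint32_t key_to_index (const char key[4])
    {
        return (uint32_t)code[(unsigned char)key[0]] << 18
             | (uint32_t)code[(unsigned char)key[1]] << 12
             | (uint32_t)code[(unsigned char)key[2]] <<  6
             | (uint32_t)code[(unsigned char)key[3]];
    }

    int main (void)
    {
        for (size_t i = 0; i < 64; i++)
            code[(unsigned char)alphabet[i]] = (uint8_t)i;

        /* One allocation covers the whole 24-bit key space: 2^24 ints = 64 MB. */
        int *values = calloc(1u << 24, sizeof(int));

        values[key_to_index("ABBA")] = 42;       /* insert */
        int v = values[key_to_index("ABBA")];    /* look up */
        (void)v;
        free(values);
        return 0;
    }

A lookup is a single array access with no collisions possible, traded against the fixed 64 MB table.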

0

Source: https://habr.com/ru/post/899248/

