Hash table lookup with a perfect hash, in C

I have a C application where I need to do table lookups.

The entries are strings, and they are all known at the start of execution. The table is initialized once and then looked up many times. The table can change, but when it does, it's basically as if the application were starting over. I think this means I can use a perfect hash? It's fine to spend some time initializing the hash table, since that happens only once.

There will be from 3 to 100,000 entries, each of them unique, and I believe 80% of cases will have fewer than 100 entries. In those cases, a simple, naive search is "fast enough" (== nobody complains).

However, in the cases with 10k+ entries, the lookup speed of the naive approach is unacceptable. What is a good approach for fast hash-table lookups of strings in C? Assume I don't have a third-party commercial library such as Boost, etc. Which hashing algorithm should I use? How do I decide?

+6
3 answers

Creating a perfect hash is not an easy problem; there are libraries dedicated to the task. The most popular one is probably CMPH. I haven't used it, so I can't help further. gperf is another tool, but it requires the strings to be known at compile time (you could work around that by compiling a .so and loading it at runtime, but that somewhat defeats the point).
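For the gperf route, the workflow (from memory; check the gperf manual for the exact input format and the generated function's signature) is: list the keywords in an input file, run gperf over it, and compile the generated C into your program. A keyword file can be as small as:

    %%
    alpha
    bravo
    charlie

Running gperf keywords.gperf > lookup.c then emits a perfect-hash recognizer, by default a function named in_word_set , which returns a non-null pointer exactly when its argument is one of the keywords.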

But to be honest, I would at least try binary search first. Just sort the array with qsort , then search it with bsearch (or roll your own binary search). Both are part of stdlib.h , and have been since C89.
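A minimal sketch of that approach (the table contents and names here are just illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* qsort/bsearch hand the comparator pointers to the array
       elements, which here are char pointers, so compare through
       one extra level of indirection. */
    static int cmp_str(const void *a, const void *b)
    {
        return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    int main(void)
    {
        const char *table[] = { "delta", "alpha", "charlie", "bravo" };
        size_t n = sizeof table / sizeof table[0];
        const char *key = "charlie";
        const char **hit;

        /* One-time setup: sort the table. */
        qsort(table, n, sizeof table[0], cmp_str);

        /* Each lookup is then a binary search. */
        hit = bsearch(&key, table, n, sizeof table[0], cmp_str);
        printf("%s: %s\n", key, hit ? "found" : "not found");
        return 0;
    }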

+4

If the naive approach (I assume you mean linear search) is acceptable for 100 entries (so on average 50 comparisons per lookup), then binary search will be more than sufficient for 100,000 entries: it takes at most ceil(log2(100,000)) = 17 comparisons.

So I wouldn't bother with hashes at all; just sort your string table at startup (e.g. with qsort ) and use binary search (e.g. with bsearch ) to look up entries later.

+4

If the (maximum) table size is known, a simple hash table with chaining is very easy to implement. The space overhead is just two ints per element. With a reasonable hash function, an average of only about 1.5 probes per lookup is needed, and that's for a table at 100% load.
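A minimal sketch of that layout, assuming one bucket per element and a djb2-style hash (the names and the hash choice are mine, not the answerer's; error handling omitted for brevity):

    #include <stdlib.h>
    #include <string.h>

    /* Chained hash table over a fixed set of strings. With the
       bucket count equal to the element count, the overhead is
       exactly two ints per element: one bucket head plus one
       "next" link. */
    struct strtable {
        const char **keys; /* the entries themselves                   */
        int *head;         /* head[b]: first entry in bucket b, or -1  */
        int *next;         /* next[i]: next entry in i's bucket, or -1 */
        size_t n;          /* number of entries == number of buckets   */
    };

    static unsigned long hash_str(const char *s) /* djb2 */
    {
        unsigned long h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h;
    }

    static void strtable_init(struct strtable *t, const char **keys, size_t n)
    {
        t->keys = keys;
        t->n = n;
        t->head = malloc(n * sizeof *t->head);
        t->next = malloc(n * sizeof *t->next);
        for (size_t b = 0; b < n; b++)
            t->head[b] = -1;
        for (size_t i = 0; i < n; i++) { /* prepend entry i to its bucket */
            size_t b = hash_str(keys[i]) % n;
            t->next[i] = t->head[b];
            t->head[b] = (int)i;
        }
    }

    /* Returns the index of key in keys[], or -1 if absent. */
    static int strtable_find(const struct strtable *t, const char *key)
    {
        for (int i = t->head[hash_str(key) % t->n]; i != -1; i = t->next[i])
            if (strcmp(t->keys[i], key) == 0)
                return i;
        return -1;
    }

At a load factor of 1.0, a uniformly distributing hash gives an expected 1 + 0.5 = 1.5 string comparisons per successful lookup, which is where the figure above comes from.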

Building a perfect hash is only viable if your data never changes. As soon as it does, you have to recompute the hash function and rehash everything, which is far more expensive than doing a few extra comparisons.

0

Source: https://habr.com/ru/post/896759/

