Assuming the database is pretty static, use a color filter . This is a degenerate form of a hash table that stores only bits indicating the presence of a value, without storing the value itself. This is probable, because hashes can collide, so every hit requires a full search. But a 1 MB flowering filter with 500,000 entries can have at least 0.03% false positives.
Some math: To get this low speed, up to 23 hash codes are required, each of which has 23 bits of entropy, a total of 529 bits. Bob Jenkins 64-bit hash function generates 192 bits of entropy in a single pass (if you use unregistered variables in hash() , which Bob quotes as possibly as “mediocre” hash), which requires no more than three passes. Due to how flowering filters work, you don’t need all the entropy for each request, as most search queries will report a skip before moving on to the 23rd hash code.
EDIT: You will obviously have to parse words from the text. A search for each instance of /\b\w+\b/ is likely to be performed for the first version.
To match phrases, you will need to test every subsequence of an n-word (aka n-gram), where n is any number from 2 to the largest phrase in your dictionary. You can do it much cheaper by adding any word that appears in the phrase to a separate flowering filter and only testing n-grams for which each word passes this second filter.
source share