The fastest way to find dictionary strings in text

I have a text file and a dictionary. The dictionary is a list of words that are each exactly 8 characters long. I scan the text file and look up every 8-character substring in the dictionary (a "sliding window").

I am currently using a Python dictionary as the lookup table. It has amortized O(1) lookup time, but I wonder whether there are faster algorithms or data structures that exploit the specific structure of the problem.
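For reference, the sliding-window lookup described above might look roughly like the sketch below. The function name `find_matches` and the `width` parameter are illustrative assumptions, not code from the original post, and a `set` stands in for the dictionary because only membership is tested.

```python
def find_matches(text, words, width=8):
    """Return (offset, word) for every window of `width` characters found in `words`.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    vocab = set(words)  # hash-based membership test, average O(1) per lookup
    matches = []
    # Slide a window of `width` characters across the text, one character at a time.
    for i in range(len(text) - width + 1):
        window = text[i:i + width]
        if window in vocab:
            matches.append((i, window))
    return matches


if __name__ == "__main__":
    print(find_matches("xxABCDEFGHyy12345678", ["ABCDEFGH", "12345678"]))
```

Each window costs one slice plus one hash of 8 characters, so the total work is proportional to the length of the text times the window width.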

2 answers

You can try Aho-Corasick multiple-pattern matching. It builds a finite state machine from the dictionary, using a breadth-first search to find, for each state, the longest proper suffix that is also a prefix of some dictionary string. You can try my PHP implementation at https://phpahocorasick.codeplex.com . It also extends the algorithm with wildcard search.
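To make the idea concrete, here is a minimal pure-Python sketch of an Aho-Corasick automaton. It is not a port of the PHP implementation linked above and it omits the wildcard extension. The function names `build_automaton` and `search` are assumptions chosen for illustration.

```python
from collections import deque


def build_automaton(words):
    """Build trie transitions, failure links, and output lists (illustrative sketch)."""
    goto = [{}]   # goto[state][char] -> next state; state 0 is the root
    fail = [0]    # fail[state] -> state for the longest proper suffix that is a trie prefix
    out = [[]]    # out[state] -> dictionary words that end at this state
    for word in words:
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(word)

    # Breadth-first search: depth-1 states keep the root as their failure link.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, automaton):
    """Return (offset, word) for every dictionary word occurring in `text`."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        # Follow failure links until a transition on `ch` exists or we reach the root.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((i - len(word) + 1, word))
    return matches


if __name__ == "__main__":
    automaton = build_automaton(["ABCDEFGH", "12345678"])
    print(search("xxABCDEFGHyy12345678", automaton))
```

The scan reads each character of the text once and never re-hashes a full 8-character window. Whether that beats a hash lookup in practice depends on the implementation, so it is worth benchmarking both on real data.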


I think you can use a full-text search engine for this, e.g. Apache Solr or Elasticsearch.

Alternatively, you can use http://lunrjs.com/ on the client side.


Source: https://habr.com/ru/post/1599063/

