String Matching Algorithms Used by Lucene

I want to know the string matching algorithms used by Apache Lucene. I was looking through the index file format used by Lucene. It seems that Lucene stores all the words found in the text, along with their frequency of occurrence in each document, but as far as I know, to match strings efficiently it would need to preprocess the words occurring in the documents.

Example: searching for "iamrohitbanga is a stackoverflow user" (using fuzzy matching)

in some documents.

It is possible that there is a document containing the string "rohit banga".

To find that the substrings "rohit" and "banga" are present in the search string, it would need some efficient substring matching algorithm.

I want to know which algorithm it is. Also, if Lucene does some preprocessing, which function call in the Java API triggers it?

3 answers

Lucene's basic design uses exact string matches, or defines equivalent strings using an Analyzer. The analyzer breaks text into indexable tokens. During this process it may collate equivalent strings (for example, upper and lower case, stemmed strings, removal of diacritics, etc.). The resulting tokens are stored in the index as a dictionary plus a postings list of the tokens in documents. Therefore, you can build and use a Lucene index without ever using a string matching algorithm such as KMP. However, FuzzyQuery and WildCardQuery use something like it, first searching for matching terms and then using them for the full match. See Robert Muir's blog post about AutomatonQuery for a new, efficient approach to this problem.
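To make the tokenization step concrete, here is a minimal sketch against a recent Lucene release (StandardAnalyzer and CharTermAttribute are real Lucene classes; the field name "body" and the sample text are just placeholders):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class AnalyzerDemo {
        public static void main(String[] args) throws IOException {
            Analyzer analyzer = new StandardAnalyzer();
            // Tokenize the text; StandardAnalyzer also lower-cases terms,
            // one example of collating equivalent strings at index time.
            try (TokenStream ts = analyzer.tokenStream("body",
                    "iamrohitbanga is a stackoverflow user")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();                    // required before iterating
                while (ts.incrementToken()) {
                    System.out.println(term);  // one normalized token per line
                }
                ts.end();
            }
        }
    }

These tokens form the dictionary that FuzzyQuery and WildCardQuery later scan for matching terms.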


As Yuval explained, in general Lucene is geared toward exact matches (by normalizing terms with analyzers at both index and query time).

The Lucene trunk code (a version not yet released) actually exploits the structure of the term dictionary for inexact matches such as Regex, Wildcard, and Fuzzy.

How this works is that the Lucene term dictionary is really a form of prefix trie. You can see this in the file formats you mentioned, in a couple of places:

Thus, if the previous term's text was "bone" and the current term is "boy", the prefix length is two and the suffix is "y".

The term info index gives us "random access" by indexing into this trie at certain intervals (every 128th term by default).
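For illustration, here is a tiny self-contained sketch of that prefix-delta encoding (not Lucene's actual code, just the scheme the file-format excerpt describes):

    public class PrefixDelta {
        // Encode a term as (shared-prefix length, suffix) relative to the
        // previous term in sorted order, as the excerpt above describes.
        static String encode(String prev, String term) {
            int p = 0;
            int max = Math.min(prev.length(), term.length());
            while (p < max && prev.charAt(p) == term.charAt(p)) {
                p++;
            }
            return p + "+" + term.substring(p);
        }

        public static void main(String[] args) {
            System.out.println(encode("bone", "boy")); // prints "2+y"
        }
    }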

So at a low level it is a prefix trie, but at a higher level we exploit these properties (mainly the ones specified for IndexReader.terms) to treat the term dictionary as a deterministic finite automaton (DFA):

Returns an enumeration of all terms starting at a given term. If the given term does not exist, the enumeration is positioned at the first term greater than the supplied term. The enumeration is ordered by Term.compareTo(). Each term is greater than all that precede it in the enumeration.
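A short sketch of seeking into the term dictionary with this method (this is the pre-4.0 IndexReader.terms(Term) API that the quoted javadoc belongs to; the index path and field name are placeholders):

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.store.FSDirectory;

    public class TermSeekDemo {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(
                    FSDirectory.open(new File("/path/to/index"))); // placeholder
            TermEnum terms = reader.terms(new Term("body", "rohit")); // seek
            try {
                do {
                    Term t = terms.term();
                    if (t == null || !"body".equals(t.field())) {
                        break; // walked past the last term of the field
                    }
                    System.out.println(t.text() + " docFreq=" + terms.docFreq());
                } while (terms.next());
            } finally {
                terms.close();
                reader.close();
            }
        }
    }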

Inexact queries such as Regex, Wildcard, and Fuzzy are themselves also defined as DFAs, and "matching" is just the intersection of the DFAs.
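In the released 4.0+ API this approach surfaces as automaton-backed queries. A minimal sketch (RegexpQuery and FuzzyQuery are the real class names; the field and terms are placeholders):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.RegexpQuery;

    public class DfaQueries {
        public static void main(String[] args) {
            // Both queries compile their pattern into a DFA and intersect it
            // with the term dictionary to enumerate the matching terms.
            RegexpQuery regexp = new RegexpQuery(new Term("body", "ro[a-z]*"));
            FuzzyQuery fuzzy = new FuzzyQuery(new Term("body", "rohit"), 1); // <= 1 edit
            System.out.println(regexp);
            System.out.println(fuzzy);
        }
    }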


As you pointed out, Lucene only stores the list of terms that occur in documents. How Lucene extracts those terms is up to you. The default Lucene analyzer simply breaks text into words separated by whitespace. You can write your own implementation that, for example, for the source string "iamrohitbanga" yields 5 tokens: "iamrohitbanga", "i", "am", "rohit", "banga".

See the Lucene API docs for the TokenFilter class.
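A skeleton of such a custom filter (the class name WordSplitFilter and the splitWords stub are hypothetical; a production filter would also have to manage position and offset attributes, which this sketch omits):

    import java.io.IOException;
    import java.util.ArrayDeque;
    import java.util.Collections;
    import java.util.Deque;
    import java.util.List;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class WordSplitFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final Deque<String> pending = new ArrayDeque<>();

        public WordSplitFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!pending.isEmpty()) {
                // Emit a buffered sub-word of the previous token.
                termAtt.setEmpty().append(pending.poll());
                return true;
            }
            if (!input.incrementToken()) {
                return false;
            }
            // Pass the original token through and queue any embedded words.
            pending.addAll(splitWords(termAtt.toString()));
            return true;
        }

        private List<String> splitWords(String token) {
            // Hypothetical stub: a real implementation would consult a
            // dictionary, e.g. "iamrohitbanga" -> ["i", "am", "rohit", "banga"].
            return Collections.emptyList();
        }
    }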

