Fuzzy string search in Java, including word swaps

I start Java by trying to write a program that matches the input to a list of predefined strings. I looked at the Levenshtein distance, but I ran into such problems:

If I have an entry, such as a “beef filet,” I want it to match a “beef filet." The problem is that the “beef fillet” is closer, according to Levenshtein, to something like “tuna fillet,” which, of course, is wrong.

Should I use something like Lucene for this? Does Lucene use methods in a Java class?

Thanks!

+4
source share
3 answers

You need to calculate the relevance of your searches for input strings. Lucene has built-in relevance calculations, and this article may be a good start to understand them (I just looked through it, but it seems authoritative enough).

The basic process is as follows:

  • Initialization: Decrypt the search terms and save them in the HashSet s series, one per period. Or, if you want to give different weights to each word, use HashMap , where the word is the key.
  • Processing: tokenize each input line and check each of the sets of search terms to determine how they apply to the input. See above for a description of the algorithms.

There is a simple trick for handling spelling errors: during initialization, you create sets containing potential errors in search terms. Peter Norvig's entry on " How to Write a Spelling Corrector" "describes this process (it uses Python code, but Java implementation is certainly possible).

+2
source

Lucene supports fuzzy search based on Levenshtein distance.

https://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Fuzzy%20Searches

But lucene is designed to search through a set of documents, not to search for strings, so lucene may be redundant for you. Other Java implementations are available. Take a look at http://www.merriampark.com/ldjava.htm

+1
source

It should be possible to apply Levenshtein's distance to words, and not to symbols. Then, to match the words, you can apply Levenshtein again at the character level, so the “filet” in the “beef filet” should match the “filet” in the “beef filet”.

+1
source

Source: https://habr.com/ru/post/1347097/


All Articles