Need an algorithm for a Google-like word search

I will explain the problem here.

Suppose I have a list of 1000 words; call it a dictionary. The user enters a word: if it is spelled correctly it should hit an exact match, otherwise the program should return the closest match. Like a Google search: as we type something in, it offers the closest match.

The algorithm I thought of:

Read the word list one by one; split the input word string into characters; take the first word from the list and match it character by character; then do the same for the other words in the list.

I know this is the long way around and it will take a long time. Does anyone know a better algorithm?
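
For reference, the naive scan described above amounts to something like the following (a minimal sketch in Perl, the language used in the answers below; wordlist.txt is a placeholder file name):

open(my $fh, "<", "wordlist.txt") or die "Cannot open word list: $!";
chomp(my @dictionary = <$fh>);
close($fh);

my $input = $ARGV[0];
for my $word (@dictionary) {
    # Comparing the whole word; a character-by-character loop does the
    # same work with more code.
    if (lc($word) eq lc($input)) {
        print "Exact match: $word\n";
        exit;
    }
}
print "No exact match; finding the closest match needs a better algorithm.\n";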

+4
3 answers
  • Sort the words into an array
  • When a word comes in => binary search for it (O(log n)); we do this because a hash table would be useful for an exact match but bad for neighboring ones
  • If there is a perfect match, return it
  • Otherwise, compute the Levenshtein distance between the requested word and the adjacent words (and their neighbors, to bound the search), and add the ones that qualify to the return list
  • Return the list of selected adjacent words

Quick and dirty implementation using /usr/share/dict/words (the left and right neighbors, and the selection among them, still have to be done):

DISCLAIMER: Binary search code borrowed from http://www.perlmonks.org/?node_id=503154

use strict;
use warnings;

open(FILE, "<", "/usr/share/dict/words") or die "Cannot open dictionary: $!";
my @lines = <FILE>;
my $word = $ARGV[0];

# Binary search over the sorted word list. $cmp compares the entry at a
# given index against the target and returns <0, 0, or >0.
sub BinSearch {
    my ($target, $cmp) = @_;
    my @array = @{$_[2]};
    my $posmin = 0;
    my $posmax = $#array;

    # Target sorts before the first or after the last entry.
    return -0.5 if &$cmp(0, \@array, $target) > 0;
    return $#array + 0.5 if &$cmp($#array, \@array, $target) < 0;

    while (1) {
        my $mid = int(($posmin + $posmax) / 2);
        my $result = &$cmp($mid, \@array, $target);

        if ($result < 0) {
            # Target is above the midpoint; raise the lower bound.
            $posmin = $posmax, next if $mid == $posmin && $posmax != $posmin;
            if ($mid == $posmin) {
                return "Not found, TODO find close match\n";
            }
            $posmin = $mid;
        }
        elsif ($result > 0) {
            # Target is below the midpoint; lower the upper bound.
            $posmax = $posmin, next if $mid == $posmax && $posmax != $posmin;
            if ($mid == $posmax) {
                return "Not found, TODO find close match\n";
            }
            $posmax = $mid;
        }
        else {
            return "Found: " . $array[$mid];
        }
    }
}

# Case-insensitive comparison of the dictionary entry at $index against
# the target word.
sub cmpFunc {
    my ($index, $arrayRef, $target) = @_;
    my $item = lc($$arrayRef[$index]);
    return $item cmp lc($target);
}

# Dictionary lines keep their trailing newline, so append one to the
# target before comparing.
print BinSearch($word . "\n", \&cmpFunc, \@lines) . "\n";

Usage (if the script is called find_words.pl):

perl find_words.pl word

where word is the word you want to find.
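
The "TODO find close match" branch could be filled in with a neighborhood scan: take the index where the search stopped and test a window of surrounding entries with Levenshtein distance. A minimal sketch, assuming the search is changed to return that stopping index and that the Text::Levenshtein CPAN module (used in the next answer) is installed; the window size and distance cutoff are arbitrary choices:

use Text::Levenshtein qw(distance);

# Hypothetical helper: collect words within $maxdist edits of $target
# from the $window entries on either side of index $stop.
sub CloseMatches {
    my ($target, $arrayRef, $stop, $window, $maxdist) = @_;
    my $lo = $stop - $window < 0 ? 0 : $stop - $window;
    my $hi = $stop + $window > $#$arrayRef ? $#$arrayRef : $stop + $window;
    my @close;
    for my $i ($lo .. $hi) {
        chomp(my $w = $$arrayRef[$i]);
        push @close, $w if distance(lc($target), lc($w)) <= $maxdist;
    }
    return @close;
}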

+5

The standard algorithm for this kind of "fuzzy" word search is the Levenshtein distance. It does not find similar words by itself; it measures how similar two words are. That similarity score (the Levenshtein distance) can then be used by a sort or filter function to select the similar words.

The distance is measured simply: it is the number of edits needed to turn the target word into the candidate word. A distance of 3, for example, means the words are 3 edits apart (not necessarily 3 differing characters, since inserting and deleting characters count as edits too).
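
For instance, the distance between "kitten" and "sitting" is 3: substitute k with s, substitute e with i, and append g. A minimal sketch of the classic dynamic-programming computation (the Rosetta Code link below has fuller versions):

sub levenshtein {
    my ($s, $t) = @_;
    # $d[$j] holds the distance between the current prefix of $s and
    # the first $j characters of $t.
    my @d = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @next = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @next, (sort { $a <=> $b } (
                $d[$j] + 1,          # deletion
                $next[$j - 1] + 1,   # insertion
                $d[$j - 1] + $cost,  # substitution
            ))[0];
        }
        @d = @next;
    }
    return $d[-1];
}

print levenshtein("kitten", "sitting"), "\n";   # prints 3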

The Rosetta Code website has a list of Levenshtein distance algorithms implemented in different languages, including Tcl and Perl: http://rosettacode.org/wiki/Levenshtein_distance

The Tcler's Wiki has a page discussing similarity algorithms, including several implementations of Levenshtein distance: similarities

There is also a CPAN module for Perl that you can use: Text::Levenshtein

So in Perl you can just do:

use Text::Levenshtein;

my %word_distance;
# In list context, distance() returns one distance per word in
# @dictionary, filling the hash slice word by word.
@word_distance{@dictionary} = distance($word, @dictionary);

Then walk the %word_distance hash to find the most similar words.
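
For example, a minimal way to pull out the five closest candidates (the cutoff of five is an arbitrary choice):

# Sort dictionary words by ascending distance and report the best few.
my @closest = (sort { $word_distance{$a} <=> $word_distance{$b} }
               keys %word_distance)[0 .. 4];
print "$_ ($word_distance{$_} edits)\n" for @closest;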

+4

The problem with using a simple binary search to get a neighborhood of words and then refining it with the Levenshtein algorithm is that errors can occur at the beginning of a word as well as at the end; you risk losing entirely the words where the mistake comes early. A more effective method might be to use the Soundex algorithm to generate the sort keys for the word list, so that the search operates on rough phonetic similarity. You can then refine with Levenshtein, weighting the measured similarity by the frequency of the words in the underlying corpus; assuming users are more likely to want a common word than a rare one, this is a useful measure. (This assumes you have a corpus, but if you want to emulate Google, you definitely have one.)
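
A sketch of the Soundex keying idea, using the Text::Soundex CPAN module; the frequency weighting is left out here, since it depends on a corpus you would supply:

use Text::Soundex;

# Bucket the dictionary by Soundex key, so a query is compared only
# against phonetically similar words, wherever in the word the typo falls.
my %by_key;
for my $w (@dictionary) {
    my $key = soundex($w);
    push @{ $by_key{$key} }, $w if defined $key;
}

# Candidates share the query's key; refine them with Levenshtein.
my $qkey = soundex($query);
my @candidates = defined $qkey ? @{ $by_key{$qkey} || [] } : ();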

It might be better still to look at using some kind of map-reduce mechanism to run a weighted Levenshtein distance metric over the entire word set. This is more a case of throwing hardware at the problem, but it avoids the issue of words being missed by the initial filter. Alas, it also means you end up with something that cannot be shipped as a simple piece of software (the supporting systems for something like this are hardly what you want to impose on an ordinary user), but rather deployed as a service.
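
On a single machine the same idea reduces to brute-force scoring of every word. A sketch, where %freq is a hypothetical word-frequency table built from your corpus and the log discount is one arbitrary way to favor common words; in a real deployment this scoring (map) step would be sharded across workers:

use Text::Levenshtein qw(distance);

# Score every word: edit distance, discounted for common words so that
# at equal distance the more frequent word ranks first.
my @ranked = sort { $a->[1] <=> $b->[1] }
             map  { [ $_, distance($query, $_) / log(2 + ($freq{$_} // 0)) ] }
             @dictionary;

print "$_->[0]\n" for @ranked[0 .. 9];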

+2

Source: https://habr.com/ru/post/1502965/

