Simple spell check algorithm

I was assigned to create a simple spell checker for an assignment, but I wasn't given any instructions, so I was wondering if anyone could help me. I don't want anyone to complete the task for me, but any direction or help with the algorithm would be awesome! If what I'm asking is outside the scope of the site, then I'm sorry and I will look elsewhere. :)

The project loads correctly spelled words in lower case, and then has to make spelling suggestions based on two criteria:

  • A one-letter difference (a letter is added or removed so that the word matches a word in the dictionary). For example, "stack" would be a suggestion for "staick", and "cool" would be a suggestion for "coo".

  • A single replaced letter. So, for example, "bad" would be a suggestion for "baud".

So, just to make sure I've explained it correctly: I could load the words [hello, goodbye, fantasy, goodness, god], and then the suggestions for the (misspelled) word "godd" would be [god].

Speed is my main consideration here, and although I think I know a way to get this done, I'm really not sure how efficient it would be. The way I'm planning to do this is to create a map<string, vector<string>>, and then, for each correctly spelled word that is loaded, add the correctly spelled word as a key in the map and fill the vector with all the possible "wrong" permutations of that word.

Then, when I want to check a word, I will look through each vector in the map to see whether that word is a permutation of one of the correctly spelled words. If it is, I will add the key as a spelling suggestion.
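
In rough C++ terms, the generation step I have in mind would look something like this sketch (the function name and layout are just placeholders of mine):

    #include <string>
    #include <vector>

    // Generate all one-edit "wrong" permutations of a word, assuming a
    // 26-letter lowercase alphabet. For a word of length n this already
    // yields n deletions + 26*(n+1) insertions + 25*n substitutions.
    std::vector<std::string> oneEditVariants(const std::string& word) {
        static const std::string alphabet = "abcdefghijklmnopqrstuvwxyz";
        std::vector<std::string> variants;
        // Deletions: remove one letter.
        for (std::size_t i = 0; i < word.size(); ++i)
            variants.push_back(word.substr(0, i) + word.substr(i + 1));
        // Insertions: add one letter at each position.
        for (std::size_t i = 0; i <= word.size(); ++i)
            for (char c : alphabet)
                variants.push_back(word.substr(0, i) + c + word.substr(i));
        // Substitutions: replace one letter with a different one.
        for (std::size_t i = 0; i < word.size(); ++i)
            for (char c : alphabet)
                if (c != word[i]) {
                    std::string v = word;
                    v[i] = c;
                    variants.push_back(v);
                }
        return variants;
    }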

It seems like this would take HEAPS of memory, though, because there would surely be thousands of permutations for each word? It also seems like it would be very slow if my initial dictionary of correctly spelled words were large?

I thought that perhaps I could cut the time down by only looking at the keys that resemble the word I'm checking. But then again, if they are already similar in some way, that probably means the key itself is the suggestion, so I don't need all these permutations!

So yes, I'm a little stuck on which direction to take. I would really appreciate any help, since I'm really not sure how to judge the speed of the various ways of doing things (we haven't been taught this at all in class).

+6
4 answers

The easy way to solve the problem is with a precomputed map [misspelled word] → [suggestions].

The problem is that while deleting a letter creates only a few "misspelled words", insertion and substitution create many candidates: for a word of length n over a 26-letter alphabet, there are n deletions but 26·(n+1) insertions and 25·n substitutions.

So, I would suggest another solution;)

Note: The edit distance you are describing is called Levenshtein distance.

The solution is described step by step; the search speed should normally improve with each idea, and I have tried to order them starting with the simpler ones (in terms of implementation). Feel free to stop whenever you are comfortable with the results.


0. Preliminaries

  • Implement the Levenshtein distance algorithm
  • Store the dictionary in a sorted sequence (e.g. std::set, although a sorted std::deque or std::vector would perform better)

Key points:

  • The Levenshtein distance is computed row by row in a matrix: at each step, the next row is calculated from the previous row alone
  • The minimum distance in a row is always greater than (or equal to) the minimum of the previous row

The latter property allows a short-circuit implementation: if you want to limit yourself to 2 errors (a threshold), then whenever the minimum of the current row exceeds 2, you can abandon the computation. A simple strategy is to return threshold + 1 as the distance.
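
Here is a minimal sketch of such a short-circuiting implementation (the function name and signature are my own choice):

    #include <algorithm>
    #include <string>
    #include <vector>

    // Levenshtein distance with a short circuit: as soon as the minimum of
    // the current row exceeds `threshold`, we abandon the computation and
    // return threshold + 1. Only two rows are kept, since each row depends
    // on the previous row alone.
    std::size_t levenshtein(const std::string& a, const std::string& b,
                            std::size_t threshold) {
        std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
        for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
        for (std::size_t i = 1; i <= a.size(); ++i) {
            curr[0] = i;
            std::size_t rowMin = curr[0];
            for (std::size_t j = 1; j <= b.size(); ++j) {
                std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                curr[j] = std::min({prev[j] + 1,          // deletion
                                    curr[j - 1] + 1,      // insertion
                                    prev[j - 1] + cost}); // substitution
                rowMin = std::min(rowMin, curr[j]);
            }
            if (rowMin > threshold) return threshold + 1;  // short circuit
            std::swap(prev, curr);
        }
        return prev[b.size()];
    }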


1. A first attempt

Let's start simple.

Implement a linear scan: for each word, compute the (short-circuited) distance, and keep a list of the words that have achieved the smallest distance so far.

It works great on small dictionaries.
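
Assuming the levenshtein function sketched above, the linear scan could look like this:

    #include <string>
    #include <vector>

    // Linear scan: keep every word that achieves the best distance seen so
    // far, discarding the list whenever a strictly better distance appears.
    std::vector<std::string> suggest(const std::string& word,
                                     const std::vector<std::string>& dictionary,
                                     std::size_t threshold) {
        std::vector<std::string> best;
        std::size_t bestDist = threshold + 1;
        for (const std::string& ref : dictionary) {
            std::size_t d = levenshtein(word, ref, threshold);
            if (d < bestDist) { bestDist = d; best.clear(); }
            if (d == bestDist && d <= threshold) best.push_back(ref);
        }
        return best;
    }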


2. Improving the data structure

The Levenshtein distance is never less than the difference in length between the two words.

By using a pair (length, word) as the key instead of just the word, you can limit your search to the length range [length - edit, length + edit] and significantly reduce the search space.
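
A sketch of that range restriction over a std::set of (length, word) pairs (the helper name and the empty loop body are placeholders):

    #include <set>
    #include <string>
    #include <utility>

    using LenWord = std::pair<std::size_t, std::string>;

    // Visit only the words whose length is within maxEdit of the query word.
    void scanByLength(const std::set<LenWord>& dictionary,
                      const std::string& word, std::size_t maxEdit) {
        std::size_t lo = word.size() > maxEdit ? word.size() - maxEdit : 0;
        std::size_t hi = word.size() + maxEdit;
        auto first = dictionary.lower_bound({lo, std::string()});
        auto last  = dictionary.lower_bound({hi + 1, std::string()});
        for (auto it = first; it != last; ++it) {
            // it->second has a compatible length: compute its distance here,
            // e.g. levenshtein(word, it->second, maxEdit) from the sketch above.
        }
    }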


3. Prefixes and pruning

To improve on this, we can notice that when we build the distance matrix row by row, one word is scanned in full (the word we are looking for), but the other (the reference word) is not: we only use one of its letters per row.

This very important property means that for two reference words that share the same initial sequence (prefix), the first rows of the matrix will be identical.

Remember that I asked you to keep the dictionary sorted? It means that words sharing the same prefix are adjacent.

Suppose that you are checking your word against cartoon, and at car you realize it does not work (the distance is already too large); then any word starting with car will not work either, so you can skip words as long as they start with car.

The skip itself can be done either linearly or with a search (find the first word with a prefix greater than car):

  • a linear skip works best if the prefix is long (few words to skip)
  • a binary search works best if the prefix is short (many words to skip)

How long "long" is depends on your dictionary, and you will have to measure. I would start with the binary search (a sketch follows after the note below).

Note: the length partitioning works against the prefix partitioning, but it prunes much more of the search space.
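
As promised, a sketch of the binary-search skip over a sorted std::vector<std::string> (the helper name is mine; it assumes non-empty lowercase ASCII prefixes):

    #include <algorithm>
    #include <string>
    #include <vector>

    // Given a sorted dictionary and a failing prefix, return an iterator to
    // the first word that does NOT start with that prefix.
    std::vector<std::string>::const_iterator
    skipPrefix(const std::vector<std::string>& dict,
               std::vector<std::string>::const_iterator from,
               const std::string& prefix) {
        // The smallest string greater than every string starting with
        // `prefix` is obtained by incrementing the prefix's last character,
        // e.g. "car" -> "cas" (works for lowercase ASCII words).
        std::string upper = prefix;
        ++upper.back();
        return std::lower_bound(from, dict.end(), upper);
    }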


4. Prefixes and reuse

Now let's also try to reuse the computation as much as possible (and not just the "it doesn't work" result).

Suppose you have two words:

  • cartoon
  • carwash

First you compute the matrix, row by row, for cartoon. Then, when reading carwash, you determine the length of the common prefix (here, car), and you can keep the first 4 rows of the matrix (corresponding to the empty string, c, a, r).

Therefore, when you start computing carwash, you actually begin iterating at w.

To do this, simply use one matrix allocated at the very beginning of your search, and make it large enough to accommodate the largest reference word (you should know the length of the longest word in your dictionary).
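
A sketch of this row reuse over a sorted dictionary (names are mine; for clarity it omits the short circuit, and the recording of results is left as a comment):

    #include <algorithm>
    #include <string>
    #include <vector>

    // Length of the common prefix of two words.
    std::size_t commonPrefix(const std::string& a, const std::string& b) {
        std::size_t n = std::min(a.size(), b.size()), i = 0;
        while (i < n && a[i] == b[i]) ++i;
        return i;
    }

    // rows[i] is the Levenshtein row after consuming the first i letters of
    // the current reference word; when moving to the next (sorted) reference
    // word, only the rows beyond the common prefix need to be recomputed.
    void scanWithReuse(const std::vector<std::string>& dict,  // sorted!
                       const std::string& word, std::size_t maxLen) {
        // One matrix, allocated once, sized for the longest dictionary word.
        std::vector<std::vector<std::size_t>> rows(
            maxLen + 1, std::vector<std::size_t>(word.size() + 1));
        for (std::size_t j = 0; j <= word.size(); ++j) rows[0][j] = j;

        std::string prev;  // previous reference word
        for (const std::string& ref : dict) {
            std::size_t start = commonPrefix(prev, ref);  // rows 0..start valid
            for (std::size_t i = start + 1; i <= ref.size(); ++i) {
                rows[i][0] = i;
                for (std::size_t j = 1; j <= word.size(); ++j) {
                    std::size_t cost = (ref[i - 1] == word[j - 1]) ? 0 : 1;
                    rows[i][j] = std::min({rows[i - 1][j] + 1,
                                           rows[i][j - 1] + 1,
                                           rows[i - 1][j - 1] + cost});
                }
            }
            std::size_t dist = rows[ref.size()][word.size()];
            // ... record ref if dist is the best seen so far ...
            prev = ref;
        }
    }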


5. Using the “best” data structure

To make the prefix handling easier, you could use a Trie or a Patricia tree to store the dictionary. However, these are not STL data structures, and you would need to augment them so that each subtree stores the range of lengths of the words it contains, so you will have to write your own implementation. It is not as easy as it seems, because there are memory-fragmentation issues that can kill locality.

This is the last option. It is expensive.
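
For completeness, a minimal sketch of the kind of augmented node such a structure would need (the layout is mine, not a full implementation):

    #include <map>
    #include <memory>

    // Each subtree records the range of word lengths stored below it, so
    // whole subtrees can be pruned by the length criterion of step 2.
    struct TrieNode {
        std::map<char, std::unique_ptr<TrieNode>> children;
        bool isWord = false;     // a dictionary word ends at this node
        std::size_t minLen = 0;  // shortest word length in this subtree
        std::size_t maxLen = 0;  // longest word length in this subtree
    };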

+6

You should take a look at this explanation from Peter Norvig on how to write a spelling corrector:

How to write a spelling corrector

Everything is explained well in that article; as an example, the Python code for the spelling corrector is as follows:

    import re, collections

    def words(text): return re.findall('[a-z]+', text.lower())

    def train(features):
        model = collections.defaultdict(lambda: 1)
        for f in features:
            model[f] += 1
        return model

    # big.txt is a large corpus of correctly spelled training text
    NWORDS = train(words(open('big.txt').read()))

    alphabet = 'abcdefghijklmnopqrstuvwxyz'

    def edits1(word):
        splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes    = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
        inserts    = [a + c + b for a, b in splits for c in alphabet]
        return set(deletes + transposes + replaces + inserts)

    def known_edits2(word):
        return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

    def known(words): return set(w for w in words if w in NWORDS)

    def correct(word):
        candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
        return max(candidates, key=NWORDS.get)

Hopefully you can find what you need on Peter Norvig's website.

+4

For spell checking, many data structures would be useful, for example a BK-tree. See Damn Cool Algorithms, Part 1: BK-Trees. I wrote an implementation of it here.

My earlier code link may be misleading; this one is suitable for a spelling corrector.
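
For reference, here is a minimal BK-tree sketch over the Levenshtein distance (names are mine; a plain, un-thresholded distance is used as the metric):

    #include <algorithm>
    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    // Plain Levenshtein distance, used as the BK-tree metric.
    std::size_t dist(const std::string& a, const std::string& b) {
        std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
        for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
        for (std::size_t i = 1; i <= a.size(); ++i) {
            curr[0] = i;
            for (std::size_t j = 1; j <= b.size(); ++j)
                curr[j] = std::min({prev[j] + 1, curr[j - 1] + 1,
                                    prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)});
            std::swap(prev, curr);
        }
        return prev[b.size()];
    }

    struct BKNode {
        std::string word;
        std::map<std::size_t, std::unique_ptr<BKNode>> children;
    };

    // Insert: recurse along the child whose edge label equals the distance
    // between the new word and this node's word.
    void bkInsert(std::unique_ptr<BKNode>& node, const std::string& word) {
        if (!node) { node.reset(new BKNode{word, {}}); return; }
        std::size_t d = dist(node->word, word);
        if (d != 0) bkInsert(node->children[d], word);
    }

    // Query: by the triangle inequality, only children whose edge label lies
    // in [d - t, d + t] can contain matches within tolerance t.
    void bkQuery(const BKNode* node, const std::string& word, std::size_t t,
                 std::vector<std::string>& out) {
        if (!node) return;
        std::size_t d = dist(node->word, word);
        if (d <= t) out.push_back(node->word);
        for (const auto& kv : node->children)
            if (kv.first + t >= d && kv.first <= d + t)
                bkQuery(kv.second.get(), word, t, out);
    }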

+2

Off the top of my head, you could partition your suggestions by length and build a tree structure in which the children are longer versions of the shorter parent.

It should be pretty fast, but I'm not sure about the code itself; I'm not very good at C++.

0
