The easiest way to solve the problem would indeed be a pre-computed map [bad word] → [suggestions].
The problem is that while deleting a letter only creates a few "bad words", there are many candidates for insertions and substitutions.
So I would suggest another solution ;)
Note: The edit distance you are describing is called Levenshtein distance.
The solution is described in incremental steps: the search speed should improve with each idea, and I have tried to organize them with the simpler ideas (in terms of implementation) first. Feel free to stop whenever you are comfortable with the results.
0. Preliminary
- Implement the Levenshtein distance algorithm
- Store the dictionary in a sorted sequence (e.g. `std::set`, although a sorted `std::deque` or `std::vector` would be more efficient in terms of performance)
Key points:
- The Levenshtein distance computation uses an array; at each step the next row is computed solely from the previous row
- The minimum distance in a row is always greater than (or equal to) the minimum of the previous row
The latter property allows you to implement a short-circuit: if you want to limit yourself to 2 errors (the threshold), then whenever the minimum of the current row exceeds 2, you can abandon the computation. A simple strategy is to return threshold + 1 as the distance.
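For concreteness, here is a minimal sketch of that row-by-row computation with the short-circuit; the function name and the threshold convention are just one possible choice, not a prescribed interface:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Row-by-row Levenshtein distance with an early exit: only two rows are
// kept, and as soon as the minimum of the current row exceeds the
// threshold the computation is abandoned and threshold + 1 is returned.
std::size_t levenshtein(const std::string& query,
                        const std::string& referent,
                        std::size_t threshold)
{
    std::vector<std::size_t> previous(query.size() + 1);
    std::vector<std::size_t> current(query.size() + 1);

    for (std::size_t j = 0; j <= query.size(); ++j) previous[j] = j;

    for (std::size_t i = 1; i <= referent.size(); ++i) {
        current[0] = i;
        std::size_t row_minimum = current[0];

        for (std::size_t j = 1; j <= query.size(); ++j) {
            std::size_t cost = (query[j - 1] == referent[i - 1]) ? 0 : 1;
            current[j] = std::min({ previous[j] + 1,            // deletion
                                    current[j - 1] + 1,         // insertion
                                    previous[j - 1] + cost });  // substitution
            row_minimum = std::min(row_minimum, current[j]);
        }

        // The row minimum never decreases, so no later row can do better.
        if (row_minimum > threshold) return threshold + 1;

        std::swap(previous, current);
    }
    return previous[query.size()];
}
```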
1. A first attempt
Let's begin simply.
We implement a linear scan: for each word we compute the (short-circuited) distance, and we keep the list of words that achieve the smallest distance so far.
It works great on small dictionaries.
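A sketch of such a linear scan, assuming the `levenshtein()` helper from the previous sketch and a dictionary stored in a `std::set<std::string>`:

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Linear scan over the whole dictionary: compute the (short-circuited)
// distance for each word and keep the words at the best distance so far.
std::vector<std::string> closest_words(const std::string& query,
                                       const std::set<std::string>& dictionary,
                                       std::size_t threshold)
{
    std::vector<std::string> best;
    std::size_t best_distance = threshold + 1;  // anything beyond this is rejected

    for (const std::string& word : dictionary) {
        std::size_t d = levenshtein(query, word, threshold);
        if (d < best_distance) {                // strictly better: restart the list
            best_distance = d;
            best.assign(1, word);
        } else if (d == best_distance && d <= threshold) {
            best.push_back(word);               // tie within the threshold: keep it too
        }
    }
    return best;
}
```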
2. Improving the data structure
The Levenshtein distance is never smaller than the difference in length between the two words.
By using the pair (length, word) as a key instead of just the word, you can restrict your search to the length range [length - threshold, length + threshold] and greatly reduce the search space.
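A sketch of how the (length, word) key restricts the range that needs to be visited, again assuming the same `levenshtein()` helper (the scoring inside the loop would be the same as in the linear scan):

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>

// Dictionary keyed by (length, word): only the words whose length lies in
// [query.size() - threshold, query.size() + threshold] need to be visited.
using LengthKeyedDictionary = std::set<std::pair<std::size_t, std::string>>;

void scan_by_length(const std::string& query,
                    const LengthKeyedDictionary& dictionary,
                    std::size_t threshold)
{
    std::size_t const shortest = query.size() > threshold ? query.size() - threshold : 0;
    std::size_t const longest  = query.size() + threshold;

    auto first = dictionary.lower_bound({ shortest, std::string() });
    auto last  = dictionary.lower_bound({ longest + 1, std::string() });

    for (auto it = first; it != last; ++it) {
        // it->second is a candidate word; score it exactly as in the linear scan.
        std::size_t distance = levenshtein(query, it->second, threshold);
        (void)distance;
    }
}
```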
3. Prefixes and pruning
To improve on this, we can notice that when we build the distance matrix row by row, one word is scanned in full (the word we are looking for) while the other (the referent from the dictionary) is not: we only use one of its letters per row.
This very important property means that for two referents sharing the same initial sequence (prefix), the first rows of the matrix will be identical.
Remember that I asked you to keep the dictionary sorted? This means that words sharing the same prefix are contiguous.
Suppose that you are checking your word against `cartoon`, and at `car` you realize that it cannot work (the distance is already too large): then any word starting with `car` will not work either, so you can skip words as long as they start with `car`.
The skip itself can be performed either linearly or using a binary search (find the first word whose prefix is greater than `car`):
- the linear skip works best if the prefix is long (few words to skip)
- the binary search works best for a short prefix (many words to skip)
How long "long" is depends on your dictionary, and you will have to measure. I would go with the binary search to start with; a sketch of the skip follows the note below.
Note: the length partitioning works against the prefix partitioning, but it prunes much more of the search space.
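A sketch of the binary-search skip, assuming the dictionary is kept as a sorted `std::vector<std::string>`; it also assumes the prefix is non-empty and does not end in the maximum `char` value:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Binary-search skip: given the prefix at which the distance already
// exceeded the threshold, find the first word in the sorted dictionary
// that does not start with that prefix.
std::vector<std::string>::const_iterator
skip_prefix(const std::vector<std::string>& dictionary,
            std::vector<std::string>::const_iterator current,
            const std::string& prefix)
{
    // The smallest string greater than every string starting with `prefix`
    // is `prefix` with its last character incremented (this sketch assumes
    // the prefix is non-empty and its last character is not CHAR_MAX).
    std::string upper_key = prefix;
    ++upper_key.back();

    return std::lower_bound(current, dictionary.cend(), upper_key);
}
```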
4. Prefixes and reuse
Now we will also try to reuse the computation as much as possible (and not just the "it does not work" result).
Suppose you have two words: `cartoon` and `carwash`.
You first compute the matrix, row by row, for `cartoon`. Then, when reading `carwash`, you need to determine the length of the common prefix (here `car`), and you can keep the first 4 rows of the matrix (corresponding to void, `c`, `a`, `r`).
Therefore, when you start computing `carwash`, you actually start the iteration at `w`.
To do this, simply use an array allocated once at the very beginning of your search, and make it large enough to accommodate the longest referent (you should know the length of the longest word in your dictionary).
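A sketch of that re-use, with the matrix rows kept in a buffer allocated once per search; the function name and interface are illustrative, and on the very first referent the common prefix length is assumed to be 0:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// rows[i] holds the Levenshtein row for the first i characters of the
// previous referent; only the rows beyond the common prefix are recomputed.
std::size_t levenshtein_reusing_rows(const std::string& query,
                                     const std::string& referent,
                                     std::size_t common_prefix,  // with the previous referent
                                     std::vector<std::vector<std::size_t>>& rows)
{
    // Row 0 (the "void" row) is the same for every referent; seed it once.
    if (rows.empty()) {
        rows.emplace_back(query.size() + 1);
        for (std::size_t j = 0; j <= query.size(); ++j) rows[0][j] = j;
    }

    // Rows 1..common_prefix are inherited from the previous referent;
    // start iterating at the first letter after the shared prefix.
    for (std::size_t i = common_prefix + 1; i <= referent.size(); ++i) {
        if (rows.size() <= i) rows.emplace_back(query.size() + 1);
        rows[i][0] = i;
        for (std::size_t j = 1; j <= query.size(); ++j) {
            std::size_t cost = (query[j - 1] == referent[i - 1]) ? 0 : 1;
            rows[i][j] = std::min({ rows[i - 1][j] + 1,            // deletion
                                    rows[i][j - 1] + 1,            // insertion
                                    rows[i - 1][j - 1] + cost });  // substitution
        }
    }
    return rows[referent.size()][query.size()];
}
```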
5. Using a "better" data structure
To make working with prefixes easier, you can use a Trie or a Patricia Tree to store the dictionary. However, these are not STL data structures, and you would need to augment them so that each subtree stores the range of lengths of the words below it, so you will have to write your own implementation. It is not as easy as it seems, because there are memory-fragmentation issues that can kill locality.
This is a last-resort option. It is costly to implement.
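Purely for illustration, a minimal sketch of what a trie node augmented with the length range might look like; the names and layout are just one possibility, not a recommended implementation:

```cpp
#include <cstddef>
#include <map>
#include <memory>

// A trie node augmented with the range of word lengths stored below it,
// so a whole subtree can be skipped when its lengths fall outside
// [query length - threshold, query length + threshold].
struct TrieNode {
    bool is_word = false;           // a dictionary word ends at this node
    std::size_t min_length = 0;     // shortest word in this subtree
    std::size_t max_length = 0;     // longest word in this subtree
    std::map<char, std::unique_ptr<TrieNode>> children;  // keyed by next letter
};
```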