How to find probable occurrences of substrings inside another string?

Say I have a set of words:

  • constitution
  • abracadabra
  • refrigerator
  • stackoverflow

And I have a "corrupted" sentence in which meaningful substrings of these words appear, in some order and possibly more than once. The words are also not necessarily clearly separated.

What algorithm can help me find the most likely occurrences of words from the collection in a corrupted sentence?

Here is an example input:

xbracadabrqbonstitution ibracadabrefrigeratos obracadabri xtackoverflotefrigeratos

From this input, I expect to recover this array of dictionary words:

['abracadabra', 'constitution', 'abracadabra', 'refrigerator', 'abracadabra', 'stackoverflow', 'refrigerator']

The sentences are quite short (usually 5-6 words), so I can afford memory- and CPU-intensive algorithms. In addition, the damage is always limited to the first and last few characters of each word; the middle is always intact (that is why I am looking for long common substrings).

Any ideas? Since the words are not clearly separated, a simple edit distance on whole words does not work.

+4
3 answers

Since there are very few words in the dictionary, and the words themselves are quite short, I would just try all possible substrings of each dictionary word. Of course, it makes no sense to search for substrings of length 0 or 1; you probably want a lower threshold on the substring length.

For each substring, you simply search for it in the sentence, and if it occurs, you mark the corresponding dictionary word as a possible part of the sentence. For speed, each search in the sentence can be done in O(n) (for example, using KMP or Rabin-Karp).
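
Python's in operator already performs this check; for illustration, here is a minimal Rabin-Karp sketch (the function name and the hash parameters are my own choices, not part of the answer), which reports every position where a pattern occurs:

    def rabin_karp_find_all(pattern, text, base=256, mod=1_000_000_007):
        # Expected O(len(text)) search via a rolling polynomial hash.
        m, n = len(pattern), len(text)
        if m == 0 or m > n:
            return []
        high = pow(base, m - 1, mod)  # weight of the window's leading character
        p_hash = w_hash = 0
        for i in range(m):
            p_hash = (p_hash * base + ord(pattern[i])) % mod
            w_hash = (w_hash * base + ord(text[i])) % mod
        hits = []
        for start in range(n - m + 1):
            # On a hash hit, compare the strings to rule out collisions.
            if w_hash == p_hash and text[start:start + m] == pattern:
                hits.append(start)
            if start < n - m:  # slide the window one character to the right
                w_hash = ((w_hash - ord(text[start]) * high) * base
                          + ord(text[start + m])) % mod
        return hits

    print(rabin_karp_find_all("bracad", "xbracadabrqbonstitution"))  # [1]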

Here is a simple hack in Python (using brute-force string matching):

d=["constitution","abracadabra","refrigerator","stackoverflow"] def substring_match(word,sentence,min_length): for start in xrange(0,len(word)): for end in xrange(start+min_length,len(word)): substr=word[start:end+1] if substr in sentence: return True return False def look_for_words(word_dict,sent_word): return [word for word in word_dict if substring_match(word,sent_word,5)] def look(word_dict,sentence): ret=[] for word in sentence.split(): ret.extend(look_for_words(word_dict,word)) return ret if __name__=='__main__': print "\n".join(look(d,"xbracadabrqbonstitution ibracadabrefrigeratos obracadabri xtackoverflotefrigeratos")) 
+1

Given the size of your problem, I would not worry about optimizing this solution at all, since anything sub-exponential will run instantly on 5-6 word sentences. I will just give you an algorithm that I am sure can produce the correct answer for a fuzzy problem like this; we can work on optimizing it afterwards.

First, you need a heuristic function f that takes a candidate word w and returns the closest dictionary word, or reports that there is no match.
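
For illustration, such an f could be sketched with difflib (the 0.8 cutoff and the names are assumptions of mine, not something the answer prescribes):

    import difflib

    DICTIONARY = ["constitution", "abracadabra", "refrigerator", "stackoverflow"]

    def f(w, cutoff=0.8):
        # Return the closest dictionary word, or None when nothing is
        # similar enough to count as a match.
        close = difflib.get_close_matches(w, DICTIONARY, n=1, cutoff=cutoff)
        return close[0] if close else None

    print(f("ibracadabr"))  # 'abracadabra'
    print(f("qwerty"))      # None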

Then you simply generate the set of all possible w inside your string. In the worst case, this means taking all substrings of length 1, then length 2, then length 3, up to the length of the string. The total number of w generated this way is about n(n-1)/2.

If you are worried about speed, you can cap the maximum word length, and the cost of generating the w's drops to something linear in the length of the string.
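
A possible sketch of this generation step (the min_len and max_len values are assumptions of mine; 13 just matches the longest dictionary word):

    def candidates(s, min_len=2, max_len=13):
        # All substrings of s with min_len <= length <= max_len,
        # paired with their start index for the overlap handling later.
        out = []
        for start in range(len(s)):
            for end in range(start + min_len, min(start + max_len, len(s)) + 1):
                out.append((start, s[start:end]))
        return out

    subs = candidates("xbracadabrq")
    print(len(subs))  # 55 (start, substring) pairs for this 11-character string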

Take your set of candidate words and feed each one in turn to f. You can use any heuristic you like to decide which candidates are accepted as real words from your dictionary, and what to do when selected words overlap. A simple implementation would sort all candidates by the index of their first letter and, whenever f returns a match, skip ahead to the end of the selected word.
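
Here is a minimal sketch of that greedy pass, combined with a difflib-based f as above (the cutoff and length bounds are assumptions of mine):

    import difflib

    DICTIONARY = ["constitution", "abracadabra", "refrigerator", "stackoverflow"]

    def f(w, cutoff=0.8):
        close = difflib.get_close_matches(w, DICTIONARY, n=1, cutoff=cutoff)
        return close[0] if close else None

    def greedy_select(s, min_len=6, max_len=13):
        # Scan left to right; at each position try the longest candidate
        # first, and on a match skip to the end of the selected word.
        found, i = [], 0
        while i < len(s):
            for j in range(min(len(s), i + max_len), i + min_len - 1, -1):
                match = f(s[i:j])
                if match is not None:
                    found.append(match)
                    i = j
                    break
            else:
                i += 1  # no word starts here; advance one character
        return found

    print(greedy_select("xbracadabrqbonstitution"))
    # ['abracadabra', 'constitution']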

+1

You can try the Levenshtein distance algorithm to find the words with the minimum distance to words in the dictionary (with a tolerance you define).
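
For illustration, a standard dynamic-programming implementation of the Levenshtein distance, applied to the words from the question:

    def levenshtein(a, b):
        # Classic DP over prefixes, keeping only one rolling row.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    words = ["constitution", "abracadabra", "refrigerator", "stackoverflow"]
    print(min(words, key=lambda w: levenshtein("ibracadabr", w)))  # abracadabra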

Good luck

0

Source: https://habr.com/ru/post/1391199/

