Compare string patterns with one or zero mismatch

Given that the string and pattern must be matched, how efficiently it is possible to find matches with zero or one mismatch.

eg) S = abbbaaabbbabab P = abab Matches are abbb(index 0),aaab(index 4),abbb(index 6),abab(index 10) 

I tried to change the KMP algorithm, but I'm not sure about that.

Please give me an idea to solve this problem.

Thanks.

+6
source share
4 answers

Ok, I found it! I found the best algorithm!

This may seem a little bold, but as long as the algorithm I'm going to suggest has both O(m + n) runtime and O(m + n) memory consumption, and the input data itself has the same properties, that the algorithm can only be optimized into a constant.

Algorithms Used

I am going to use a combination between KMP and Rabin Karp algorithms for my solution. Rabin Karp uses hashing to compare substrings of source strings. This requires a linear precalculation of time that uses linear additional memory, but from now on, the comparison between the substrings of two strings is O(1) constant (this is amortized if you handle the conflicts correctly).

What my decision will not do

My solution will not find all occurrences in the first line that matches the second line with no more than 1 difference. However, the algorithm can be changed so that for each initial index in the first row, if there is such a match, at least one of them will be found (this will remain for the reader).

Observations

Let m be the length of the second line and n length of the first line. I am going to divide the task into two parts: if I am going to find a match with the maximum difference, I want to find the substrings of the first line: PREF will be the substring before the only difference and the SUFF substring after the difference. I want len(PREF) + len(SUFF) + 1 = m , where PREF or SUFF will be artificially reduced if necessary (when the lines match no difference).

I am going to base my decision on one very important observation: suppose there is a substring of the first line starting with index i with length m , which corresponds to the second line with no more than one difference. Then, if you take PREF as long as possible, there will still be a solution for SUFF . This is obvious: I just maximize the difference to the end.

Algorithm

And now the algorithm itself follows. Start with regular KMP. Each time the extension of the prefix fails, and traces of failures must be executed, first check if the next letter is skipped, the remaining suffix will correspond to the remaining second line. If so, then the match found with a difference of not more than one character. If not, we continue with the usual KMP doing a Rabin Karp check every time a link to the failure is followed.

Let me clarify further the example of checking Rabin Karp. Suppose we are at a certain stage of KMP, and we find that first.substring[i, i + k - 1] matches the first k letters of the second line. Suppose also that the letter first[i + k] is different from second[k] . Then you check whether first.substring[i + k + 1, i + m - 1] second.substring[k + 1, m - 1] exactly second.substring[k + 1, m - 1] with the help of Rabin Karp. This is exactly the case when you maximally increased the index of the initial form of the prefix i and now you can try to see if there is a match with no more than one difference.

Rabin Karp will only be used when the fail link follows, which moves the initial prefix index with at least one, which means that no more than O(n) Rabin Karp calls are used, each with O(1) complexity, for a common linear complexity.

+5
source

This is called the approximate line corresponding to the problem . In your particular case, you want the maximum editing distance to be 1.

the bit algorithm is a fairly quick way to solve it.

+2
source

A complete overview of the various algorithms and how they are compared with each other is given by Gonzalo Navarro on an Excursion to approximate string matching , Pages 80, 81 and 82 show the results of complexity, including the worst and medium cases, and space requirements for various algorithms.

(In the notation used there, n means the length of the text you are looking for, m in the length of the template, & sigma; to the size of the alphabet and k to the maximum editing distance, which is 1 in your case.)

+2
source

To find all submatrices, including one mismatch, you need 2 z-functions (one for the original P and one for the inverse P). After that, the buld array is from the longest prefix subpatterns for the source and return string S. Later you need to reverse the second array. And in the end, everything is easy: skip the first array and check if the length of the longest prefix is ​​equal to the length of P. If so, then this is a coincidence without errors. If it is shorter then check the second array at position (i + length (P) - 1). If the sum of two values ​​is equal to the length (P) - 1, then this is a fake with one error.

Difficulty - O (len (P) + len (S))

+2
source

Source: https://habr.com/ru/post/913015/


All Articles