To understand the logic of the KMP algorithm, you must first understand how this KMP algorithm is better than brute force.
Idea
After shifting the template, the naive algorithm forgot all the information about previously matched characters. Thus, it is possible that he again and again repeats the comparison of a text character with different pattern characters. This leads to the worst complexity complexity Θ (nm) (n: text length, m: picture length).
The Knuth, Morris, and Pratt [KMP 77] algorithm uses information obtained by comparing previous characters. It never repeats the comparison of the text character that matches the pattern character. As a result, the complexity of the search phase of the Knut-Morris-Pratt algorithm is in O (n).
However, analysis of its structure requires preliminary processing of the template. The pre-treatment phase has a complexity of O (m). Since m <= n, the overall complexity of the Knut-Morris-Pratt algorithm is in O (n).
text: bacbababaabcbab picture: abababca
In the brute force method, Slide the pattern across the text one by one and check the match. If a match is found, then slides back to 1 to check for subsequent matches.
void search(char *pat, char *txt) { int M = strlen(pat); int N = strlen(txt); for (int i = 0; i <= N - M; i++) { int j; for (j = 0; j < M; j++) { if (txt[i+j] != pat[j]) break; } if (j == M)
The complexity of the algorithm is above O (nm). In the above algorithm, we never used the compared data that we processed,
Bacbababaabcbab //let I be iterating variable over this text Abababca //j be iterating variable over pattern
When i = 0 and j = 0, there is a mismatch (text [i + j]! = Pat [j]), we increase i until a match occurs. When i = 4, there is a match (text [i + j] == pat [j]), increase j until we get a mismatch (if j = patternlength we found a pattern), in this example we find a mismatch with j = 5 for i = 4, the discrepancy occurs when the index is 4 + 5 = 9 in the text. A substring matching ababa **
Why we need to choose longest proper prefix which is also proper suffix :
** From the above: we see that the mismatch occurred at 9, where the pattern ended with the substring ababa. Now, if we want to take advantage of the comparisons that we have done so far, we can skip (increase) I by more than 1, then the number of comparisons will be reduced, which will lead to better time complexity.
Now, what advantage can we take with the processed ababa comparison data. If we take a close look: the aba prefix of the ababa string is compared with the text and matched, the same thing happens with the aba suffix. But there is a mismatch of "b with" a
Bacbababaabcbab |||||| ||||| x |||| || ababab
But by the naive approach, we increase I to 5. But we know by looking at it, we can set i = 6, since the next occurrence of aba occurs in 6. Therefore, instead of comparing with each element in the text, we preprocess a template to find the longest correct prefix , which is also a suffix (called a border). In the above example, for 'ababa, and the length of the longest border is 3 (which is aba ). Thus, increment by: the length of the substring is the length of the longest border => 5-3 = 2.
If our comparison fails in aba, then the length of the longest border is 1 and j = 3, therefore it increases by 2.
More on how to preprocess: http://www-igm.univ-mlv.fr/~lecroq/string/node8.html#SECTION0080 http://www.inf.fh-flensburg.de/lang/algorithmen/pattern /kmpen.htm