Does the reasoning for switching to text inconsistency occur in the KMP algorithm?

I tried to understand the KMP algorithm. However, I did not get a clear understanding of the reasoning behind the kmp algorithm. Suppose my text is bacbababaabcbab and the template is abababca . Using the rule for the length of the longest regular sub(pattern) prefix, which corresponds to the sub(pattern) proper suffix, I populated my table[] .

abababca
0 0 1 2 3 4 0 1

Now I started applying the KMP algorithm for text with my template and table.

After going to index 4 from the above text, we will match length(l)=5; by looking at table[l-1]=3; . According to the KMP algorithm, we can skip up to 2 characters and continue.

bacbababaabcbab
---- xx |||
abababca

Here I do not get the shift logic. Why should we change? Can someone clarify my confusion?

+6
source share
2 answers

To understand the logic of the KMP algorithm, you must first understand how this KMP algorithm is better than brute force.

 Idea 

After shifting the template, the naive algorithm forgot all the information about previously matched characters. Thus, it is possible that he again and again repeats the comparison of a text character with different pattern characters. This leads to the worst complexity complexity Θ (nm) (n: text length, m: picture length).

The Knuth, Morris, and Pratt [KMP 77] algorithm uses information obtained by comparing previous characters. It never repeats the comparison of the text character that matches the pattern character. As a result, the complexity of the search phase of the Knut-Morris-Pratt algorithm is in O (n).

However, analysis of its structure requires preliminary processing of the template. The pre-treatment phase has a complexity of O (m). Since m <= n, the overall complexity of the Knut-Morris-Pratt algorithm is in O (n).

text: bacbababaabcbab picture: abababca

In the brute force method, Slide the pattern across the text one by one and check the match. If a match is found, then slides back to 1 to check for subsequent matches.

 void search(char *pat, char *txt) { int M = strlen(pat); int N = strlen(txt); /* A loop to slide pat[] one by one */ for (int i = 0; i <= N - M; i++) { int j; /* For current index i, check for pattern match */ for (j = 0; j < M; j++) { if (txt[i+j] != pat[j]) break; } if (j == M) // if pat[0...M-1] = txt[i, i+1, ...i+M-1] { printf("Pattern found at index %d \n", i); } } } 

The complexity of the algorithm is above O (nm). In the above algorithm, we never used the compared data that we processed,

 Bacbababaabcbab //let I be iterating variable over this text Abababca //j be iterating variable over pattern 

When i = 0 and j = 0, there is a mismatch (text [i + j]! = Pat [j]), we increase i until a match occurs. When i = 4, there is a match (text [i + j] == pat [j]), increase j until we get a mismatch (if j = patternlength we found a pattern), in this example we find a mismatch with j = 5 for i = 4, the discrepancy occurs when the index is 4 + 5 = 9 in the text. A substring matching ababa **

  • Why we need to choose longest proper prefix which is also proper suffix :

** From the above: we see that the mismatch occurred at 9, where the pattern ended with the substring ababa. Now, if we want to take advantage of the comparisons that we have done so far, we can skip (increase) I by more than 1, then the number of comparisons will be reduced, which will lead to better time complexity.
Now, what advantage can we take with the processed ababa comparison data. If we take a close look: the aba prefix of the ababa string is compared with the text and matched, the same thing happens with the aba suffix. But there is a mismatch of "b with" a

 Bacbababaabcbab |||||| ||||| x |||| || ababab 

But by the naive approach, we increase I to 5. But we know by looking at it, we can set i = 6, since the next occurrence of aba occurs in 6. Therefore, instead of comparing with each element in the text, we preprocess a template to find the longest correct prefix , which is also a suffix (called a border). In the above example, for 'ababa, and the length of the longest border is 3 (which is aba ). Thus, increment by: the length of the substring is the length of the longest border  => 5-3 = 2.
If our comparison fails in aba, then the length of the longest border is 1 and j = 3, therefore it increases by 2.

More on how to preprocess: http://www-igm.univ-mlv.fr/~lecroq/string/node8.html#SECTION0080 http://www.inf.fh-flensburg.de/lang/algorithmen/pattern /kmpen.htm

+4
source

I’m not sure that you have problems understanding only on this issue, so if you do not mind, I will simply describe (with the maximum possible explanation) the whole algorithm. The answer to your question is probably in the last paragraph, but you better read all of this to better understand my terminology.

During the KMP algorithm, you actually count almost the same values ​​as in the table (this is usually called a prefix function). Therefore, when you get position i in the text, you need to calculate the maximum length of the substring in the text ending in position i, which is equal to some template prefix. It is obvious that if and only if the length of this substring is equal to the length of the template, you find the template in the text. So how do you find this prefix value fast? (I suggest that you count them for a template using some O (n ^ 2) algorithm, which is not fast enough). Suppose that we have already done everything for the first i-1 characters of the text, and now we are working with position i . We also need the value of the prefix function for the previous character of the text: p[i-1] .

Let's compare the text [i] and the template [p [i-1]] (indexing from 0, if you don't mind). We already know that pattern[0:p[i-1]-1] == text[i-1+p[i-1],i-1] : the definition of p[i-1] . So, if text[i] == pattern[p[i-1]] , we now know that pattern[0:p[i-1]] == text[i-1+p[i-1] , i] 'and that p [i] = p [i - 1]. But the interesting part begins when text[i] != pattern[p[i-1]] .

When these characters are different, we begin to jump. The reason for this is because we want to find the next possible prefix as fast as we can. So how do we do it. Look at the picture here and follow the explanation (the yellow parts are the substrings found for text[i-1] ). We are trying to find the string s : s:=s1+text[i] . Due to the definition of the prefix function s1=s2, c=test[i] . But we already know (from finding the value for text[i-1] ) that the two yellow parts in the picture are the same, so s3 actually equal to s1 , and therefore s3=s1 . So we can find the length s1 : this is table[p[i-1]] . So now, if c1=text[i] , we have to stop: we found p[i] , this is s1.length + 1 . And if c1!=text[i] , we can just repeat the same jump, now looking at the first table[table[p[i-1]]] characters of the template, and therefore continue until we find the answer, or we get the first 0 characters in this case p[i]:=0 .

+1
source

Source: https://habr.com/ru/post/953827/


All Articles