How does this code work to get LCP from a suffix array?

Can someone explain how this code works to build LCP from a suffix array? suffixArr[] is an array in which suffixArr[i] contains the index value in the string for the suffix with rank i.

  void LCPconstruct() { int i,C[1001],l; C[suffixArr[0]] = n; for(i=1;i<n;i++) C[suffixArr[i]] = suffixArr[i-1]; l = 0; for(i=0;i<n;i++) { if(C[i]==n) LCPadj[i] = 0; else { while(i+l<n && C[i]+l<n && s[i+l] == s[C[i]+l]) l++; LCPadj[i] = l; l = max(l-1,0); } } for(i=0;i<n;i++) cout<<LCPadj[suffixArr[i]]<<"\n"; } 
+3
c ++ algorithm suffix-array
Oct 17 '14 at 15:41
source share
1 answer

Firstly, it is important to understand that the algorithm processes suffixes in the original order, i.e. the order in which they appear in the input line. Not in lexicographical order.

So, if the input string is abxabc , it first considers abxabc , then bxabc , then xabc , etc.

For each suffix that he considers in this order, he determines the position of the suffix, which is his lexicographic predecessor (*) (so here, and only here he uses the concept of lexicographic order), For the first suffix abxabc predecessor of lexicography, i.e. The suffix that appears immediately before it in the lexicographic ordering of suffixes is abc . He defines this by searching for O (1) in the array C , which was prepared specifically for this purpose.

The inner loop compares the characters abxabc and abc one by one and determines that these two suffixes have the first 2 common characters. This is the l variable in your code, and that means that the entry in LCP for the suffix abxabc must be 2, so we set LCPadj[i] = l . Note that i here refers to the position of the suffix in the input string, and not to its position in the suffix array. So LCPadj not an LCP array (yet). This is an auxiliary data structure.

Then it goes to the next line, which is equal to bxabc . Again, he uses C to find that bc is the lexicographic precursor of this, and then determines how many prefix characters they share. And here comes the trick: you can be sure that it should be at least as much as in the previous step (i.e. 2), minus 1. Why? Since the line that we are currently considering is bxabc , of course, is the suffix of the previously considered line ( abxabc ), so the lexicographic predecessor of this line ( abc ) should also have a suffix that is 1 character shorter ( bc ), and this suffix should also be somewhere in the suffix array, and it should share its prefix with the current line in question, minus the first character. Moreover, there can be no other suffix that is shorter and lexicographically closer to the line in question. The latter is quite logical if you think about how lexicographic sorting works, but there is also formal proof of this (for example, Lemma 5.10 in the lecture of Kärkkäinen here )

So, the basic principle of work is described here. There are a few things to note about your code to fully understand the role of each variable:

  • As explained, C is an auxiliary array ( n integers in length) that stores for each suffix in the input line the position of this other suffix, which is its closest lexicographic predecessor. This array is built not from left to right, but (reasonably), passing through the suffix array from left to right, since this makes it easy to determine the nearest lexicographic predecessor of any line: the immediate lexicographic predecessor of the suffix starting from the position suffixArr[i] should, of course, be in the position suffixArr[i-1] . Confirm in your code that this is how C defined.
  • As mentioned above, LCPadj stores the LCP values ​​for the suffix in the order in which they appear in the input string, and not in the order in which they appear in the suffix array. Therefore, during the output, LCPadj not printed from left to right, but an array of suffix is LCPadj[i] from left to right and LCPadj[i] is printed in this order. Confirm that this is so.

Hope this helps. Let me know if not.




(*) By the lexicographic predecessor, I mean the closest predecessor of the suffix in the lexicographically ordered list of suffixes, i.e. suffix to its immediate left in the suffix array.

+7
Oct 18 '14 at 9:48
source share



All Articles