Suffix Array Algorithm

Question

Suffix Array Algorithm

After some reading, I realized what constitutes a suffix array and an LCP array.

Suffix array . Represents the _lexicographic rank of each suffix array.

LCP array . Contains the match of the maximum length prefix between two consecutive suffixes after sorting them lexicographically.

I'm trying to figure out in a couple of days how the suffix and LCP array work exactly .

Here is the code that is taken from Codeforces :

/* Suffix array O(n lg^2 n) LCP table O(n) */ #include <cstdio> #include <algorithm> #include <cstring> using namespace std; #define REP(i, n) for (int i = 0; i < (int)(n); ++i) namespace SuffixArray { const int MAXN = 1 << 21; char * S; int N, gap; int sa[MAXN], pos[MAXN], tmp[MAXN], lcp[MAXN]; bool sufCmp(int i, int j) { if (pos[i] != pos[j]) return pos[i] < pos[j]; i += gap; j += gap; return (i < N && j < N) ? pos[i] < pos[j] : i > j; } void buildSA() { N = strlen(S); REP(i, N) sa[i] = i, pos[i] = S[i]; for (gap = 1;; gap *= 2) { sort(sa, sa + N, sufCmp); REP(i, N - 1) tmp[i + 1] = tmp[i] + sufCmp(sa[i], sa[i + 1]); REP(i, N) pos[sa[i]] = tmp[i]; if (tmp[N - 1] == N - 1) break; } } void buildLCP() { for (int i = 0, k = 0; i < N; ++i) if (pos[i] != N - 1) { for (int j = sa[pos[i] + 1]; S[i + k] == S[j + k];) ++k; lcp[pos[i]] = k; if (k)--k; } } } // end namespace SuffixArray

I can’t, I just can’t understand how this algorithm works. I tried using the example with a pencil and paper and wrote the steps associated with this, but lost the connection between them as too complicated for me, at least.

Any help regarding explanation using an example is perhaps much appreciated.

+43

c ++ algorithm data-structures suffix-array

Spandan Jul 20 '13 at 11:21

source share

2 answers

How did you choose Codeforces for this, I believe that you are also familiar with Topcoder :) Please check the link below and it has a few more links. I saw how many people learned the suffix array from just these links.

TopCoder link here

And a piece of gold for programmers. I learned the suffix array from this document

+3

Fallen Jul 20 '13 at 19:22

source share

jogojapan · Accepted Answer · 2013-07-20 15:05

Overview

This is the O (n log n) algorithm for constructing an array of suffixes (or rather, it would be if instead of ::sort two-pass ladle sorting was used).

It works by first sorting 2 grams ^(*) then 4 grams, then 8 grams, etc. of the original string S , so in the i-th iteration, we sort 2 ^i- grams. Obviously, for such iterations there can be no more than log ₂ (n), and the trick is that sorting 2 ^i- grams at the i-th step is facilitated by creating that each comparison of two 2 ^i- grams is performed in O (1) times (not O (2 ⁱ )).

How it's done? Well, in the first iteration, it sorts 2 grams (aka bigrams), and then performs the so-called lexicographic renaming. This means that it creates a new array (of length n ), which stores for each bitram its rank in the bigram sort.

Example for lexicographic renaming: Say we have a sorted list of some bigrams {'ab','ab','ca','cd','cd','ea'} . Then we assign ranks (i.e. lexicographic names), moving from left to right, starting at rank 0 and increasing the rank whenever we encounter new bigram changes. Thus, we assign ranks:

 ab : 0 ab : 0 [no change to previous] ca : 1 [increment because different from previous] cd : 2 [increment because different from previous] cd : 2 [no change to previous] ea : 3 [increment because different from previous]

These grades are known as lexicographic names.

Now, in the next iteration , we sort 4 grams. This is due to the large number of comparisons between different 4-grams. How do we compare two 4 grams? Well, we could compare them by nature. That would be up to 4 operations per comparison. But instead, we compare them by looking at the rows of two bigrams contained in them using the rank table generated in the previous steps. This rank is the lexicographic rank from the previous 2-gram sort, therefore, if for any given 4-gram its first 2-gram has a higher rank than the first 2 grams of another 4-gram, then it should be lexicographically large somewhere in the first two characters. Therefore, if for two 4-grams the rank of the first 2-gram is identical, they must be the same in the first two characters. In other words, to compare all four characters of two 4-grams, two searches in the rank table are enough.

After sorting, we again create new lexicographic names, this time for 4 grams.

In the third iteration, we need to sort by 8 grams. Again, two reviews in the lexicographic ranking table from the previous step are sufficient to compare all 8 characters of two given 8 grams.

And so on. Each iteration i has two steps:

Sort by 2 ^i- grams using the lexicographic names from the previous iteration to provide a comparison in 2 stages (i.e. time O (1))
Creation of new lexicographic names

We repeat this until all 2 ^i- grams are distinct. If this happens, we are done. How do we know if everyone is different? Well, lexicographic names are an increasing sequence of integers starting from 0. So if the highest lexicographic name generated in the iteration is the same as n-1 , then every 2 ⁱ -gram should have been given its own, distinct lexicographic name .

Implementation

Now let's look at the code to confirm all this. Variables used: sa[] is the suffix array that we are building. pos[] is the rank search table (ie, contains lexicographic names), in particular, pos[k] contains the lexicographic name k -th m-gram of the previous step. tmp[] is a helper array used to create pos[] .

I will give further explanations between the lines of code:

 void buildSA() { N = strlen(S); /* This is a loop that initializes sa[] and pos[]. For sa[] we assume the order the suffixes have in the given string. For pos[] we set the lexicographic rank of each 1-gram using the characters themselves. That makes sense, right? */ REP(i, N) sa[i] = i, pos[i] = S[i]; /* Gap is the length of the m-gram in each step, divided by 2. We start with 2-grams, so gap is 1 initially. It then increases to 2, 4, 8 and so on. */ for (gap = 1;; gap *= 2) { /* We sort by (gap*2)-grams: */ sort(sa, sa + N, sufCmp); /* We compute the lexicographic rank of each m-gram that we have sorted above. Notice how the rank is computed by comparing each n-gram at position i with its neighbor at i+1. If they are identical, the comparison yields 0, so the rank does not increase. Otherwise the comparison yields 1, so the rank increases by 1. */ REP(i, N - 1) tmp[i + 1] = tmp[i] + sufCmp(sa[i], sa[i + 1]); /* tmp contains the rank by position. Now we map this into pos, so that in the next step we can look it up per m-gram, rather than by position. */ REP(i, N) pos[sa[i]] = tmp[i]; /* If the largest lexicographic name generated is n-1, we are finished, because this means all m-grams must have been different. */ if (tmp[N - 1] == N - 1) break; } }

About the comparison function

The sufCmp function sufCmp used to compare two (2 * breaks) grams lexicographically. Thus, at the first iteration, the bitramas are compared, in the second iteration - 4 grams, then 8 grams and so on. This is controlled by gap , which is a global variable.

The naive implementation of sufCmp will be as follows:

 bool sufCmp(int i, int j) { int pos_i = sa[i]; int pos_j = sa[j]; int end_i = pos_i + 2*gap; int end_j = pos_j + 2*gap; if (end_i > N) end_i = N; if (end_j > N) end_j = N; while (i < end_i && j < end_j) { if (S[pos_i] != S[pos_j]) return S[pos_i] < S[pos_j]; pos_i += 1; pos_j += 1; } return (pos_i < N && pos_j < N) ? S[pos_i] < S[pos_j] : pos_i > pos_j; }

This would compare the (2 * gap) gram at the beginning of the i-th suffix pos_i:=sa[i] with that found at the beginning of the j-th suffix pos_j:=sa[j] . And he will compare them by character, that is, compare S[pos_i] with S[pos_j] , then S[pos_i+1] with S[pos_j+1] and so on. It continues until the characters are identical. As soon as they differ, it returns 1 if the character in the i-th suffix is less than the character in the j-th suffix, 0 otherwise. (Note that return a<b in the function returning int means that you return 1 if the condition is true, and 0 if it is false.)

The complex search condition in return-statement refers to the case when one of the (2 * gap) -grams is at the end of the line. In this case, either pos_i or pos_j reaches n before all (2 * spaces) characters are matched, even if all characters up to this point are identical. Then it will return 1 if the i-th suffix is at the end, and 0 if the j-th suffix is at the end. This is correct, because if all characters are identical, the shorter size is lexicographically smaller. If pos_i reached the end, the ith suffix should be shorter than the jth suffix.

It is clear that this naive implementation is O (clearance), i.e. its complexity is linear in length (2 * gap) -grams. However, the function used in your code uses lexicographic names to bring this to O (1) (in particular, to two comparisons):

 bool sufCmp(int i, int j) { if (pos[i] != pos[j]) return pos[i] < pos[j]; i += gap; j += gap; return (i < N && j < N) ? pos[i] < pos[j] : i > j; }

As you can see, instead of searching for individual characters S[i] and S[j] we check the lexicographic rank of the i-th and j-th suffix. The lexicographic ranks were calculated in the previous iteration for gram-grams. So, if pos[i] < pos[j] , then the ith suffix sa[i] should start with a space graph, which is lexicographically smaller than the gram-gram at the beginning of sa[j] . In other words, just by looking at pos[i] and pos[j] and comparing them, we compared the first whitespace characters of the two suffixes.

If the series are identical, we continue by comparing pos[i+gap] with pos[j+gap] . This is the same as comparing the following uppercase characters (2 * gap) -grams, i.e. Second half. If the series are indentical again, the two (2 * gap) grams are indentical, so we return 0. Otherwise, we return 1 if the i-th suffix is less than the j-th suffix, 0 otherwise.

Example

The following example illustrates the operation of the algorithm and, in particular, demonstrates the role of lexicographic names in the sorting algorithm.

The string we want to sort is abcxabcd . Three iterations are required to create a suffix array. At each iteration, I will show S (string), sa (current state of the suffix array) and tmp and pos , which are lexicographic names.

First we initialize:

 S abcxabcd sa 01234567 pos abcxabcd

Note how lexicographic names that originally represent the lexicographical rank of unigrams are simply identical to the characters (i.e., the unigrams themselves).

First iteration:

Sort sa using bitrams as a sort criterion:

 sa 04156273

The first two suffixes are 0 and 4, because these are the positions of bigram 'ab'. Then 1 and 5 (the positions of the bigram 'bc'), then 6 (bigram 'cd'), then 2 (bigram 'cx'). then 7 (incomplete bigram 'd'), then 3 (bigram 'xa'). Clearly, the positions correspond to an order based solely on bigrams characters.

Creation of lexicographic names:

 tmp 00112345

As described, lexicographic names are assigned as increasing integers. The first two suffixes (both starting with bigram 'ab') get 0, the next two (both starting with bigram 'bc') get 1, then 2, 3, 4, 5 (each other bigram).

Finally, compare this according to the positions in sa to get pos :

 sa 04156273 tmp 00112345 pos 01350124

(The pos method is created: Go sa from left to right and use the entry to define the index in pos . Use the appropriate entry in tmp to determine the value for this index. So, pos[0]:=0 , pos[4]:=0 , pos[1]:=1 , pos[5]:=1 , pos[6]:=2 , etc. The index comes from sa , the value from tmp .)

Second iteration:

We sort sa again, and again we look at the bits from pos (each of which represents a sequence of two bigrams of the original string).

 sa 04516273

Notice how position 1 to 5 has switched over from the previous version of sa . It used to be 15, now it's 51. This is due to the fact that in the previous iteration the bigrams at pos[1] and bigram at pos[5] were identical (both bc ), but now bigram at pos[5] is 12 , and bigram at pos[1] - 13 . Therefore, position 5 precedes position 1 . This is because lexicographic names are now bigrams of the original string: pos[5] represents bc and pos[6] represents 'cd'. So, together they represent bcd , and pos[1] represents bc and pos[2] represents cx , so together they represent bcx , which is really lexicographically larger than bcd .

Again, we generate lexicographic names by looking at the current sa version from left to right and comparing correlation bigrams in pos :

 tmp 00123456

The first two entries are still identical (both values are 0), since the corresponding digrams in pos are like 01 . The rest is a strictly increasing sequence of integers, since all other bits in pos unique.

We are comparing with the new pos as before (taking indices from sa and values from tmp ):

 sa 04516273 tmp 00123456 pos 02460135

Third iteration:

We sort sa again by taking the pos (as always) bitrams, which are now a sequence of four bigrams in the source string.

 sa 40516273

You will notice that now the first two records have switchable positions: 04 became 40 . This is due to the fact that bigram at pos[0] is 02 , and the one that is in pos[4] is 01 , the latter is obviously less lexicographically. The deep reason is that the two represent abcx and abcd respectively.

The creation of lexicographic names gives:

 tmp 01234567

They are all different, i.e. the highest is 7 , which is n-1 . So, we are done because the sorting is now based on m-grams, which are all different. Even if we continue, the sort order will not change.

Suggestion for improvement

The algorithm used to sort 2 ^i- grams at each iteration seems to be the built-in sort (or std::sort ). This means that it is a sort that takes O (n log n) time in the worst case at each iteration. Since in the worst case there are logical n-iterations, this makes it an O (n (log n) ² ) algorithm. However, sorting could be performed using two sorting passes in the form of a bucket, since the keys that we use to compare sorting (that is, the lexicographic names of the previous step) form an increasing whole sequence. Thus, this can be improved to a real O (n log n) -time algorithm for sorting suffixes.

Note

I believe that this is the original algorithm for constructing a suffix array, which was proposed in a 1992 article by Manber and Myers ( on Google Scholar , this should be the first hit, and it may have a link to a PDF file). This (at the same time, but independently of the work of Gonnet and Baeza-Yates) consisted in the fact that the introduced arrays of suffixes (also called pat arrays at that time) as a data structure interesting for further study.

Modern algorithms for constructing an array of suffixes are O (n), so the above is no longer the best available algorithm (at least not in terms of theoretical, worst complexity).

Footnotes

^(*) Through 2 grams, I mean a sequence of two consecutive characters of the original string. For example, when S=abcde is a string, then ab , bc , cd , de are 2-grams of S Similarly, abcd and bcde are 4 grams. Typically, an m-gram (for a positive integer m) is a sequence of m consecutive characters. 1 gram is also called unigrams, 2 grams are called bitrams, 3 grams are called trigrams. Some people continue tetragrams, pentagrams, etc.

Note that the suffix S that starts and position i is the (ni) -gram of S In addition, each m-gram (for any m) is a prefix of one of the suffixes S Therefore, sorting m-grams (for the maximum possible m) may be the first step in the way of sorting suffixes.

Suffix Array Algorithm

Overview

Implementation

Example

Suggestion for improvement

Note

Footnotes

More articles: