I tried to build a suffix tree based on the implementation of the Ukkonen algorithm according to the Mark Nelson method in java code, which is a variant of the code in:
http://www.sanfoundry.com/java-program-implement-suffix-tree/
The following code creates a tree of compact suffixes (compressed suffix trie) from scanning a text file containing the word "minimum", divided in a text file as follows:
min
im
ize
The suffix tree is compressed in the form of a list of arrays based on the Ukkonen algorithm using Edge-Label compression, so all suffixes can refer to the index of one representation of the array.
The code also displays all the contents and details of the suffix tree as follows:
Start End Suf First Last String
0 10 -1 7 7 e
0 4 0 1 1 i
0 6 4 0 1 mi
0 3 -1 2 7 nimize
0 9 -1 6 7 ze
4 5 -1 4 7 mize
4 2 -1 2 7 nimize
4 8 -1 6 7 ze
6 1 -1 2 7 nimize
6 7 -1 6 7 ze
I used a constructor that was modified from the current constructor in Mark Nelson’s Java code for its implementation by Ukkonen’s Java algorithm from the link above, but the rest of its code remains untouched:
public CompressedSuffixTrie(String f)
{
Edges = new Edge[ HASH_TABLE_SIZE ];
for (int i = 0; i < HASH_TABLE_SIZE; i++)
Edges[i] = new Edge();
Nodes = new Node[ MAX_LENGTH * 2 ];
for (int i = 0; i < MAX_LENGTH * 2 ; i++)
Nodes[i] = new Node();
active = new Suffix( 0, 0, -1 );
Scanner s;
try {
s = new Scanner(new File(f + ".txt"));
ArrayList<String> arraylist = new ArrayList<String>();
while (s.hasNextLine()){
arraylist.add(s.nextLine());
}
s.close();
StringBuilder sb = new StringBuilder();
for (String str : arraylist) {
sb.append(str);
}
String str = sb.toString();
this.T = str.toCharArray();
this.N = this.T.length - 1;
for (int i = 0 ; i <= this.N ; i++ )
this.AddPrefix( this.active, i );
this.dump_edges( this.N );
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
The code seems to work correctly, but I want to be able to find the first occurrence of pattern s in the suffix tree, which returns the starting index of the first occurrence of pattern s. For example, if s appears in the suffix tree, findString (s) will return the starting index of the first occurrence of s in the suffix tree. Otherwise, it will return -1.
Is there a way to do this so that the findString (s) method is not slower than the time complexity of O (| s |), where | s | is the length s?