Comparison of java templates for the first appearance of the template s in the suffix tree for the implementation of the Ukkonovs Mark Nelson algorithm

I tried to build a suffix tree based on the implementation of the Ukkonen algorithm according to the Mark Nelson method in java code, which is a variant of the code in: http://www.sanfoundry.com/java-program-implement-suffix-tree/

The following code creates a tree of compact suffixes (compressed suffix trie) from scanning a text file containing the word "minimum", divided in a text file as follows:

min
im
ize    

The suffix tree is compressed in the form of a list of arrays based on the Ukkonen algorithm using Edge-Label compression, so all suffixes can refer to the index of one representation of the array.

The code also displays all the contents and details of the suffix tree as follows:

Start  End  Suf  First Last  String

  0    10  -1     7      7   e
  0     4   0     1      1   i
  0     6   4     0      1   mi
  0     3  -1     2      7   nimize
  0     9  -1     6      7   ze
  4     5  -1     4      7   mize
  4     2  -1     2      7   nimize
  4     8  -1     6      7   ze
  6     1  -1     2      7   nimize
  6     7  -1     6      7   ze

I used a constructor that was modified from the current constructor in Mark Nelson’s Java code for its implementation by Ukkonen’s Java algorithm from the link above, but the rest of its code remains untouched:

    public CompressedSuffixTrie(String f) // Create a compact compressed suffix trie from file f
{
    Edges = new Edge[ HASH_TABLE_SIZE ];
    for (int i = 0; i < HASH_TABLE_SIZE; i++)
        Edges[i] = new Edge();
    Nodes = new Node[ MAX_LENGTH * 2 ];
    for (int i = 0; i < MAX_LENGTH * 2 ; i++)
        Nodes[i] = new Node();
    active = new Suffix( 0, 0, -1 );

    // Create new Scanner to scan file
    Scanner s;
    try {
        s = new Scanner(new File(f + ".txt"));

        // ArrayList to be converted into String
        ArrayList<String> arraylist = new ArrayList<String>();
        // Add every line of text containing sequence from file f into ArrayList
        while (s.hasNextLine()){
            arraylist.add(s.nextLine());
        }
        s.close();

        // Convert ArrayList to String using advanced for-each loop and StringBuilder
        StringBuilder sb = new StringBuilder();
        for (String str : arraylist) {
            sb.append(str); // append contents to StringBuilder
        }
        String str = sb.toString(); // convert back to String

        /** Construct Suffix Tree **/       
        this.T = str.toCharArray();
        this.N = this.T.length - 1;  

        for (int i = 0 ; i <= this.N ; i++ )
            this.AddPrefix( this.active, i );
        this.dump_edges( this.N );      

    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

}

The code seems to work correctly, but I want to be able to find the first occurrence of pattern s in the suffix tree, which returns the starting index of the first occurrence of pattern s. For example, if s appears in the suffix tree, findString (s) will return the starting index of the first occurrence of s in the suffix tree. Otherwise, it will return -1.

Is there a way to do this so that the findString (s) method is not slower than the time complexity of O (| s |), where | s | is the length s?

+4
1

, , -1 Suf:

(\n?)\s+\d+\s+\d+\s+(?!-1)([\d-]+)\s+\d+\s+\d+\s+(\w+)(\n?)

2 , 3 .

, , , , .

0

Source: https://habr.com/ru/post/1613420/


All Articles