How to find repeating sequences of events

I am trying to find an efficient algorithm for identifying a repeating sequence of characters. Let them say that a sequence can be at least 3 characters, but only returns a sequence of maximum length. A dataset could potentially be thousands of characters. In addition, I want to know only about the sequence if it is repeated, say, 3 times.

As an example: ASHEKBSHEKCSHEDSHEK

"SHEK" occurs 3 times and will be identified. "SHE" occurs 4 times, but is not identified, since "SHEK" is a sequence of maximum length that contains this sequence.

In addition, no sequence of "seeds" is fed into the algorithm; it must find them autonomously.

Thanks in advance, J

+4
source share
4 answers

Try creating an array of suffixes for the string.

Interactive Builder: http://allisons.org/ll/AlgDS/Strings/Suffix/

Check the beginning of consecutive lines in the suffix array to match

+2
source

If you think there are possible start lines \ sum (n) / 2, and you are not just looking for a match, but also the substring with the most matches, I think your algorithm will have terrible theoretical complexity if it should be correct and complete .

However, you can get practical speed using Trie . The algorithm will look something like this:

  • For each offset into a string ... 1 For each substring of length ...
    • Insert it into a trie. Each node in the trie has a data value (the integer "count") that you increment when you visit the node.

After you build a trie to model your data, remove all subtrees from the trie with roots below some optimization threshold (3 in your case).

Those remaining paths should be few so that you can efficiently sort and select the ones you need.

I suggest this as a starting point because Trie is built to manage common prefixes and, as a side effect, compresses your dataset.

The personal choice I would make is to locate the substrings as a separate process after identifying the ones I want. Otherwise, you will store every substring location and it will explode your memory. Your calculations are already quite complicated.

Hope this makes sense! Good luck

0
source

Consider the following algorithm, where:

str is an event string

T(i) is the suffix tree for the substring str(0..i) .

T(i+1) quickly obtained from T(i) , for example, using this algorithm

For each position of the character i in the input string str , cross a path starting at the root of T(i) along the edges marked with consecutive characters from the input, starting at position i + 1 .

This path defines a repeating line. If the path is longer than the previously found paths, write the new maximum length and position i + 1 .

Update the suffix tree with str [i+1] and repeat for the next position.

Something like this pseudo code:

 max.len = 0 max.off = -1 T = update_suffix_tree (nil, str [0]) for i = 1 to len (str) r = root (T) j = i + 1 while j < len (str) and r.child (str [j]) != nil r = r.child (str [j]) ++j if j - i - 1 > max.len max.len = j - i - 1 max.off = i + 1 T = update_suffix_tree (T, str [i+1]) 

In the k th iteration, the internal while is performed in no more than n - k iterations, and the construction of the suffix tree is O(k) , therefore, the complexity of the body of the cycle is O(n) , and it performed n-1 times, therefore the whole complexity of the O(n^2) .

0
source

Source: https://habr.com/ru/post/1380596/


All Articles