Extract all occurrences of repeating and unique patterns from text as well as context

Say I have the text "abcabx". I would like to know that there is a repeating pattern of "ab", all the locations that it appears, and how the context of these repetitions relates to its other manifestations. I also want the data structure to have unique "c" and "x" patterns, isolated and isolated. I have a tree suffix setting in an attempt to do this, and it looks like this ( from this SO answer ):

abcabx

This really tells me that the pattern "ab" appears twice, once with the suffix "cabx", and the other with "x". However, the “ab” at the root indicates only the first occurrence of the pattern. It also has another “ab” built into the “cabx” sheet when I want the “ab” (in “cabx”) to be somehow confirmed as a repetition in the data structure. I know that the "x" leaf of the root "ab" represents it, but I need to know in the "cabx" leaf of "ab" that there is "ab" there. In addition, two unique patterns “c” and “x” are part of this edge. Plus, their location in this region and cross-reference between their “main definitions” (root edges?). It seems that such things can be understood, iterating around the tree and combining them, but I need a data structure that stores this information in the right direction.

To put it simply, the data structure should clearly say “here are all unique patterns”, “here are all the repeating patterns and every place in which they occur”, and “here is the context that ties all these things together.”

So, I think I'm looking for a graph-like element in the suffix tree, something that will separate known patterns and directly link them. In this process, unique patterns will be noted. But I still need contextual features of the suffix tree, for example, as "c" (not "cabx", but "c") and "x" appeared after "ab", "abx" appeared after "abc", which (in large cases), etc. Is there an adaptation of the suffix tree that does this, or perhaps another algorithm?

+4
source share
1 answer

The suffix tree basically just stores all the suffixes of the string in such a way as to simplify the search for substrings. Each substring repeated more than once will correspond to exactly one nonterminal node. It is relatively easy to find the context in which the pattern appears - if you count the number of characters in each branch, this will give you the offset of the end of the substring from the end of the sequence, for example. there are two branches from ab , one of length 1 and one of length 4, so you know that the pattern displays 3 and 6 characters from the end of the line or 3 and 0 from the beginning.

+1
source

Source: https://habr.com/ru/post/1432946/


All Articles