Cyclic rolling algorithm

I coined the term "cycle", which rolls with the hope that it does not overlap with the existing term. Basically, I'm trying to come up with an algorithm for finding loops in printed text.

Some examples from simple to complex

Example 1

Given:

aaaaabcd 

I want to say:

 5x(a) bcd 

or algorithmically:

 for 1 .. 5 print a end print b print c print d 

Example 2

Given:

 ababababcd 

I want to say:

 4x(ab) cd 

or algorithmically:

 for 1 .. 4 print a print b end print c print d 

Example 3

Given:

 abcdbcdbcdbce 

I want to say:

 a 3x(bcd) bce 

or algorithmically:

 print a for 1 .. 3 print b print c print d end print b print c print d 

This did not remind me of any algorithm that I know of. I feel that some of the problems may be ambiguous, but finding one of the solutions is enough for me now. Efficiency is always welcome, but not necessary. How can i do this?

EDIT

First of all, thanks for all the discussions. I adapted the LZW algorithm from rosetta and ran it on my input:

 abcdbcdbcdbcdef 

who gave me:

 a b c d 8 => bc 10 => db 9 => cd 11 => bcd e f 

where i have a dictionary:

 aa cc bb ee dd ff 8 bc 9 cd 10 db 11 bcd 12 dbc 13 cdb 14 bcde 15 ef 7 ab 

It works well for compression, but that’s not quite what I wanted. What I need is more like compression in the algorithmic representation of my examples, which will have:

  • subsequent sequences (if the sequence repeats, there would be no other sequence between them)
  • no dictionary, but only loops
  • irreducable
  • with maximum sequence sizes (which minimizes the algorithmic representation)
  • and let nested loops be valid (contrary to what I said earlier in the comment)
+4
source share
1 answer

I start with an algorithm that gives maximum sequence sizes . Although this does not always minimize the algorithmic representation, it can be used as an approximation algorithm. Or it can be expanded to an optimal algorithm.

  • Start by creating a suffix array for your text along with the LCP array .
  • Sort the array of indices of the LCP array, the first indices of the large elements of the LCP array. This combines repeating sequences of the same length and allows you to process sequences greedily, starting with the maximum size of the sequence.
  • Extract suffix array entries grouped by LCP value (by group, I mean all entries with a selected LCP value, as well as all entries with large LCP values) and sort them by position in the text.
  • Filter records with positional difference not equal to LCP. For the rest of the entries, get length prefixes equal to LCP. This gives all possible sequences in the text.
  • Add sequences sorted by starting position to an ordered collection (for example, a binary search tree). Sequences are added in order of appearance in the sorted LCP, so longer sequences are added first. Sequences are added only if they are independent or if one of them is completely nested inside the other. Intervals of intersection are ignored. For example, in caba caba bab sequence ab intersects with caba , and therefore it is ignored. But in cababa cababa babab one instance of ab discarded, 2 copies are completely inside the larger sequence, and 2 copies are completely outside it.
  • At the end, this ordered collection contains all the information needed to create an algorithmic representation.

Example:

 Text ababcabab Suffix array ab abab ababcabab abcabab b bab babcabab bcabab cabab LCP array 2 4 2 0 1 3 1 0 Sorted LCP 4 3 2 2 1 1 0 0 Positional difference 5 5 2 2 2 2 - - Filtered LCP - - 2 2 - - - - Filtered prefixes (ab ab) (ab ab) 

An algorithm sketch that creates a minimal algorithmic representation.

Start with the first 4 steps of the previous algorithm. The fifth step must be changed. Now it is impossible to ignore intersecting intervals, so each sequence is added to the collection. Since the collection now contains intersecting intervals, it is better to implement it as some advanced data structure, for example, the Interval Tree .

Then, we recursively determine the length of the algorithmic representation for all sequences containing any nested sequences, starting with the smallest. When each sequence is evaluated, calculate the optimal algorithmic representation for the entire text. The algorithm uses dynamic programming to process the sequence or the entire text: it selects a matrix with the number of columns equal to the length of the text / sequence and the number of lines equal to the length of the algorithmic representation; performing a crawl in the interval tree in order, update this matrix with all the sequences possible for each text position; when several values ​​are possible for a certain cell, either select any of them, or give preference to longer or shorter subsequences.

+2
source

Source: https://habr.com/ru/post/1443492/


All Articles