Convert Barrows-Wheeler without EOF Symbol

I need to perform the famous Burroughs-Wheeler transform in linear time. I found a solution with sorting suffixes and an EOF character, but adding EOF modifies the conversion. For example: consider the line bcababa and two turns

  • s1 = abababc
  • s2 = ababcab

it is clear that s1 <s2. Now with the EOF symbol:

  • s1 = ababa # bc
  • s2 = aba # bcab

and now s2 <s1. And the resulting conversion will be different. How can I do BWT without EOF?

+6
source share
3 answers

You can perform the conversion in linear time and space without the EOF character by calculating the array of suffix of the string combined with itself. Then iterate over the suffix array. If the current value of the suffix array is less than n , add the last rotation symbol to your output array, starting from the position indicated by the current value in the suffix array. However, this approach will lead to a slightly different BWT conversion result, since line spins are not sorted as if the EOF character were present.

A more detailed description can be found here: http://www.quora.com/Algorithms/How-I-can-optimize-burrows-wheeler-transform-and-inverse-transform-to-work-in-On-time-On -space

+4
source

You need to have an EOF character in the string for BWT to work, because otherwise you will not be able to perform the inverse conversion to return the original string. Without EOF, both strings "ba" and "ab" have the same converted version ("ba"). With EOF, conversions are different.

 ab ba ab | a | b b | aba | | ab | ba 

i.e. ab is converted to "| ab" and ba to "b | a".

EOF is necessary for BWT because it marks the starting point of a character loop.

Re: do it without an EOF symbol, according to Wikipedia,

Since any rotation of the input string will lead to the same converted string, the BWT cannot be flipped without adding an EOF, an input marker, or, supplementing the output with information such as an index that allows you to identify the input string from the class of all its rotations.

There is a bijective version of the conversion by which the transformed string uniquely identifies the original. In this version, each line has a unique inverse of the same length.

A bijective transformation is calculated by first factoring the input into a non-increasing sequence of Lyndon's words; such factorization exists by the Chen-Fox-Lindon theorem and can be found in linear time. Then the algorithm sorts all the rotations of all these words; as in the usual Burroughs-Wheeler transform, this creates an ordered sequence of n lines. Then the converted string is obtained by selecting the final character of each of these strings in this sorted list.

+1
source

I know this thread is quite old, but I had the same problem and I came up with the following solution:

  • Find the lexicographic minimum string rotation and keep the offset (need to be flipped) (I use lydon factorization)
  • Use the usual bwt algorithms on a rotated string (this gives the correct output, because all the algos characters followed by the string follow the lexicographically minimal char)
  • To change: do not use, for example, reverse lookup starting at index 0, and write with char in the stored offset
0
source

Source: https://habr.com/ru/post/915355/


All Articles