Eugene Myers' Diff Algorithm: Searching for the longest common subsequence "A" and "B"

I reviewed the Aiff Myers algorithm document Diff Algorithm . This is an algorithm implemented in the popular diff program.

On page 12 of this document, a pseudo-code of the algorithm is presented for finding the longest common subsequence A and B :

 LCS(A, N, B, M) If N > 0 and M > 0 Then Find the middle snake and the length of an optimal path for A and B. Suppose it is from (x, y) to (u, v). If D > 1 Then LCS(A[1..x], x, B[1..y], y) Output A[x+1..u] LCS(A[u+1..N], Nu, B[v+1..M], Mv) Else If M > N Then Output A[1..N]. Else Output B[1..M]. 

Suppose A = "A" and B = "B". In this case, N = 1 and M = 1. The average snake will be (x, y) = (0, 1) and (u, v) = (0, 1), because there are no diagonals. In this case, D = 1, because the algorithm took only one step.

The algorithm says that the only thing that needs to be done in this scenario is Output B[1..M] , equal to "B", because N> 0, M> 0, D = 1 and M = N. But this It seems wrong because there is no common subsequence between “A” and “B”. Commentary on the article: "If D <= 1, then B is obtained from A by deleting or inserting no more than one character" is incorrect, since "A" must be deleted and "B" added.

What am I misunderstanding here?

+6
source share
1 answer

You misunderstand the definition of a D-path and a snake.

From page 4:

Let a D-path be a path starting with (0,0), which has exactly D, not diagonal edges. Path 0 should consist only of diagonal edges. It follows by simple induction that the D-path must consist of a (D-1) -pair followed by an off-diagonal edge, and then an empty sequence of diagonal edges, called a snake, is possible

So, in your example, where A = "A" and B = "B", the optimal path is 2 paths (one horizontal and one vertical), and the middle snake is an empty string. From the check, we know that LCS is an empty string, but we would like to show an algorithm that will prove it.

First of all, we need to find the middle snake. If you follow the middle snake search algorithm on page 11, you will see that the length of the shortest editing script is 2 and (x, y) = (u, v) = (1,0) or (0, 1). In other words, it is an empty snake in the middle of the road.

The algorithm pseudo-code has some non-obvious conventions:

  • A [m..n] is an empty string if n <m.
  • In recursive LCS calls, the first call uses the ceiling (D / 2) as D, and the second call uses the floor (D / 2). This is made more clear in the text above the algorithm on page 12.

So, in this example case, suppose that the first LCS call finds a middle snake with (x, y) = (u, v) = (1,0), then, since D = 2, the result is a decomposition from:

 LCS(A[1..1], 1, B[1..0], 0) // Output nothing since M = 0 Output A[2..1] // Output nothing since it is an empty string. LCS(A[2..1], 0, B[1..1], 1) // Output nothing since N = 0 

This is correct since these lines do not have a common subsequence.

+6
source

Source: https://habr.com/ru/post/985143/


All Articles