How to find the largest substring of two lines?

How to find the largest COMMON substring among two strings?

+4
source share
3 answers

Perhaps using the suffix tree . Create trees for both rows, then use these structures to find common paths.

+5
source

I think you can solve this using a rather elegant dynamic programming algorithm that works in O (mn) time and O (mn) space, where m and n are the number of characters in each line. The idea is based on the following repetition. Let two lines: A = a 0 a 1 a 2 ... a n-1 and B = b 0 b 1 b 2 ... b m-1 and look at their first characters a 0 and b 0 . Then you can find the three longest common subsequences:

  • If the first characters are equal, one option would be to find the longest common subsequences of the remaining two lines, and then add the first character to the match.
  • Alternatively, you may not match the first two characters. In this case, one option would be to view the longest common subsequence that you could have done, ignoring the first character of the first line. Finally, you can also ignore the first character of the second line.

This gives us a very nice repetition:

LCS(A[0 .. n], B[0 .. m]) = longest of { A[0] + LCS(A[1 .. n], B[1 .. m]) (if A[0] == B[0]), LCS(A[1 .. n], B[0 .. m]), LCS(A[0 .. n], B[1 .. m]) } 

As basic cases, the longest common substring of any string and empty string is the empty string:

 LCS (A[n .. n], B[i, m]) = "" LCS (A[i .. n], B[m, m]) = "" 

This definition of the longest common substring allows us to calculate the LCS value (A [i .. n], B [j .. m]), taking into account the three LCS values ​​(A [i + 1 .. n], B [j + 1 .. m]), LCS (A [i .. n], B [j + 1 .. m]) and LCS (A [i + 1 .. n], B [j..m]). Therefore, if you calculate these values ​​in the correct order, you can simply populate the table with the results in one pass and build the result from there. Using some standard DP tricks, this can be done to execute in O (mn).

+4
source

In principle, there are two ways: 1. dynamic programming, the cost of which is O (mn) and O (mn). "templatetypedef" answered this. 2. The suffix tree, you need to build it first, the building process with the suffix is ​​O (m + n) (time and space), if without the suffix the link is O ((m + n) ^ 2) (time). Although building a suffix tree has the same efficiency with dynamic programming at best, however, after you build it, you can get the longest common substring in time O (k) (k is the length of the LCS).

0
source

Source: https://habr.com/ru/post/1338397/


All Articles