Looking for the longest common subsequence in a variable number of sets without duplicate characters?

I am trying to find an efficient algorithm that will work in JavaScript for the longest common subsequence problem . However, there are two main differences between my problem and what is described in the Wikipedia article. Firstly, I will have more than two character sets. Secondly, characters will never be repeated in a set. This means that the length of each set will be no more than 50 characters (i.e. printed ASCII characters).

For example, kits may contain:

A = ZBANICOT B = ACNTBZIO C = ANICOTZB D = ZIANCOTB 

... and it should output ACO , ACT , ANO and ANT , since these are the longest subsequences in all 4 sets (as far as I can tell from the manual study of these sets).

Since none of the letters is repeated, is there a more efficient algorithm that I should consider, and not the one described in the Wikipedia article? If not, is there somewhere where it is described how to convert the algorithm to N sets instead of 2?

+1
source share
3 answers

Not sure how effective the adaptation of the normal LCS algorithm is, so it may be inefficient, but since it has few matching letters, it is not too terrible.

Make a successor matrix. For each row s for each pair of letters (s[i],s[j]) with i < j , increment matrix[s[i]][s[j]] . A pair of letters can be part of the longest common subsequence if matrix[a][b] = n = number of strings . Create a directed graph from the letters participating in such a pair, with an edge from a to b iff matrix[a][b] = n . Find the longest path on this graph.

+4
source

What you are trying to do is called multiple sequence alignment . The algorithm may be rewritten as n-dimensional, but this will increase the complexity of the space to Length ^ Dimensions .

Since none of the letters is repeated, is there a more efficient algorithm that I should consider, and not the one described in the Wikipedia article?

May be. First create a map (a, b), so a <b. For example, matching for the first three characters of each set is as follows

 --- Mapping for character[0] Z - BA - CA - NZ - I ANIA NICN IBOC CZTO OIZT TOBB --- Mapping for character[1] B - AC - NN - II - A NICN IBOC CZTO OIZT TOBB --- Mapping for character[2] A - NN - IN - CI - N IBOC CZTO OIZT TOBB 

Then save only the mappings common to all sets. Here is an example of this process running across Set [A] ( ZBANICOT ), which will give us a set of relationships common to all sets.

So, with the letter Z you will notice that in Set [C] the only character that follows Z is B , so you can exclude almost every mapping from Z that doesn't belong to B But, to shorten the long history, every mapping for Z is deleted. Note that only mappings from Z are deleted, not mappings in Z

Also note that since Set [D] ends in B , we can safely exclude all mappings from B Following this process, you will notice that A maps to N , C , O and T Next is N, which maps to T and O

I maps to O (to be honest, "I / O"), and C maps to O and T (how cute). And funny enough O doesn't display anything !!! As befitting.

Finally, there is another character, and T does not display anything (since ZBANICOT ends in T ).

In the end, you are left with the following mappings:

 A - CC - OI - ON - O NTT O T 

Now you only need to start with each of the keys ( A , C , I and N ) and look for the longest sequence without any duplicate nodes (characters in this case).

Here are all the common subsequences (all using a simple algorithm):

 ACO ACT ANO ANT AO AT CO CT IO NT NO 
+1
source

What if I saved a character mapping for each character set and then incremented that count every time ...

For the first set ( ZBANICOT ), this will create the following pairs of characters (28 of them):

 ZB, ZA, ZN, ZI, ZC, ZO, ZT BA, BN, BI, BC, BO, BT AN, AI, AC, AO, AT NI, NC, NO, NT IC, IO, IT CO, CT OT 

For the second set ( ACNTBZIO ) I will have 28 pairs again:

 AC, AN, AT, AB, AZ, AI, AO CN, CT, CB, CZ, CI, CO NT, NB, NZ, NI, NO TB, TZ, TI, TO BZ, BI, BO ZI, ZO IO 

For the last two sets, I would:

 AN, AI, AC, AO, AT, AZ, AB NI, NC, NO, NT, NZ, NB IC, IO, IT, IZ, IB CO, CT, CZ, CB OT, OZ, OB TZ, TB ZB ZI, ZA, ZN, ZC, ZO, ZT, ZB IA, IN, IC, IO, IT, IB AN, AC, AO, AT, AB NC, NO, NT, NB CO, CT, CB OT, OB TB 

(So, to determine the iteration counter so far, let N be the number of sets and M number of characters in each set, in this example we have N*(M)((M-1)/2) or 4*28=112 )

Next, I would count the number of times each pair exists. For instance,

 PairCounts["ZB"] == 3 PairCounts["ZA"] == 2 // ... PairCounts["AO"] == 4 PairCounts["AT"] == 4 // ... PairCounts["AN"] == 4 PairCounts["AC"] == 4 PairCounts["CT"] == 4 PairCounts["NO"] == 4 PairCounts["NT"] == 4 

Then, since we have 4 pairs, I would drop all pairs that did not have the number 4. And then I would try to match each pair to form the longest line.

0
source

Source: https://habr.com/ru/post/1388779/


All Articles