Build the shortest string from the collection of its subsequence

Given a set of subsequences from a string.

For instance:

abc acd bcd 

The problem is how to determine the shortest string from these sequences?

In the above example, the shortest string is abcd .

Here, subsequences mean part of the string, but not necessarily sequential. for example, acd is a subsequence of the string abcd .

Edit: this problem actually comes from Project Euler problem 79 , in this task we have 50 subsequences, each of which has a length of 3. and all characters are numbers.

+5
source share
5 answers

This problem has been well studied and invented " The shortest general supersymmetry. "

For two lines, it is in O (n). Let n be the maximum length of a string. O (n) is used to create an array of suffixes and then O (n) to solve the longest common subsequence .

For more than two lines, this is NP-Complete.

+5
source

Complexity

In the general case, as mentioned above, this problem will be NP-hard. There is a way to resolve the issue with a suffix structure, but you can also use a directed graph for this. I can’t say for sure if this is better in a sense, but there may be some advantage in difficult corner cases.

Graphical solution

To understand how you can use a chart, you just need to build it correctly. Let the letters of the lines be the vertices, and the order of the letters will determine the edges. This means that ab will mean the vertices a and b and the connection between a and b . Of course, the connections are directed (i.e., if a connected to b , this does not mean that b connected to a , so ab is a --> b )

After these definitions, you can build the entire schedule. For your sample, this is:

Simple regular case

- simple enough. But abcd can also be represented by strings of two lengths like ab, ac, ad, bc, bd, cd - so I will also show a graph for this (this is just for clarity):

Complex regular case

Then, to find your solution, you need to find the path with the maximum length on this graph. Obviously, there is a place from which NP-hardness is “derived”. In both cases, the maximum length is 4 , which will be achieved if we start from the vertex a and the movement graph to the found path a->b->c->d .

Corner cases

Unequal solution : in fact, you may encounter sets of strings that cannot strictly define a superset. Example: ab, ad, bd, ac, cd . Both abcd and acbd correspond to these substrings. But, in fact, this is not so bad. Look at the chart:

Ambiguous way case

(I chose this for a reason: it looks like a second graph, but without one connection, so the result is mixed). The maximum path length is now 3 - but this can be achieved using two paths: abd and acd . So, how to restore at least one solution? Simple: since the lines of results will be the same length (which comes from the definition of the way we found them), we can simply go from the beginning of the first line and check the characters in the second line. If they match, then this is the strict position of the character. But if they don’t, then we can choose any order between the current characters. So this will be:

  • [a] matches [a] , so the current result is a
  • [b] does not match [c] , so we can place either bc or cb . Let it be the first. Then the result is abc
  • [d] matches [d] , so the current result is abc + d , that is, abcd

This is a kind of "merger", where we can choose any result. However, this is a slightly distorted case, because now we can’t only use any path found - we must find all paths with the maximum length (otherwise we will not be able to restore full supersymmetry)

Further, a nonexistent solution : there may be cases when the order of characters in some lines cannot be used to reproduce supersymmetry. Obviously, this means that one order violates another order, so there will be two lines in which some two characters have different orders. Again, the simplest case is: abc, cbd . On the chart:

Non-existent solution

- the inevitable consequence will be the cycle of the schedule - it may not be as simple (as in the chart above), but it will always be if the order is disturbed. Thus, all you need to understand is to find a graph schedule . In fact. this is not something you should do separately. You simply add loop detection in your search for a long graph path.

Third: duplicate characters : this is the most difficult case. If the string contains duplicate characters, then creating a chart is more difficult, but you can still do it. For example, suppose three conditions: aac, abc, bbc . The solution for this is aabbc . How to build a schedule? Now we can’t just add links (at least in loops). But I suggest the following way:

  • Let's process all our lines by assigning indexes to characters. The reset pointer to a string. If a character appears only once, the index will be 1 . For our lines, this means: 1 a 2 c 1 , a 1 b 1 c 1 and b 1 b 2 c 1 . After that, save the maximum index for each character found. For the sample above, which will be 2 for a , 2 for b and 1 for c
  • If two linked indexed characters have the same original character, then the connection is performed “as is”. For example, 1 a 2 will create only one 1 in 2 connection
  • If two related indexed characters have different source characters, then any first indexed character can be associated with any second indexed character. For example, b 1 c 1 will result in the compound (b 1 to c 1 ) and (b 2 to c 1 ). How do we know how many indexed characters can be? We did this in the first step (i.e., found maximum indices)

Thus, the chart for the example above will look like this:

  + ------------ +
           |  |
           |  |
  + ------> [a2] ------ + |
  |  |  |  |
  |  VV |
 [a1] ----> [c1] <---- [b2] |
  |  ^ ^ |
  |  |  |  |
  + ------> [b1] ------ + |
           ^ |
           |  |
           + ------------ +

- we will have a maximum length of 5 for paths a 1 a 2 b 2 b 1 c 1 and 1 a 2 b 1 b 2 c <sub> 1 sub>. We can select any of them, ignoring the indices in the aabbc result aabbc .

+3
source

Maybe there is no general effective algorithm for such a problem. But this is the solution for this particular projecteerer 79 problem. You want to see the input file. You will find that 7 appears only at the beginning of all sequences, so 7 should be placed at the beginning. Then you find that the number 3 is the only character at the beginning, then you put 3 in the second position. So, and so on, you can get 73162890 . A special case is the last digits 2 , both 0 and 9 are at the beginning, then you have a choice of 2 . then you will try both and get 90 optimal solution.

More generally, you can use the depth search algorithm. The trick is that when you find that there is one character that appears only at the beginning, just select it, it will be the way to the optimal solution.

+2
source

I think we can start with graphs.
I am not sure of the correctness, but I will say if we build a graph with a-> b, if b comes after a, with all paths of length 1.
Now we need to find the longest distance (DFS can help).
I will try to give an example.

say the lines are:

a
DSA
Bcd
Bce

form a graph:
enter image description here
Now it remains only to combine the nodes e and d, because the required string can be abcde or abced .

This is something I don’t know how to do, so maybe someone can help!

Sorry to post this as an answer, comments cannot contain images.

+1
source

Construct a graph from subsequences, and then find the topological form of the graph in O (E) using DFS to get the desired row with the shortest length, which has the entire subsequence. But, as you would have noticed, topo-sort is not valid if there are loops in the graph, in this case it is necessary to repeat characters in loops that are difficult to solve.

Conclusion: - You are lucky if there are no cycles in the schedule and solve it in O (E), otherwise you are not lucky and in the end make brute force.

+1
source

Source: https://habr.com/ru/post/979624/


All Articles