Algorithm for optimizing a string from substrings

Suppose I have a set of substrings, for example:

string a = {"cat","sensitive","ate","energy","tense"} 

Then the conclusion for this should be as follows:

 catensesensitivenergy 

How can i do this?

+6
source share
3 answers

This problem is known as the shortest common super string problem, and it is NP-hard, so if you need an exact solution, you cannot do much better than trying to use all the features and choose the best one.

One possible exponential solution is to generate all permutations of the input lines, find the shortest common superstring with greed for each permutation (the permutation sets the order of the lines, and you can prove that for a fixed order the greedy algorithm always works correctly) and select the best one.

+5
source

Using the user2040251 clause:

 #include <string> #include <iostream> #include <algorithm> std::string merge_strings( const std::vector< std::string > & pool ) { std::string retval; for( auto s : pool ) if( retval.empty() ) retval.append( s ); else if( std::search( retval.begin(), retval.end(), s.begin(), s.end() ) == retval.end() ) { size_t len = std::min( retval.size(), s.size() ); for( ; len; --len ) if( retval.substr( retval.size() - len ) == s.substr( 0, len ) ) { retval.append( s.substr( len ) ); break; } if( !len ) retval.append( s ); } return retval; } std::string shortest_common_supersequence( std::vector< std::string > & pool ) { std::sort( pool.begin(), pool.end() ); std::string buffer; std::string best_reduction = merge_strings( pool ); while( std::next_permutation( pool.begin(), pool.end() ) ) { buffer = merge_strings( pool ); if( buffer.size() < best_reduction.size() ) best_reduction = buffer; } return best_reduction; } int main( int argc, char ** argv ) { std::vector< std::string > a{"cat","sensitive","ate","energy","tense"}; std::vector< std::string > b{"cat","sensitive","ate","energy","tense","sit"}; std::vector< std::string > c{"personal","ate","energy","tense","gyroscope"}; std::cout << "best a --> \"" << shortest_common_supersequence( a ) << "\"\n"; std::cout << "best b --> \"" << shortest_common_supersequence( b ) << "\"\n"; std::cout << "best c --> \"" << shortest_common_supersequence( c ) << "\"\n"; return 0; } 

Output:

 best a --> "catensensitivenergy" best b --> "catensensitivenergy" best c --> "atensenergyroscopersonal" 
+3
source

Separate the problem and see what happened. Starting from two lines. We need to check which suffix of one line is the longest prefix of another. This gives us an order of better concatenation.

Now, with a set of Russian words, how do we do it? We start by creating a trie containing each word (key for each word). If the word is a duplicate of another, we can easily designate it as such when constructing the prefix tree.

I quickly completed the regular Trie. You can find it here .

We have tools for constructing a directed graph linking different words, and the suffix of the first is the prefix of the second. Edge weight is the length of the suffix.

To do this, for each word w of the input data set, we need to see what words we can achieve with the suffix w:

  • We go down the trie using the suffix. We will end up in node (or not).
  • From this node, if it exists, we scan the remaining subtree to see which words are available.
    • If the given suffix of length l gives a match with the prefix of the word w ', then we add the edge w → w', with the weight length (w ') - l.
    • If such an edge already exists, we simply update the weight to maintain the lowest.

A graph is established from there, and we must find the shortest path that passes through each vertex (for example, a word) only once . If the schedule is completed, this is a problem with the seller . In most cases, the schedule will not be completed.

However, this remains an NP-hard problem. In more "technical" terms, the problem is to find the shortest Hamiltonian path of the digraph .

Note. If a Hamiltonian path is given (if it exists) with its value C and its starting vertex (word) W, then the length of the super-summation is determined as follows:

L super = L W + C

Note. If in a nutshell there is no suffix connecting them with another word, then the graph is not connected and there is no Hamiltonian path.

+1
source

Source: https://habr.com/ru/post/979623/


All Articles