How to determine if a string is a concatenation of a string list

Suppose we are given a string S and a list of other strings L.

How can we determine whether S is a concatenation of the strings in L in some order, with each string used exactly once?

For instance:

S = "abcdabce"

L = ["abcd", "a", "bc", "e"]

Since S is "abcd" + "a" + "bc" + "e", S is a concatenation of L, while "ababcecd" is not.

To solve this problem, I tried DFS / backtracking. The pseudocode is as follows:

    boolean isConcatenation(S, L) {
        if (L.length == 1 && S == L[0]) return true;
        for (String s : L) {
            if (S.startsWith(s)) {
                markAsVisited(s);
                if (isConcatenation(S.exclude(s), L.exclude(s))) return true;
                markAsUnvisited(s);
            }
        }
        return false;
    }
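For reference, here is a minimal runnable sketch of the same backtracking idea in Python. The names `is_concatenation`, `parts` and `used` are illustrative and not from the pseudocode; instead of copying the list on every call, this version keeps a `used` flag per string and an offset into `s`.

```python
def is_concatenation(s, parts):
    """Return True if s equals some ordering of parts joined together."""
    used = [False] * len(parts)

    def backtrack(pos, remaining):
        # All strings placed: succeed only if s is fully consumed.
        if remaining == 0:
            return pos == len(s)
        for i, part in enumerate(parts):
            if not used[i] and s.startswith(part, pos):
                used[i] = True
                if backtrack(pos + len(part), remaining - 1):
                    return True
                used[i] = False  # undo the choice and try the next string
        return False

    return backtrack(0, len(parts))
```

Called as `is_concatenation("abcdabce", ["abcd", "a", "bc", "e"])`, it tries each unused string that matches at the current offset and backs out of dead ends.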

However, DFS / backtracking is not an efficient solution. I am curious what the fastest algorithm for this problem is, or whether any other algorithm solves it faster. I am hoping for something like KMP that can solve it in O(n) time.

7 answers

In Python:

    >>> import itertools
    >>> yes = 'abcdabce'
    >>> no = 'ababcecd'
    >>> L = ['abcd', 'a', 'bc', 'e']
    >>> yes in [''.join(p) for p in itertools.permutations(L)]
    True
    >>> no in [''.join(p) for p in itertools.permutations(L)]
    False

Edit: as pointed out, this has O(n!) complexity, so it is not suitable for large L. But hey, development time was under 10 seconds.

Instead, you can write your own permutation generator, starting from the basic one:

    def all_perms(elements):
        if len(elements) <= 1:
            yield elements
        else:
            for perm in all_perms(elements[1:]):
                for i in range(len(elements)):
                    yield perm[:i] + elements[0:1] + perm[i:]

Then prune the branches you don't need: keep track of the concatenation built so far, and only recurse further if it is still a prefix of your target string.

    def all_perms(elements, conc=''):
        ...
        for perm in all_perms(elements[1:], conc + elements[0]):
            ...
            if target.startswith(''.join(conc)):
                ...
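One possible way to flesh out that pruning idea is sketched below. It is not the elided code above: the generator name `matching_perms` and the `start` offset parameter are assumptions made for illustration. It builds orderings front to back and abandons a branch as soon as the next element does not match the target at the current offset.

```python
def matching_perms(elements, target, start=0):
    """Yield each ordering of elements whose concatenation equals target."""
    if not elements:
        # Every element has been placed; accept only if target is used up.
        if start == len(target):
            yield []
        return
    for i, head in enumerate(elements):
        # Prune: only descend if head matches target at the current offset.
        if target.startswith(head, start):
            rest = elements[:i] + elements[i + 1:]
            for tail in matching_perms(rest, target, start + len(head)):
                yield [head] + tail
```

To get a yes/no answer, `next(matching_perms(L, S), None) is not None` stops at the first valid ordering. The worst case is still factorial, but mismatching branches are cut early.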

A dynamic programming approach would work from left to right, building an array A[x], where A[x] is true if the first x characters of the string form a concatenation of strings from L. You can compute A[n] from the earlier entries by checking each candidate string in the list: if the characters of S ending at the n-th character match a candidate string of length k, and A[n-k] is true, then you can set A[n] to true.
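A minimal Python sketch of this table, with the hypothetical names `can_segment` and `reachable` standing in for the array A. Note that the recurrence as described does not track which strings have already been used, so this sketch accepts inputs that reuse a string.

```python
def can_segment(s, words):
    """reachable[x] is True if s[:x] is a concatenation of strings from words."""
    reachable = [False] * (len(s) + 1)
    reachable[0] = True  # the empty prefix is trivially buildable
    for n in range(1, len(s) + 1):
        for w in words:
            k = len(w)
            # w must end exactly at position n and the prefix before it must work.
            if 0 < k <= n and reachable[n - k] and s.startswith(w, n - k):
                reachable[n] = True
                break
    return reachable[len(s)]
```

This runs in time proportional to len(s) times the total length of the candidate strings. `can_segment("aa", ["a"])` returns True, which shows the reuse caveat.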

Note that you can use the Aho–Corasick algorithm (https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm) to find the matches you need as input to the dynamic program. The cost is then linear in the size of the input string, the total size of all the candidate strings, and the number of matches between the input string and the candidate strings.
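Below is a compact sketch of how the automaton could feed the table, assuming a plain dictionary-based trie with failure links. The function names `build_automaton` and `can_segment_ac` are made up for this example. Like the table it feeds, it does not enforce that each candidate is used only once.

```python
from collections import deque

def build_automaton(words):
    """Build trie transitions, failure links and output lists (word lengths)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # lengths of the words that end at each state
    for w in words:
        if not w:
            continue  # empty candidates contribute nothing
        state = 0
        for ch in w:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(len(w))
    # Breadth-first pass to fill in failure links, shallow states first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end here via the failure link.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def can_segment_ac(s, words):
    goto, fail, out = build_automaton(words)
    reachable = [False] * (len(s) + 1)
    reachable[0] = True
    state = 0
    for n, ch in enumerate(s, start=1):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        # Every length k in out[state] is a candidate ending exactly at n.
        reachable[n] = any(reachable[n - k] for k in out[state])
    return reachable[len(s)]
```

The scan visits each character of S once and only inspects candidates that actually match at that position, which is where the linear bound comes from.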


I would try the following:

  • find all positions of the patterns L_i in S
  • let n = length(S) + 1
  • create a graph with n nodes
  • for every position i at which a pattern L_i matches: add a directed edge node_i → node_{i + length(L_i)}
  • to enforce the permutation restriction, you need to add some more nodes / edges to rule out using the same pattern more than once
  • now the question becomes: is there a directed path from node 0 to the final node, node_{length(S)}?

notes:

  • if there exists a node i (0 < i < n) with degree < 2, then no match is possible
  • all nodes with in-degree 1 and out-degree 1 (d- = 1, d+ = 1) are part of the permutation
  • use breadth-first search or Dijkstra to find a solution
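The steps above can be sketched in Python as a breadth-first search. The answer does not say how to add the extra nodes for the permutation restriction. This sketch assumes one way of doing it: each node is a pair (position in S, bitmask of patterns already used), which can make the graph exponentially large in the number of patterns. The name `path_exists` is hypothetical.

```python
from collections import deque

def path_exists(s, patterns):
    """BFS over nodes (position in s, bitmask of patterns already used)."""
    target = (len(s), (1 << len(patterns)) - 1)
    start = (0, 0)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        pos, used = node
        for i, p in enumerate(patterns):
            # Edge pos -> pos + len(p), allowed only if pattern i is unused.
            if not used & (1 << i) and s.startswith(p, pos):
                nxt = (pos + len(p), used | (1 << i))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False
```

The `seen` set means each (position, used-set) pair is expanded at most once, unlike plain backtracking, which may revisit the same state along different orderings.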

You can use a trie data structure. First, build a trie from the strings in L.

Then, for the input string S, walk the trie along the characters of S.

During the walk, at every visited node that is the end of one of the words in L, start a new search in the trie (from the root) with the remaining (not yet consumed) suffix of S. So we are using recursion. If you consume all characters of S in this process, then you know that S is a concatenation of some strings from L.
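A rough Python sketch of that recursion, using nested dictionaries as the trie. The names `build_trie` and `can_build`, the `"$end"` marker key and the `failed` memo set are assumptions added for this example; the memo only avoids re-exploring suffixes already known to fail. As the answer says, this checks for a concatenation of some strings from L and does not enforce that each one is used exactly once.

```python
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        # Marker key; it cannot clash with the single-character child keys.
        node["$end"] = True
    return root

def can_build(s, root, start=0, failed=None):
    if failed is None:
        failed = set()
    if start == len(s):
        return True   # all of s has been consumed
    if start in failed:
        return False  # this suffix was already shown to be a dead end
    node = root
    for i in range(start, len(s)):
        node = node.get(s[i])
        if node is None:
            break
        # A word ends here: restart from the root on the remaining suffix.
        if "$end" in node and can_build(s, root, i + 1, failed):
            return True
    failed.add(start)
    return False
```

Usage: `can_build("abcdabce", build_trie(["abcd", "a", "bc", "e"]))`.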


I would suggest this solution:

  • Take an array of size 256 that stores the number of occurrences of each character across all strings of L. Compare it with the count of each character in S. If the two differ, we can say with confidence that S cannot be formed from L.
  • If the counts match, do the following: using the KMP algorithm, search S for all strings of L simultaneously. Whenever there is a match, delete that string from L and continue searching for the remaining strings of L. If at any point no match is found, report that S cannot be represented. If L is empty at the end, we conclude that S is indeed a concatenation of L.

Assuming L is a set of unique strings.
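The first step can be sketched in a couple of lines of Python. Here `collections.Counter` stands in for the 256-entry array, and the name `counts_match` is made up for this example. The check is a necessary condition only, which is why the second step is still needed.

```python
from collections import Counter

def counts_match(s, parts):
    """Necessary (not sufficient) condition: same multiset of characters."""
    return Counter(s) == Counter("".join(parts))
```

For the example in the question, both "abcdabce" and "ababcecd" pass this filter, even though only the first is a concatenation of L, while a string such as "abcdabcf" is rejected immediately.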


Two suggestions in Haskell:

There may be counterexamples to this one ... just for fun ... sorting L with a custom comparison:

    import Data.List (sortBy, isInfixOf)

    h s l = (concat . sortBy wierd $ l) == s
      where wierd a b | isInfixOf (a ++ b) s = LT
                      | isInfixOf (b ++ a) s = GT
                      | otherwise            = EQ


More boring ... trying to build S from L:

    import Data.List (delete, isPrefixOf)

    f s l = g s l []
      where g str subs result
              | concat result == s = [result]
              | otherwise =
                  if null str || null subs'
                     then []
                     else do sub <- subs'
                             g (drop (length sub) str) (delete sub subs) (result ++ [sub])
              where subs' = filter (flip isPrefixOf str) subs

Output:

    *Main> f "abcdabce" ["abcd", "a", "bc", "e", "abc"]
    [["abcd","a","bc","e"],["abcd","abc","e"]]
    *Main> h "abcdabce" ["abcd", "a", "bc", "e", "abc"]
    False
    *Main> h "abcdabce" ["abcd", "a", "bc", "e"]
    True

Your algorithm has complexity N^2 (N is the length of the list). Let's look at it in real C++:

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>
    using namespace std;

    typedef pair<string::const_iterator, string::const_iterator> stringp;
    typedef vector<string> strings;

    bool isConcatenation(stringp S, const strings& L) {
        for (strings::const_iterator p = L.begin(); p != L.end(); ++p) {
            // Bounded mismatch (C++14) so we never read past the end of S.
            auto M = mismatch(p->begin(), p->end(), S.first, S.second);
            if (M.first == p->end()) {  // *p is a prefix of the remaining input
                // Last string: succeed only if it consumes the rest of S.
                if (L.size() == 1) return M.second == S.second;
                // Recurse on the rest of S with *p removed from the list.
                strings T;
                T.insert(T.end(), L.begin(), p);
                strings::const_iterator v = p;
                T.insert(T.end(), ++v, L.end());
                if (isConcatenation(make_pair(M.second, S.second), T))
                    return true;
            }
        }
        return false;
    }

Instead of looping over the whole vector, we could sort it and then cut each search down to O(log N) steps in the best case, where all strings start with different characters. The worst case remains O(N^2).


Source: https://habr.com/ru/post/949774/

