The longest common substring of more than two strings - Python

I am looking for a Python library to find the longest common substring from a set of strings. There are two ways to solve this problem:

  • using suffixes
  • using dynamic programming.

The method implemented does not matter; what is important is that it works on a set of strings (not just two).
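For context, here is a minimal sketch of one way to attack the multi-string case without any library (the function name `long_substr_bsearch` is my own, not from any package): binary-search on the answer's length, which is valid because any common substring of length L contains common substrings of every shorter length.

```python
def long_substr_bsearch(strings):
    """Longest common substring of a set of strings, via binary search
    on the length of the answer. Candidates are drawn from the shortest
    string and tested for membership in every other string."""
    if not strings:
        return ""
    shortest = min(strings, key=len)

    def common_of_length(length):
        # Return some common substring of exactly this length, or None.
        for i in range(len(shortest) - length + 1):
            cand = shortest[i:i + length]
            if all(cand in s for s in strings):
                return cand
        return None

    lo, hi, best = 1, len(shortest), ""
    while lo <= hi:
        mid = (lo + hi) // 2
        cand = common_of_length(mid)
        if cand is not None:
            best, lo = cand, mid + 1  # a longer one may still exist
        else:
            hi = mid - 1
    return best

print(long_substr_bsearch(['Oh, hello, my friend.',
                           'I prefer Jelly Belly beans.',
                           'When hell freezes over!']))  # ell
```

This is still O(n) substring tests per probe, but only O(log n) probes, so it tends to beat the quadratic scans below on long inputs.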

+47
python string longest-substring
May 23 '10 at 18:37
7 answers

This pair of functions will find the longest common substring in an arbitrary array of strings:

    def long_substr(data):
        substr = ''
        if len(data) > 1 and len(data[0]) > 0:
            for i in range(len(data[0])):
                for j in range(len(data[0]) - i + 1):
                    if j > len(substr) and is_substr(data[0][i:i+j], data):
                        substr = data[0][i:i+j]
        return substr

    def is_substr(find, data):
        if len(data) < 1 or len(find) < 1:
            return False
        for i in range(len(data)):
            if find not in data[i]:
                return False
        return True

    print(long_substr(['Oh, hello, my friend.', 'I prefer Jelly Belly beans.', 'When hell freezes over!']))

No doubt the algorithm could be improved, and I don't have much experience with Python, so it could probably be written more efficiently and idiomatically, but it should do the job.

EDIT: inlined the second function is_substr, as suggested by J. F. Sebastian. Usage remains the same. Note: the algorithm is unchanged.

    def long_substr(data):
        substr = ''
        if len(data) > 1 and len(data[0]) > 0:
            for i in range(len(data[0])):
                for j in range(len(data[0]) - i + 1):
                    if j > len(substr) and all(data[0][i:i+j] in x for x in data):
                        substr = data[0][i:i+j]
        return substr

Hope this helps,

Jason

+49
May 24 '10 at

I prefer this version of is_substr, as I find it more readable and intuitive:

    def is_substr(find, data):
        """Takes a substring to find; returns True only if it is found
        in every string in the data list."""
        if len(find) < 1 or len(data) < 1:
            return False  # expected input DNE
        is_found = True  # and-ing with False anywhere in data will return False
        for i in data:
            print("Looking for substring %s in %s..." % (find, i))
            is_found = is_found and find in i
        return is_found
+4
Feb 18 '13 at 20:12
    def common_prefix(strings):
        """Find the longest string that is a prefix of all the strings."""
        if not strings:
            return ''
        prefix = strings[0]
        for s in strings:
            if len(s) < len(prefix):
                prefix = prefix[:len(s)]
            if not prefix:
                return ''
            for i in range(len(prefix)):
                if prefix[i] != s[i]:
                    prefix = prefix[:i]
                    break
        return prefix

From http://bitbucket.org/ned/cog/src/tip/cogapp/whiteutils.py
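Note that this answer solves a slightly different problem (longest common prefix rather than longest common substring). A quick demo, with the function repeated so the snippet runs standalone and an example input of my own choosing:

```python
def common_prefix(strings):
    """Find the longest string that is a prefix of all the strings."""
    if not strings:
        return ''
    prefix = strings[0]
    for s in strings:
        if len(s) < len(prefix):
            prefix = prefix[:len(s)]  # a prefix can't be longer than s
        if not prefix:
            return ''
        for i in range(len(prefix)):
            if prefix[i] != s[i]:
                prefix = prefix[:i]  # cut at the first mismatch
                break
    return prefix

print(common_prefix(['interstellar', 'interstate', 'internal']))  # inter
```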

+2
May 23 '10 at 18:59
    # This does not increase the asymptotic complexity,
    # but can still waste more time than it saves. TODO: profile
    def shortest_of(strings):
        return min(strings, key=len)

    def long_substr(strings):
        substr = ""
        if not strings:
            return substr
        reference = shortest_of(strings)  # strings[0]
        length = len(reference)
        # find a suitable slice i:j
        for i in range(length):
            # only consider strings at least len(substr) + 1 long
            for j in range(i + len(substr) + 1, length + 1):
                candidate = reference[i:j]
                # ↓ is the slice recalculated every time?
                if all(candidate in text for text in strings):
                    substr = candidate
        return substr

Disclaimer: this adds very little to jtjacques' answer. However, I hope it is more readable and faster, and it did not fit in a comment, so I am posting it as an answer. To be honest, I am not satisfied with shortest_of.
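For reference, the functions above run unchanged on the question's example (repeated here so the snippet is standalone):

```python
def shortest_of(strings):
    return min(strings, key=len)

def long_substr(strings):
    substr = ""
    if not strings:
        return substr
    reference = shortest_of(strings)
    length = len(reference)
    for i in range(length):
        # start j past the current best so only longer candidates are tried
        for j in range(i + len(substr) + 1, length + 1):
            candidate = reference[i:j]
            if all(candidate in text for text in strings):
                substr = candidate
    return substr

print(long_substr(['Oh, hello, my friend.',
                   'I prefer Jelly Belly beans.',
                   'When hell freezes over!']))  # ell
```

Starting the inner loop at `i + len(substr) + 1` is the key speedup over the accepted answer: shorter candidates can never improve the result, so they are never sliced or tested.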

+2
May 24 '10 at 7:51

You can use the SuffixTree module, a wrapper around an ANSI C implementation of generalized suffix trees. The module is easy to use ....

Take a look here.

+2
Apr 16 '11 at 9:30 a.m.

This can be done more concisely:

    def long_substr(data):
        substrs = lambda x: {x[i:i+j] for i in range(len(x)) for j in range(len(x) - i + 1)}
        s = substrs(data[0])
        for val in data[1:]:
            s.intersection_update(substrs(val))
        return max(s, key=len)

Sets are (probably) implemented as hash maps, which makes this somewhat memory-inefficient. If you (1) implement the set data type as a trie and (2) store only the suffixes in the trie, then force each node to be an endpoint (which would be equivalent to adding all substrings), THEN in theory I would expect this to be quite memory-efficient, especially since intersecting tries is very easy.

However, this implementation is short, and premature optimization is the root of a significant amount of wasted time.
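A rough sketch of the trie idea described above, with plain dicts standing in for trie nodes (all names here are mine, purely illustrative): insert every suffix so that each node corresponds to a substring, intersect the tries node-wise, then read off the deepest surviving path.

```python
def build_suffix_trie(s):
    """Trie of all suffixes of s; every root-to-node path is a substring of s."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root

def intersect(a, b):
    """Keep only the paths present in both tries."""
    return {ch: intersect(a[ch], b[ch]) for ch in a.keys() & b.keys()}

def deepest_path(node):
    """Longest root-to-node path, i.e. the longest surviving substring."""
    best = ""
    for ch, child in node.items():
        cand = ch + deepest_path(child)
        if len(cand) > len(best):
            best = cand
    return best

def long_substr_trie(strings):
    tries = [build_suffix_trie(s) for s in strings]
    common = tries[0]
    for t in tries[1:]:
        common = intersect(common, t)
    return deepest_path(common)

print(long_substr_trie(['Oh, hello, my friend.',
                        'I prefer Jelly Belly beans.',
                        'When hell freezes over!']))  # ell
```

Building each trie is O(n²) in nodes, so this sketch does not deliver the memory savings the answer speculates about; a real suffix tree with edge compression would be needed for that.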

+2
Sep 16 '15 at 14:29

If someone is looking for a generalized version that can also take a list of sequences of arbitrary objects:

    def get_longest_common_subseq(data):
        substr = []
        if len(data) > 1 and len(data[0]) > 0:
            for i in range(len(data[0])):
                for j in range(len(data[0]) - i + 1):
                    if j > len(substr) and is_subseq_of_any(data[0][i:i+j], data):
                        substr = data[0][i:i+j]
        return substr

    def is_subseq_of_any(find, data):
        if len(data) < 1 or len(find) < 1:
            return False
        for i in range(len(data)):
            if not is_subseq(find, data[i]):
                return False
        return True

    # Will also return True if possible_subseq == seq.
    def is_subseq(possible_subseq, seq):
        if len(possible_subseq) > len(seq):
            return False
        def get_length_n_slices(n):
            for i in range(len(seq) + 1 - n):
                yield seq[i:i+n]
        for slyce in get_length_n_slices(len(possible_subseq)):
            if slyce == possible_subseq:
                return True
        return False

    print(get_longest_common_subseq([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6]]))
    print(get_longest_common_subseq(['Oh, hello, my friend.', 'I prefer Jelly Belly beans.', 'When hell freezes over!']))
+1
Mar 05 '15 at 3:53


