Find a common substring between two lines

I would like to compare 2 lines and keep matching by separating the place where the comparison failed.

So, if I have 2 lines -

string1 = apples string2 = appleses answer = apples 

Another example, since a string can contain more than one word.

 string1 = apple pie available string2 = apple pies answer = apple pie 

I'm sure there is an easy way for Python to do this, but I can't handle it, any help and explanation is appreciated.

+51
python string algorithm time-complexity dynamic-programming
Sep 10 '13 at 9:50
source share
13 answers

It is called the longest common substring problem. Here I present a simple, understandable, but ineffective solution. It will take a long time to get the correct output for large strings, since the complexity of this algorithm is O (N ^ 2).

 def longestSubstringFinder(string1, string2): answer = "" len1, len2 = len(string1), len(string2) for i in range(len1): match = "" for j in range(len2): if (i + j < len1 and string1[i + j] == string2[j]): match += string2[j] else: if (len(match) > len(answer)): answer = match match = "" return answer print longestSubstringFinder("apple pie available", "apple pies") print longestSubstringFinder("apples", "appleses") print longestSubstringFinder("bapples", "cappleses") 

Exit

 apple pie apples apples 
+15
Sep 10 '13 at 11:28
source share

For complete difflib , the standard library contains many sequence comparison utilities. For example find_longest_match , which finds the longest common substring when used in strings. Usage example:

 from difflib import SequenceMatcher string1 = "apple pie available" string2 = "come have some apple pies" match = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2)) print(match) # -> Match(a=0, b=15, size=9) print(string1[match.a: match.a + match.size]) # -> apple pie print(string2[match.b: match.b + match.size]) # -> apple pie 
+127
Sep 09 '16 at 6:00
source share
 def common_start(sa, sb): """ returns the longest common substring from the beginning of sa and sb """ def _iter(): for a, b in zip(sa, sb): if a == b: yield a else: return return ''.join(_iter()) 
 >>> common_start("apple pie available", "apple pies") 'apple pie' 



Or a little weirder:

 def stop_iter(): """An easy way to break out of a generator""" raise StopIteration def common_start(sa, sb): return ''.join(a if a == b else stop_iter() for a, b in zip(sa, sb)) 

What could be more readable since

 def terminating(cond): """An easy way to break out of a generator""" if cond: return True raise StopIteration def common_start(sa, sb): return ''.join(a for a, b in zip(sa, sb) if terminating(a == b)) 
+32
Sep 10 '13 at 9:59 on
source share

You can also consider os.path.commonprefix which works with characters and, therefore, can be used for any strings.

 import os common = os.path.commonprefix(['apple pie available', 'apple pies']) assert common == 'apple pie' 
+9
Nov 07 '18 at
source share

Same as Evo , but with an arbitrary number of lines to compare:

 def common_start(*strings): """ Returns the longest common substring from the beginning of the `strings` """ def _iter(): for z in zip(*strings): if z.count(z[0]) == len(z): # check all elements in `z` are the same yield z[0] else: return return ''.join(_iter()) 
+5
Jul 15 '16 at 21:42
source share

Correct errors with the first answer:

 def longestSubstringFinder(string1, string2): answer = "" len1, len2 = len(string1), len(string2) for i in range(len1): for j in range(len2): lcs_temp=0 match='' while ((i+lcs_temp < len1) and (j+lcs_temp<len2) and string1[i+lcs_temp] == string2[j+lcs_temp]): match += string2[j+lcs_temp] lcs_temp+=1 if (len(match) > len(answer)): answer = match return answer print longestSubstringFinder("dd apple pie available", "apple pies") print longestSubstringFinder("cov_basic_as_cov_x_gt_y_rna_genes_w1000000", "cov_rna15pcs_as_cov_x_gt_y_rna_genes_w1000000") print longestSubstringFinder("bapples", "cappleses") print longestSubstringFinder("apples", "apples") 
+3
Mar 19 '17 at 3:33
source share

Try:

 import itertools as it ''.join(el[0] for el in it.takewhile(lambda t: t[0] == t[1], zip(string1, string2))) 

This is a comparison with the beginning of both lines.

+1
Sep 10 '13 at 10:10
source share

This is not the most efficient way to do this, but it is something that I could come up with, and it works. If someone can improve it, please. What he does is make a matrix and put 1 where the characters match. He then scans the matrix to find the longest diagonal of 1 s, tracking where it starts and ends. It then returns a substring of the input string with start and end positions as arguments.

Note. This finds only one longest common substring. If there is more than one, you can make an array to store the results and return it. Also, it is case sensitive, so (Apple pie, apple pie) will return pple pie.

 def longestSubstringFinder(str1, str2): answer = "" if len(str1) == len(str2): if str1==str2: return str1 else: longer=str1 shorter=str2 elif (len(str1) == 0 or len(str2) == 0): return "" elif len(str1)>len(str2): longer=str1 shorter=str2 else: longer=str2 shorter=str1 matrix = numpy.zeros((len(shorter), len(longer))) for i in range(len(shorter)): for j in range(len(longer)): if shorter[i]== longer[j]: matrix[i][j]=1 longest=0 start=[-1,-1] end=[-1,-1] for i in range(len(shorter)-1, -1, -1): for j in range(len(longer)): count=0 begin = [i,j] while matrix[i][j]==1: finish=[i,j] count=count+1 if j==len(longer)-1 or i==len(shorter)-1: break else: j=j+1 i=i+1 i = i-count if count>longest: longest=count start=begin end=finish break answer=shorter[int(start[0]): int(end[0])+1] return answer 
+1
Jan 09 '16 at 12:43 on
source share
 def matchingString(x,y): match='' for i in range(0,len(x)): for j in range(0,len(y)): k=1 # now applying while condition untill we find a substring match and length of substring is less than length of x and y while (i+k <= len(x) and j+k <= len(y) and x[i:i+k]==y[j:j+k]): if len(match) <= len(x[i:i+k]): match = x[i:i+k] k=k+1 return match print matchingString('apple','ale') #le print matchingString('apple pie available','apple pies') #apple pie 
+1
Oct 15 '17 at 17:25
source share

Returns the first long common substring:

 def compareTwoStrings(string1, string2): list1 = list(string1) list2 = list(string2) match = [] output = "" length = 0 for i in range(0, len(list1)): if list1[i] in list2: match.append(list1[i]) for j in range(i + 1, len(list1)): if ''.join(list1[i:j]) in string2: match.append(''.join(list1[i:j])) else: continue else: continue for string in match: if length < len(list(string)): length = len(list(string)) output = string else: continue return output 
0
Dec 22 '15 at 15:07
source share

First, a helper function adapted from the itertools pairing recipe to create substrings.

 import itertools def n_wise(iterable, n = 2): '''n = 2 -> (s0,s1), (s1,s2), (s2, s3), ... n = 3 -> (s0,s1, s2), (s1,s2, s3), (s2, s3, s4), ...''' a = itertools.tee(iterable, n) for x, thing in enumerate(a[1:]): for _ in range(x+1): next(thing, None) return zip(*a) 

The function then iterates over the substrings, the longest at first, and tests for membership. (performance not taken into account)

 def foo(s1, s2): '''Finds the longest matching substring ''' # the longest matching substring can only be as long as the shortest string #which string is shortest? shortest, longest = sorted([s1, s2], key = len) #iterate over substrings, longest substrings first for n in range(len(shortest)+1, 2, -1): for sub in n_wise(shortest, n): sub = ''.join(sub) if sub in longest: #return the first one found, it should be the longest return sub s = "fdomainster" t = "exdomainid" print(foo(s,t)) 



 >>> domain >>> 
0
Aug 12 '16 at 7:02
source share

s1 = "pineapple" s2 = "applc"

def LongestSubString (s1, s2): left = 0 right = len (s2)

 while(left<right): if(s2[left] not in s1): left = left+1 else: if(s2[left:right] not in s1): right = right - 1 else: return(s2[left:right]) break 

Blockquote

0
Oct 27 '18 at 21:10
source share

This is a problem in the class called "Longest Sequence Search." I gave some simple code that worked for me, also my inputs are sequence lists, which can also be a string, can help you:

 def longest_substring(list1,list2): both=[] if len(list1)>len(list2): small=list2 big=list1 else: small=list1 big=list2 removes=0 stop=0 for i in small: for j in big: print i, j if i!=j: removes+=1 if stop==1: break elif i==j: both.append(i) for q in range(removes+1): big.pop(0) stop=1 break removes=0 return both 
0
Dec 05 '18 at 11:55
source share



All Articles