SequenceMatcher for multiple inputs, not just two?

Question


I'm wondering how best to approach this particular problem, and whether there are any libraries for it (preferably Python, but I can be flexible if necessary).

I have a file with one string on each line. I would like to find the longest common patterns and their locations in each line. I know that I can use SequenceMatcher to compare lines one and two, one and three, and so on, and then correlate the results, but is there something that already does this?

Ideally these matches would appear anywhere in each line, but for starters I would be fine with them existing at the same offset in each line and going from there. Something like a compression library with a good API for accessing its string table might be ideal, but I have not found anything that fits that description so far.

For example, with these lines:

\x00\x00\x8c\x9e\x28\x28\x62\xf2\x97\x47\x81\x40\x3e\x4b\xa6\x0e\xfe\x8b
\x00\x00\xa8\x23\x2d\x28\x28\x0e\xb3\x47\x81\x40\x3e\x9c\xfa\x0b\x78\xed
\x00\x00\xb5\x30\xed\xe9\xac\x28\x28\x4b\x81\x40\x3e\xe7\xb2\x78\x7d\x3e

I would like to see that 0-1 and 10-12 match in all lines at the same position, and that line1[4,5] matches line2[5,6] matches line3[7,8].

Thanks,

+3

python compression

Peck Apr 01 '10 at 19:01


2 answers


0

Philippe Ombredanne 02 . '10 14:33

weronika · Accepted Answer · 2011-06-15T21:56:41+0000

If you want to find common substrings that have the same offset in each line, all you need is something like this:

# s1, s2 and s3 are the input lines (str or bytes)
matches = []  # will contain (startpos, endpos, matchstring) tuples
zipped_strings = list(zip(s1, s2, s3))  # a list, so len() works in Python 3
startpos = -1
for i, (c1, c2, c3) in enumerate(zipped_strings):
    # if you're not inside a match,
    #  look for matching characters and save the match start position
    if startpos == -1 and c1 == c2 == c3:
        startpos = i
    # if you are inside a match,
    #  look for non-matching characters, save the match to matches, reset startpos
    elif startpos > -1 and not c1 == c2 == c3:
        matches.append((startpos, i, s1[startpos:i]))
        startpos = -1
# if you're still inside a match when you run out of string, save that match too!
if startpos > -1:
    endpos = len(zipped_strings)
    matches.append((startpos, endpos, s1[startpos:endpos]))
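
The loop above is written for exactly three strings. As a rough sketch of the same idea for any number of lines, something like the following might work. The function name `common_runs_same_offset` is made up for illustration, and the comparison stops at the length of the shortest line because that is how `zip` behaves.

```python
def common_runs_same_offset(lines):
    """Return (start, end, chunk) for each run where all lines agree at the same offset."""
    matches = []
    start = None
    # zip(*lines) yields one tuple per offset and stops at the shortest line
    columns = list(zip(*lines))
    for i, column in enumerate(columns):
        same = len(set(column)) == 1  # every line has the same element here
        if same and start is None:
            start = i  # a new run begins
        elif not same and start is not None:
            matches.append((start, i, lines[0][start:i]))  # the run just ended
            start = None
    if start is not None:
        # still inside a run when the shortest line ran out
        matches.append((start, len(columns), lines[0][start:len(columns)]))
    return matches


# the three lines from the question, as bytes
lines = [
    b"\x00\x00\x8c\x9e\x28\x28\x62\xf2\x97\x47\x81\x40\x3e\x4b\xa6\x0e\xfe\x8b",
    b"\x00\x00\xa8\x23\x2d\x28\x28\x0e\xb3\x47\x81\x40\x3e\x9c\xfa\x0b\x78\xed",
    b"\x00\x00\xb5\x30\xed\xe9\xac\x28\x28\x4b\x81\x40\x3e\xe7\xb2\x78\x7d\x3e",
]
for start, end, chunk in common_runs_same_offset(lines):
    print(start, end, chunk)
```

On the question's data this reports the runs at offsets 0-1 and 10-12 (as half-open ranges 0:2 and 10:13). It does not find the `\x28\x28` pair, since that sits at a different offset in each line.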

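For the more general case, you could use SequenceMatcher: either find the matches between string1 and string2, then between string1 and string3, and so on, and compare the results, or find the matches between string1 and string2 (with get_matching_blocks) and then look for each of those in string3.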
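
One way to sketch the SequenceMatcher approach for matches at differing offsets is to take the matching blocks between the first two lines and keep refining them against each further line. This is only a greedy approximation, not a true multi-sequence matcher: `get_matching_blocks` returns non-overlapping blocks in order, so some common substrings can be missed, and `find` reports only the first occurrence in each line. The function name `common_blocks` and the `min_len` cutoff are assumptions made for this example.

```python
from difflib import SequenceMatcher


def common_blocks(lines, min_len=2):
    """Greedy sketch: refine substrings of lines[0] against every other line."""
    candidates = [lines[0]]
    for other in lines[1:]:
        refined = []
        for cand in candidates:
            sm = SequenceMatcher(None, cand, other, autojunk=False)
            for a, _b, size in sm.get_matching_blocks():
                if size >= min_len:  # also skips the zero-length sentinel block
                    refined.append(cand[a:a + size])
        candidates = refined
    # for each surviving substring, record its first position in every line
    return [(c, [line.find(c) for line in lines]) for c in candidates]


# the three lines from the question, as bytes
lines = [
    b"\x00\x00\x8c\x9e\x28\x28\x62\xf2\x97\x47\x81\x40\x3e\x4b\xa6\x0e\xfe\x8b",
    b"\x00\x00\xa8\x23\x2d\x28\x28\x0e\xb3\x47\x81\x40\x3e\x9c\xfa\x0b\x78\xed",
    b"\x00\x00\xb5\x30\xed\xe9\xac\x28\x28\x4b\x81\x40\x3e\xe7\xb2\x78\x7d\x3e",
]
for chunk, positions in common_blocks(lines):
    print(chunk, positions)
```

For the question's data this yields `\x00\x00` at offset 0 in every line, `\x28\x28` at offsets 4, 5 and 7, and `\x81\x40\x3e` at offset 10 in every line. Raising `min_len` filters out short coincidental matches, which become more likely as the lines get longer.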
