Finding the longest perfect match between two lines

Question

Finding the longest perfect match between two lines

I am trying to write a script that will perform two functions if they are provided with two lines:

1 . Find the longest sequence of characters starting with pos [0], which is the same on both lines

Seq1 = 'ATCCTTAGC' Seq2 = 'ATCCAGCAATTC' ^^^^ Match from pos[0] to pos[3] Pos: 0:3 Length: 4 Seq: ATCC

2 . Find the longest character mileage that exists on both lines

 Seq1 = 'TAGCTCCTTAGC' # Contains 'TCCTT' Seq2 = 'GCAGCCATCCTTA' # Contains 'TCCTT' ^ No match as pos[0] Pos1: 4:8 Pos2 7:11 Length: 5 Seq: TCCTT

To complete task 1, I have the following:

 #!/usr/bin/python upstream_seq = 'ATACATTGGCCTTGGCTTAGACTTAGATCTAGACCTGAAAATAACCTGCCGAAAAGACCCGCCCGACTGTTAATACTTTACGCGAGGCTCACCTTTTTGTTGTGCTCCC' downstream_seq = 'ATACACGAAAAGCGTTCTTTTTTTGCCACTTTTTTTTTATGTTTCAAAACGGAAAATGTCGCCGTCGTCGGGAGAGTGCCTCCTCTTAGTTTATCAAATAAAGCTTTCG' print("Upstream: %s\nDownstream: %s\n") % (upstream_seq, downstream_seq) mh = 0 pos_count = 0 seq = "" position ="" longest_hom="" for i in range(len(upstream_seq)): pos_count += 1 if upstream_seq[i] == downstream_seq[i]: mh += 1 seq += upstream_seq[i] position = pos_count longest_hom = mh else: mh = 0 break print("Pos: 0:%s\nLength: %s\nSeq: %s\n") % (position , longest_hom, seq)

 Upstream: ATACATTGGCCTTGGCTTAGACTTAGATCTAGACCTGAAAATAACCTGCCGAAAAGACCCGCCCGACTGTTAATACTTTACGCGAGGCTCACCTTTTTGTTGTGCTCCC Downstream: ATACACGAAAAGCGTTCTTTTTTTGCCACTTTTTTTTTATGTTTCAAAACGGAAAATGTCGCCGTCGTCGGGAGAGTGCCTCCTCTTAGTTTATCAAATAAAGCTTTCG Pos: 0:5 Length: 5 Seq: ATACA

I am having problems with problem 2 . So far, I have been looking at alignment between two sequences using BioPython pairwise2 . However, in this case I only need perfect matches (no spaces, no extensions), and I want to see only the longest sequence, and not the consensus that I get:

 from Bio import pairwise2 as pw2 global_align = pw2.align.globalms(upstream_seq, downstream_seq, 3, -1, -.5, -.5) print(global_align[0])

 ('ATACATT-G----GCC-TTGGCTTA-----G--ACTTAGATCTAG-----ACCTGAA----AATAACCTGCCGAAAA-GACC-CGCCCGACTGTTAATACTT-TACGCG-AG-GCT-CAC-CT-TT--TTGT-TG----T---GCTCC--C-', 'ATACA--CGAAAAG-CGTT--CTT-TTTTTGCCACTT---TT--TTTTTA--TG--TTTCAA-AA-CG--GAAAATG---TCG--C--CG----TC--GT-CG-GGAGAG-TGC-CTCCTCTTAGTT-TAT-CAAATAAAGCT--TTCG', 151.0, 0, 153)

Question How to find the longest run of characters that exists on both lines?

+5

python bioinformatics

fugu Oct 20 '17 at 9:28

source share

3 answers

The longest common substring problem can be addressed in several ways, and some are more effective than others. One very effective solution involves dynamic programming, and its implementation in both python 2 and 3 can be found in wikibooks . A naive solution, simpler and more understandable, but less effective, is as follows:

 def longest_common_substring(s1, s2): current_match_start = -1 current_match_end = -1 best_match_start = current_match_start best_match_end = current_match_end min_len = min(len(s1), len(s2)) for i in range(min_len): if s1[i] == s2[i]: current_match_start = current_match_end = i j = 0 while s1[i+j] == s2[i+j] and i+j < min_len: j += 1 current_match_end = current_match_start + j if current_match_end - current_match_start > best_match_end - best_match_start: best_match_start = current_match_start best_match_end = current_match_end return s1[best_match_start:best_match_end] upstream_seq = 'ATACATTGGCCTTGGCTTAGACTTAGATCTAGACCTGAAAATAACCTGCCGAAAAGACCCGCCCGACTGTTAATACTTTACGCGAGGCTCACCTTTTTGTTGTGCTCCC' downstream_seq = 'ATACACGAAAAGCGTTCTTTTTTTGCCACTTTTTTTTTATGTTTCAAAACGGAAAATGTCGCCGTCGTCGGGAGAGTGCCTCCTCTTAGTTTATCAAATAAAGCTTTCG' print(longest_common_substring(upstream_seq, downstream_seq))

+2

User Oct 20 '17 at 10:00

source share

As mentioned in Eric Duminil, one way to get closer to this is to use difflib.SequenceMatcher.find_longest_match :

 from difflib import SequenceMatcher upstream_seq = 'ATACATTGGCCTTGGCTTAGACTTAGATCTAGACCTGAAAATAACCTGCCGAAAAGACCCGCCCGACTGTTAATACTTTACGCGAGGCTCACCTTTTTGTTGTGCTCCC' downstream_seq = 'ATACACGAAAAGCGTTCTTTTTTTGCCACTTTTTTTTTATGTTTCAAAACGGAAAATGTCGCCGTCGTCGGGAGAGTGCCTCCTCTTAGTTTATCAAATAAAGCTTTCG' s = SequenceMatcher(None, upstream_seq, downstream_seq) match = s.find_longest_match(0, len(upstream_seq), 0, len(downstream_seq)) print(match) upstream_start = match[0] upstream_end = match[0]+match[2] seq = upstream_seq[match[0]:(match[0]+match[2])] downstream_start = match[1] downstream_end = match[1]+match[2] print("Upstream seq: %s\nstart-stop: %s-%s\n") % (seq, upstream_start, upstream_end) print("Downstream seq: %s\nstart-stop: %s-%s\n") % (seq, downstream_start, downstream_end)

 Match(a=49, b=5, size=7) Upstream seq: CGAAAAG start-stop: 49-56 Downstream seq: CGAAAAG start-stop: 5-12

+1

fugu Oct 20 '17 at 10:16

source share

Eric Duminil · Accepted Answer · 2017-10-20T09:56:35+0000

Here's a shorter code for Problem 1 :

 upstream_seq = 'ATACATTGGCCTTGGCTTAGACTTAGATCTAGACCTGAAAATAACCTGCCGAAAAGACCCGCCCGACTGTTAATACTTTACGCGAGGCTCACCTTTTTGTTGTGCTCCC' downstream_seq = 'ATACACGAAAAGCGTTCTTTTTTTGCCACTTTTTTTTTATGTTTCAAAACGGAAAATGTCGCCGTCGTCGGGAGAGTGCCTCCTCTTAGTTTATCAAATAAAGCTTTCG' common_prefix = '' for x,y in zip(upstream_seq, downstream_seq): if x == y: common_prefix += x else: break print(common_prefix) # ATACA

The naive approach for Problem 2 would be to simply generate a set of each substring for each row, calculate their intersection and sort by length:

 upstream_seq = 'ATACATTGGCCTTGGCTTAGACTTAGATCTAGACCTGAAAATAACCTGCCGAAAAGACCCGCCCGACTGTTAATACTTTACGCGAGGCTCACCTTTTTGTTGTGCTCCC' downstream_seq = 'ATACACGAAAAGCGTTCTTTTTTTGCCACTTTTTTTTTATGTTTCAAAACGGAAAATGTCGCCGTCGTCGGGAGAGTGCCTCCTCTTAGTTTATCAAATAAAGCTTTCG' def all_substrings(string): n = len(string) return {string[i:j+1] for i in range(n) for j in range(i,n)} print(all_substrings('ABCA')) # {'CA', 'BC', 'ABC', 'C', 'BCA', 'AB', 'A', 'B', 'ABCA'} print(all_substrings(upstream_seq) & all_substrings(downstream_seq)) # {'AAAG', 'CA', 'A', 'AAC', 'TGTT', 'ACT', 'CTTAG', 'GCT', 'ATAC', 'AAAA', 'TTTA', 'AAT', 'GTGC', 'CTT', 'AAAAG', 'TTTG', 'CGAA', 'AA', 'CGAAAAG', 'GCC', 'ACA', 'TGCC', 'AAATAA', 'CTCC', 'TTTTT', 'CGCC', 'CAC', 'GAG', 'CTC', 'CGAAAA', 'ATC', 'TCA', 'GA', 'CGC', 'TGT', 'GT', 'GC', 'GAAA', 'ACTTT', 'AAG', 'TTTT', 'CT', 'AATA', 'TCC', 'CGAAA', 'GAA', 'GAAAAG', 'GTT', 'AG', 'TC', 'AAAAT', 'CC', 'TTT', 'AATAA', 'CTTTT', 'ACTT', 'TTA', 'CTTT', 'GCTT', 'GCCG', 'GTG', 'TACA', 'TT', 'GCG', 'TTTTTG', 'TAG', 'TTG', 'TTAG', 'AAATA', 'CTTTTT', 'AAAT', 'TAA', 'ACG', 'TG', 'GCCT', 'G', 'TAC', 'CCT', 'TCT', 'ATA', 'CTTA', 'CCG', 'CG', 'ATAA', 'GG', 'ATACA', 'AGA', 'TGC', 'C', 'T', 'AT', 'GAAAA', 'CGA', 'GAAAAT', 'TA', 'AC', 'AAA', 'TTTTG'} print(max(all_substrings(upstream_seq) & all_substrings(downstream_seq), key=len)) # CGAAAAG

If you want a more efficient approach, you should use a suffix tree .

If you don't want to reinvent the wheel, you can simply use difflib.SequenceMatcher.find_longest_match

Finding the longest perfect match between two lines

More articles: