Is there a function that can calculate the score for aligned sequences based on alignment parameters?

I am trying to score already aligned sequences. Let them talk

seq1 = 'PAVKDLGAEG-ASDKGT--SHVVY----------TI-QLASTFE' seq2 = 'PAVEDLGATG-ANDKGT--LYNIYARNTEGHPRSTV-QLGSTFE' 

with the given parameters

 substitution matrix : blosum62 gap open penalty : -5 gap extension penalty : -1 

I looked at the cookbook on biopitone, but all I can get is the blogsum62 replacement matrix, but I feel that it should have someone already implemented such a library.

So can anyone suggest any libraries or shortest code that can solve my problem?

thanks in advance

+6
source share
2 answers

Jessada

The Blosum62 matrix (pay attention to spelling;) is located in Bio.SubsMat.MatrixInfo and is a dictionary with tuples that resolve points (so ('A', 'A') costs 4 points). It has no spaces, and this is only one matrix triangle (therefore it can be ahve ('T', 'A'), but not ('A', 'T'). Biopython has some helper functions, including some in Bio.Pairwise, but this is what I came up with as an answer:

 from Bio.SubsMat import MatrixInfo def score_match(pair, matrix): if pair not in matrix: return matrix[(tuple(reversed(pair)))] else: return matrix[pair] def score_pairwise(seq1, seq2, matrix, gap_s, gap_e): score = 0 gap = False for i in range(len(seq1)): pair = (seq1[i], seq2[i]) if not gap: if '-' in pair: gap = True score += gap_s else: score += score_match(pair, matrix) else: if '-' not in pair: gap = False score += score_match(pair, matrix) else: score += gap_e return score seq1 = 'PAVKDLGAEG-ASDKGT--SHVVY----------TI-QLASTFE' seq2 = 'PAVEDLGATG-ANDKGT--LYNIYARNTEGHPRSTV-QLGSTFE' blosum = MatrixInfo.blosum62 score_pairwise(seq1, seq2, blosum, -5, -1) 

Which returns 82 for your alignment. There are almost exclusively beautiful ways to do all this, but this should be a good start.

+8
source

blosum62 is a query of 276 elements.

I prefer to fill in the missing elements because it represents an iteration of only 276 revolutions, while the analyzed sequences are likely to contain more than 276 elements. Therefore, if you find the score of each pair using the score_match () function, this function will have to run an if pair not in matrix test for each element of the sequences, that is, of course, much more than 276 times.

Another thing that takes a lot of time: each score += something creates a new integer and associates the name rating with this new object. Each binding takes a certain amount of time that does not exist with a stream of integers by a generator, which are instantly added to the current amount.

 from Bio.SubsMat.MatrixInfo import blosum62 as blosum from itertools import izip blosum.update(((b,a),val) for (a,b),val in blosum.items()) def score_pairwise(seq1, seq2, matrix, gap_s, gap_e, gap = True): for A,B in izip(seq1, seq2): diag = ('-'==A) or ('-'==B) yield (gap_e if gap else gap_s) if diag else matrix[(A,B)] gap = diag seq1 = 'PAVKDLGAEG-ASDKGT--SHVVY----------TI-QLASTFE' seq2 = 'PAVEDLGATG-ANDKGT--LYNIYARNTEGHPRSTV-QLGSTFE' print sum(score_pairwise(seq1, seq2, blosum, -5, -1)) 

This function score_pairwise () is a generator function since yield returns .

+5
source

Source: https://habr.com/ru/post/886046/


All Articles