Find the number of gaps in a sequence

I have a program that analyzes the sequence of alleles. I am trying to write code that determines whether an allele is complete or not. To do this, I need to count the number of gaps in the control sequence. A gap is indicated by the string '-'. If there is more than one gap, I want the program to say "Incomplete allele."

How can I count the number of gaps in a sequence?

Here is an example of a broken sequence:

>DQB1*04:02:01
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
--ATGTCTTGGAAGAAGGCTTTGCGGAT-------CCCTGGAGGCCTTCGGGTAGCAACT
GTGACCTT----GATGCTGGCGATGCTGAGCACCCCGGTGGCTGAGGGCAGAGACTCTCC
CGAGGATTTCGTGTTCCAGTTTAAGGGCATGTGCTACTTCACCAACGGGACCGAGCGCGT
GCGGGGTGTGACCAGATACATCTATAACCGAGAGGAGTACGCGCGCTTCGACAGCGACGT
GGGGGTGTATCGGGCGGTGACGCCGCTGGGGCGGCTTGACGCCGAGTACTGGAATAGCCA
GAAGGACATCCTGGAGGAGGACCGGGCGTCGGTGGACACCGTATGCAGACACAACTACCA
GTTGGAGCTCCGCACGACCTTGCAGCGGCGA-----------------------------
-----------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---GTGGAGCCCACAGTGACCATCTCCCCATCCAGGACAGAGGCCCTCAACCACCACAAC
CTGCTGGTCTGCTCAGTGACAGATTTCTATCCAGCCCAGATCAAAGTCCGGTGGTTTCGG
AATGACCAGGAGGAGACAACTGGCGTTGTGTCCACCCCCCTTATTAGGAACGGTGACTGG
ACCTTCCAGATCCTGGTGATGCTGGAAATGACTCCCCAGCGTGGAGACGTCTACACCTGC
CACGTGGAGCACCCCAGCCTCCAGAACCCCATCATCGTGGAGTGGCGGGCTCAGTCTGAA
TCTGCCCAGAGCAAGATGCTGAGTGG----CATTGGAGGCTTCGTGCTGGGGCTGATCTT
CCTCGGGCTGGGCCTTATTATC--------------CATCACAGGAGTCAGAAAGGGCTC
CTGCACTGA---------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------

The code I have so far is as follows:

idx=[]
for m in range(len(sequence)):
    for n in re.finditer('-',sequence[0]): 
        idx.append(n.start())
counter=0
min_val=[]
for n in range(len(idx)):
    if counter==idx[n]:
        counter=counter+1
    elif counter !=0:
        min_val.append(idx[n-1])
        counter=0


Use the regular expression "-+", which matches a run of one or more consecutive dashes, and count the matches.

>>> import re
>>> sequence = """>DQB1*04:02:01....."""
>>> joined = ''.join(sequence.splitlines())
>>> sum(1 for m in re.finditer("-+", joined))
7

To count the sequence segments between the gaps instead:

>>> sum(1 for m in re.finditer("[GATC]+", joined))
6
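If the positions of the gaps matter as well as their number, the same "-+" pattern can report where each run starts and how long it is. The following is a minimal sketch, not the answerer's code: the function name gap_runs and the short example record are hypothetical, and it assumes header lines start with '>' and should be skipped.

```python
import re

def gap_runs(fasta_text):
    """Return (start, length) for every run of '-' in a FASTA-style record."""
    # Skip header lines (assumed to start with '>') and join the sequence lines.
    body = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if not line.startswith(">")
    )
    # Each match of "-+" is one gap, however many dashes it contains.
    return [(m.start(), m.end() - m.start()) for m in re.finditer(r"-+", body)]

# Hypothetical toy record, much shorter than the allele in the question.
record = """>example
--ATG---CC
GT----A--
"""
runs = gap_runs(record)
print(len(runs))  # number of gaps
print(runs)       # (start index, length) of each gap
```

The length of the returned list is the gap count, so the check from the question becomes a comparison against len(runs).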

Split the sequence on "-".

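def check_allele(sequence):
    # assumes the header and line breaks have already been removed
    # split on '-' and drop the empty strings produced by consecutive dashes
    str_list = [s for s in sequence.split('-') if s]
    if len(str_list) > 2:
        return "Incomplete Allele"
    else:
        return "Complete Allele"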

I think this should do:

def test(sequence):
    sequence = ''.join(sequence.splitlines()[1:]) # remove first line (header) and line breaks
    S = [segment for segment in sequence.split('-') if segment != '']
    if len(S)>2: # len(S) is the number of remaining segments
        print("Incomplete Allele.")
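The segment count and the gap count are related: n non-empty segments are separated by n - 1 internal gaps, and leading or trailing dashes add no segments. The sketch below turns that into a function that returns a result instead of printing. It is only an illustration: the name classify_allele and the parameter max_internal_gaps are hypothetical, and it assumes that leading and trailing gaps should not count and that "more than one gap" from the question is the threshold.

```python
def classify_allele(fasta_text, max_internal_gaps=1):
    """Classify a FASTA-style record by its number of internal gaps."""
    # Skip header lines (assumed to start with '>') and join the sequence lines.
    body = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if not line.startswith(">")
    )
    # Consecutive dashes produce empty strings when splitting; drop them.
    segments = [s for s in body.split("-") if s]
    # n segments are separated by n - 1 gaps; leading/trailing gaps are ignored.
    internal_gaps = max(len(segments) - 1, 0)
    if internal_gaps > max_internal_gaps:
        return "Incomplete allele."
    return "Complete allele."

print(classify_allele(">x\n--AT--GC--\n"))     # two segments, one internal gap
print(classify_allele(">x\n--AT--GC-TT--\n"))  # three segments, two internal gaps
```

Under these assumptions, the sample in the question, with its 6 segments, has 5 internal gaps and would be reported as incomplete.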

Source: https://habr.com/ru/post/1650562/

