Print the length of a sequence using CIGAR

Question

For context: I'm trying to convert a SAM file to BAM with

samtools view -bT reference.fasta sequences.sam > sequences.bam

which fails with the following error:

[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] parse error at line 102
[main_samview] truncated file

and the offending line is:

SRR808297.2571281       99      gi|309056|gb|L20934.1|MSQMTCG   747     80      101M    =       790     142     TTGGTATAAAATTTAATAATCCCTTATTAATTAATAAACTTCGGCTTCCTATTCGTTCATAAGAAATATTAGCTAAACAAAATAAACCAGAAGAACAT      @@CFDDFD?HFDHIGEGGIEEJIIJJIIJIGIDGIGDCHJJCHIGIJIJIIJJGIGHIGICHIICGAHDGEGGGGACGHHGEEEFDC@=?CACC>CCC      NM:i:2  MD:Z:98A1A

My sequence is 98 characters long, but the CIGAR string reports 101, probably because of an error when the SAM file was created. I can afford to lose a few reads, and at the moment I have no access to the source code that produced the SAM files, so I cannot track down the error and rerun the alignment. In other words, I need a pragmatic solution to keep going (for now). So I wrote a Python script that computes the length of each nucleotide string, compares it with the length recorded in the CIGAR string, and saves the "healthy" lines to a new file.

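#!/usr/bin/python
import itertools
from cigar import Cigar  # third-party "cigar" package; the class is Cigar, not the module

# 'filtered.sam' is a placeholder name for the output file
with open('myfile.sam', 'r') as f, open('filtered.sam', 'w') as out:
    # Loop through the file and skip the first three (header) lines
    for line in itertools.islice(f, 3, None):
        fields = line.split("\t")
        cigar_string = fields[5]  # do not reuse the name "cigar", it would shadow the module
        # Use the Cigar class to obtain the length reported in the CIGAR string
        cigarlength = len(Cigar(cigar_string))
        seqlength = len(fields[9])

        if cigarlength == seqlength:
            # Preserve the "healthy" line in the new file
            out.write(line)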

, CIGAR , , CIGAR. , . , -, . ?

Sidenote: , , , , . . :

https://github.com/COMBINE-lab/RapMap/issues/9
http://seqanswers.com/forums/showthread.php?t=67253
http://seqanswers.com/forums/showthread.php?t=21120
https://groups.google.com/forum/#!msg/snap-user/FoDsGeNBDE0/nRFq-GhlAQAJ

python module bioinformatics samtools

j91 26 . '16 19:25

jrandall · Answer 1 · 2016-10-02T01:18:55+0000

, , , , . , CIGAR ( M atch ). 101M 98M.

(, , I nsertions, D eletions ), , CIGAR . , , , , , .

, , (, ), , , - , , .

samtools does this with the htslib function bam_cigar2qlen.

, bam_cigar2qlen sam.h, , , .

In short, to compute the query length from a CIGAR string, samtools (via htslib) sums the lengths of the CIGAR operations of type M, I, S, =, or X.
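
A rough Python sketch of that rule, mirroring the way htslib classifies operations: its sam.h packs two bits per operation into the constant BAM_CIGAR_TYPE (0x3C1A7), where bit 1 means "consumes query" and bit 2 means "consumes reference". The helper names cigar_op_type and cigar_query_len, and the regex-based parsing, are my own illustration and not htslib code.

```python
import re

# Operation order and packed type table as defined in htslib's sam.h.
BAM_CIGAR_STR = "MIDNSHP=XB"
BAM_CIGAR_TYPE = 0x3C1A7  # two bits per op: bit 1 = consumes query, bit 2 = consumes reference


def cigar_op_type(op):
    """Return the 2-bit type of a CIGAR operation character."""
    index = BAM_CIGAR_STR.index(op)
    return (BAM_CIGAR_TYPE >> (index << 1)) & 3


def cigar_query_len(cigar_string):
    """Sum the lengths of all operations that consume the query (hypothetical helper)."""
    total = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=XB])", cigar_string):
        if cigar_op_type(op) & 1:  # bit 1 set: M, I, S, = or X
            total += int(length)
    return total
```

With this sketch, cigar_query_len("101M") gives 101, which is the value samtools compares against the length of the SEQ field.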

python, , ( , len(Cigar(cigar))). , ?

, python mask_left mask_right mask="H".

Mark Amery · Answer 2 · 2019-07-07T16:38:07+0000

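The SAM specification gives this table of CIGAR operations, showing which ones "consume" the query or the reference, along with a rule for computing the sequence length from the CIGAR: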

                                                            Consumes  Consumes
Op BAM Description                                             query  reference
M   0   alignment match (can be a sequence match or mismatch)   yes   yes
I   1   insertion to the reference                              yes   no
D   2   deletion from the reference                             no    yes
N   3   skipped region from the reference                       no    yes
S   4   soft clipping (clipped sequences present in SEQ)        yes   no
H   5   hard clipping (clipped sequences NOT present in SEQ)    no    no
P   6   padding (silent deletion from padded reference)         no    no
=   7   sequence match                                          yes   yes
X   8   sequence mismatch                                       yes   yes

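"Consumes query" and "consumes reference" indicate whether the CIGAR operation causes the alignment to step along the query sequence and the reference sequence respectively.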

...

Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ.
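
As a small illustration of that rule, the sketch below checks a single SAM record. The function name seq_length_matches_cigar and the record are made up for this example: the record is synthetic and only reproduces the situation from the question, a CIGAR of 101M next to a SEQ of 98 bases.

```python
import re

# Operations marked "consumes query: yes" in the table above.
QUERY_CONSUMING_OPS = set("MIS=X")


def seq_length_matches_cigar(sam_line):
    """Return True if the CIGAR query length equals len(SEQ) (hypothetical helper)."""
    fields = sam_line.rstrip("\n").split("\t")
    cigar_string, seq = fields[5], fields[9]  # columns 6 (CIGAR) and 10 (SEQ)
    cigar_len = sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar_string)
        if op in QUERY_CONSUMING_OPS
    )
    return cigar_len == len(seq)


# Synthetic record: CIGAR claims 101 query bases, SEQ holds only 98.
bad_record = "\t".join(
    ["read1", "99", "ref", "747", "80", "101M", "=", "790", "142", "A" * 98, "I" * 98]
)
good_record = bad_record.replace("101M", "98M")
```

Here seq_length_matches_cigar(bad_record) is False and seq_length_matches_cigar(good_record) is True, which is the same comparison that makes samtools reject the line.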

" " . , (. https://github.com/brentp/cigar/blob/754cfed348364d390ec1aa40c951362ca1041f7a/cigar.py#L88-L93), , OP .

( ) , - Python , :

from itertools import groupby

def query_len(cigar_string):
    """
    Given a CIGAR string, return the number of bases consumed from the
    query sequence.
    """
    read_consuming_ops = ("M", "I", "S", "=", "X")
    result = 0
    cig_iter = groupby(cigar_string, lambda chr: chr.isdigit())
    for _, length_digits in cig_iter:
        length = int(''.join(length_digits))
        op = next(next(cig_iter)[1])
        if op in read_consuming_ops:
            result += length
    return result
