To give you a little context: I'm trying to convert a SAM file to BAM with
samtools view -bT reference.fasta sequences.sam > sequences.bam
which fails with the following error:
[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] parse error at line 102
[main_samview] truncated file
and the offending line is as follows:
SRR808297.2571281 99 gi|309056|gb|L20934.1|MSQMTCG 747 80 101M = 790 142 TTGGTATAAAATTTAATAATCCCTTATTAATTAATAAACTTCGGCTTCCTATTCGTTCATAAGAAATATTAGCTAAACAAAATAAACCAGAAGAACAT @@CFDDFD?HFDHIGEGGIEEJIIJJIIJIGIDGIGDCHJJCHIGIJIJIIJJGIGHIGICHIICGAHDGEGGGGACGHHGEEEFDC@=?CACC>CCC NM:i:2 MD:Z:98A1A
My sequence is 98 characters long, but the CIGAR string reports 101, probably because of an error made when the SAM file was created. I can afford to lose a few reads, and at the moment I do not have access to the source code that created the SAM files, so I cannot track down the error and rerun the alignment. In other words, I need a pragmatic workaround to continue (for now). I therefore wrote a Python script that computes the length of each nucleotide sequence, compares it with the length recorded in the CIGAR string, and saves the "healthy" lines to a new file.
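import itertools
from cigar import Cigar

with open('myfile.sam', 'r') as f:
    # Skip the first 3 lines (presumably the @ header lines of this file)
    for line in itertools.islice(f, 3, None):
        fields = line.split("\t")
        cigarlength = len(Cigar(fields[5]))  # read length implied by CIGAR (column 6)
        seqlength = len(fields[9])           # actual length of SEQ (column 10)
        if cigarlength == seqlength:
            ...  # Preserve the line in a new file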
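For comparison, here is a minimal sketch of the same filter using only the standard library. It is an assumption-laden illustration and not a tested replacement for the script above: the function names (`cigar_query_length`, `is_consistent`, `filter_sam`) and the output file name are hypothetical. It counts only the CIGAR operations that consume query bases (M, I, S, = and X, as defined in the SAM specification), and it passes header lines through by their leading `@` instead of skipping a fixed number of lines.

```python
import re

# CIGAR operations that consume bases of the query sequence (SAM specification).
QUERY_CONSUMING = set("MIS=X")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_query_length(cigar):
    """Return the read length implied by a CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in QUERY_CONSUMING)


def is_consistent(line):
    """True for header lines and for records whose CIGAR length matches SEQ."""
    if line.startswith("@"):
        return True
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        # A SAM record has 11 mandatory columns; treat shorter lines as broken.
        return False
    cigar, seq = fields[5], fields[9]
    # '*' means "not available" in either column, so there is nothing to compare.
    if cigar == "*" or seq == "*":
        return True
    return cigar_query_length(cigar) == len(seq)


def filter_sam(src, dst):
    """Copy consistent lines from src to dst; return the number of dropped records."""
    dropped = 0
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if is_consistent(line):
                fout.write(line)
            else:
                dropped += 1
    return dropped
```

A call such as `filter_sam('myfile.sam', 'myfile.filtered.sam')` would then write the surviving lines and report how many records were dropped. One caveat: the offending read has flag 99, so it is part of a pair, and dropping it alone leaves its mate in the file without a partner, which some downstream tools may complain about.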
, CIGAR , , CIGAR. , . , -, . ?
Sidenote: , , , , . . :
https:
http:
http:
https: