"NotImplementedError: SeqRecord" when sorting a fasta file using SeqIO

I am trying to sort a fasta file alphabetically by the sequences themselves (rather than by sequence identifier). The file contains more than 200 sequences, and I am trying to find duplicates using Python code; by duplicates I mean almost the same protein sequence under a different ID. So I wanted to build a dictionary from the fasta file and then sort the dictionary values. The code I am trying to use is as follows:

from Bio import SeqIO


input_file = open("PP_Seq.fasta")    
my_dict = SeqIO.to_dict(SeqIO.parse(input_file, "fasta"))
print sorted(my_dict.values())

I get an error message:

"Traceback (most recent call last):
  File "sort.py", line 4, in <module>
    print sorted(my_dict.values())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqRecord.py", line 730, in __lt__
    raise NotImplementedError(_NO_SEQRECORD_COMPARISON)
NotImplementedError: SeqRecord comparison is deliberately not implemented. Explicitly compare the attributes of interest."

I am not sure why this does not work. Is there a way to sort a fasta file by sequence using SeqIO?


As mata said, you need to pass a key function to sorted:

from Bio import SeqIO

# SeqRecord objects cannot be compared directly, so sort on the sequence text.
with open("example.fasta") as input_file:
    my_dict = SeqIO.to_dict(SeqIO.parse(input_file, "fasta"))

for r in sorted(my_dict.values(), key=lambda rec: str(rec.seq)):
    print("{} {}".format(r.id, r.seq))

Output:

seq3 ABCDEFG
seq0 ABCWYXO
seq2 BCDEFGH
seq1 IJKLMNOP
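
To illustrate why a key function avoids the error: sorted() compares elements with `<`, which calls `__lt__`, and the traceback in the question shows SeqRecord's `__lt__` raising NotImplementedError. The sketch below uses a hypothetical stand-in class, `FakeRecord` (not part of Biopython), that refuses comparison in the same way, so it runs without Biopython installed. The four sequences are the ones from the output above.

```python
class FakeRecord(object):
    """Hypothetical stand-in for SeqRecord: refuses direct comparison."""

    def __init__(self, id, seq):
        self.id = id
        self.seq = seq

    def __lt__(self, other):
        # Mimics the behaviour shown in the traceback.
        raise NotImplementedError("comparison is deliberately not implemented")


records = [
    FakeRecord("seq0", "ABCWYXO"),
    FakeRecord("seq1", "IJKLMNOP"),
    FakeRecord("seq2", "BCDEFGH"),
    FakeRecord("seq3", "ABCDEFG"),
]

try:
    sorted(records)  # compares the records themselves, so __lt__ is called
    direct_sort_failed = False
except NotImplementedError:
    direct_sort_failed = True

# With a key, only the key values (plain strings) are compared.
sorted_ids = [r.id for r in sorted(records, key=lambda r: r.seq)]
```

Here `direct_sort_failed` ends up True, and `sorted_ids` lists the records in the same order as the output above.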

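Now, for what you are actually trying to do: with 200 sequences, sorting alone will not reliably expose near-duplicates, because two sequences that differ in an early character can end up far apart in the sorted list.

To find sequences that are almost (but not exactly) the same, you need a measure of how different two strings are, such as the Levenshtein (edit) distance.

Here is a simple implementation.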

def levenshteinDistance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]

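Then compare every pair of sequences and report the pairs whose edit distance (the number of insertions/deletions/substitutions) is at or below a threshold. Applied to the FASTA file: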

from Bio import SeqIO
from itertools import combinations

threshold = 4  # maximum edit distance for two sequences to count as near-duplicates

with open("example.fasta") as input_file:
    records = SeqIO.parse(input_file, "fasta")
    # Compare every unordered pair of records once.
    for record1, record2 in combinations(records, 2):
        edit_distance = levenshteinDistance(str(record1.seq), str(record2.seq))
        if edit_distance <= threshold:
            print("{} and {} differ in {} characters".format(record1.id, record2.id, edit_distance))

Output:

seq0 and seq3 differ in 4 characters
seq2 and seq3 differ in 2 characters
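
A possible refinement, sketched here without Biopython so it runs on its own: the edit distance between two strings is never smaller than the difference of their lengths, so pairs whose lengths differ by more than the threshold can be skipped before the distance is computed. The names `edit_distance` and `near_duplicates` are made up for this sketch, and the sample records assume the four sequences appear in the file in the order seq0 to seq3.

```python
from itertools import combinations


def edit_distance(s1, s2):
    """Levenshtein distance, using the same row-by-row scheme as above."""
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s1) + 1))
    for i2, c2 in enumerate(s2):
        current = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                current.append(previous[i1])
            else:
                current.append(1 + min(previous[i1], previous[i1 + 1], current[-1]))
        previous = current
    return previous[-1]


def near_duplicates(records, threshold):
    """Return (id1, id2, distance) for every pair within the threshold."""
    hits = []
    for (id1, seq1), (id2, seq2) in combinations(records, 2):
        # The edit distance is at least the length difference, so skip early.
        if abs(len(seq1) - len(seq2)) > threshold:
            continue
        distance = edit_distance(seq1, seq2)
        if distance <= threshold:
            hits.append((id1, id2, distance))
    return hits


# Sample (id, sequence) pairs taken from the output shown earlier.
sample = [
    ("seq0", "ABCWYXO"),
    ("seq1", "IJKLMNOP"),
    ("seq2", "BCDEFGH"),
    ("seq3", "ABCDEFG"),
]
hits = near_duplicates(sample, threshold=4)
```

On this sample the pre-filter never triggers, since all lengths are within one character of each other, and `hits` contains the same two pairs as the output above. It only pays off when sequence lengths vary a lot.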

Here is a way to remove duplicate sequences from a fasta file using bioawk and fastq-tools (plus awk and uniq, which are standard UNIX tools):

bioawk -c fastx '{print "@"$name"\n"$seq"\n+\n"$seq}' test.fasta \
    | fastq-sort -s \
    | bioawk -c fastx '{print $name"\t"$seq}' \
    | uniq -f 1 \
    | awk '{print ">"$1"\n"$2}'

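bioawk is a modified version of awk that can parse several bioinformatics formats.

The first step converts the data to fastq, because that is what fastq-sort works with. The -c fastx option tells bioawk to parse the input as fasta or fastq, which defines the $name and $seq fields. $seq is reused as a dummy quality string so that the result is valid fastq.

The second step uses fastq-sort (from fastq-tools) to sort the records by sequence (the -s option).

The third step uses bioawk again to print the name and the sequence, separated by a tab.

The fourth step uses uniq to drop duplicates, skipping the first field (the name) when comparing lines (the -f option). uniq keeps only the first line of each run of identical sequences.

The last step uses awk to format the two fields back into fasta.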
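
For comparison, the same sort-then-deduplicate idea can be sketched in plain Python. This is only an analogue of the pipeline, not a replacement for it: `drop_exact_duplicates` is a made-up name, the records are invented, and which of the duplicate names survives in the shell version depends on how fastq-sort orders ties, which is not checked here. Like the pipeline, it only removes exact duplicates.

```python
from itertools import groupby


def drop_exact_duplicates(records):
    """Keep one (name, sequence) pair per distinct sequence.

    Mirrors the pipeline: sort by sequence, then keep the first record
    of each run of identical sequences (what `uniq -f 1` does).
    """
    by_sequence = sorted(records, key=lambda rec: rec[1])
    return [next(group) for _, group in groupby(by_sequence, key=lambda rec: rec[1])]


# Made-up records for illustration; seqB repeats the sequence of seqA.
records = [
    ("seqA", "MKVLAA"),
    ("seqB", "MKVLAA"),
    ("seqC", "ACDEFG"),
]
unique = drop_exact_duplicates(records)

# Format the surviving records back into fasta text.
fasta_text = "".join(">{}\n{}\n".format(name, seq) for name, seq in unique)
```

Because Python's sort is stable, the first record in input order (seqA here) is the one kept for each repeated sequence.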



Source: https://habr.com/ru/post/1670400/

