I want to sort a huge FASTA file (more than 10 ** 8 lines and sequences) by sequence length. FASTA is a well-defined format used in biology to store sequences (genetic or protein):
    >id1
    sequence 1  # a sequence can span several lines
    >id2
    sequence 2
    ...
I ran a tool that gives me a TSV file with three columns per sequence: the identifier, the length, and the position in bytes of the identifier.
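To make the index layout concrete, here is a minimal sketch of how such a file could be produced. This is not the tool I actually ran; `build_index` is a hypothetical helper, and it assumes the identifier is the first whitespace-delimited token of the header line and that the offset is the byte position of the `>` character.

```python
import io


def build_index(fasta):
    """Yield (identifier, sequence_length, header_offset) for each record.

    `fasta` must be a binary file object so that offsets are true byte
    positions. header_offset is the position of the '>' starting the record.
    """
    name, length, start = None, 0, 0
    offset = 0
    for line in fasta:
        if line.startswith(b'>'):
            if name is not None:
                yield name, length, start
            # Assumption: the identifier is the first token after '>'.
            fields = line[1:].split()
            name = fields[0].decode() if fields else ''
            length, start = 0, offset
        elif name is not None:
            # Line terminators and surrounding whitespace are not counted.
            length += len(line.strip())
        offset += len(line)
    if name is not None:
        yield name, length, start


# Tiny in-memory example; a real run would pass open(path, 'rb') instead.
demo = io.BytesIO(b">id1\nACGT\nAC\n>id2\nGG\n")
for name, length, start in build_index(demo):
    print(name, length, start, sep='\t')
```

Reading in binary mode matters here: the offsets are later passed to `seek`, which works on byte positions.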
Currently, I sort this file by the length column, then parse it and use `seek` to fetch the corresponding sequence from the FASTA file and append it to a new file.
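    def get_seq(file, offset):
        """Return the sequence of the record located at byte `offset`."""
        # Binary mode: seeking to an arbitrary byte offset is only reliable here.
        with open(file, 'rb') as f_:
            f_.seek(offset, 0)  # jump to the record of interest
            line = f_.readline()
            if line.startswith(b'>'):  # offset points at the header: skip it
                line = f_.readline()
            parts = []
            # Stop at end of file (empty read) or at the next header.
            while line and not line.startswith(b'>'):
                parts.append(line.strip())
                line = f_.readline()
        return b''.join(parts).decode()


    def write_seq(out_file, id_, sequence):
        """Append one record to out_file."""
        with open(out_file, 'a') as out:
            out.write('>{}\n{}\n'.format(id_.strip(), sequence))


    # args holds the parsed command-line arguments:
    # args.i = input FASTA, args.fai = sorted index, args.out = output FASTA
    with open(args.fai) as ref:
        for line in ref:
            spt = line.split()
            id_ = spt[0]
            seq = get_seq(args.i, int(spt[2]))
            write_seq(out_file=args.out, id_=id_, sequence=seq)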
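The sort of the index that precedes this loop is not shown above. A minimal sketch of it, assuming the index fits in memory and that the length is the second tab-separated column (`sort_index` is a hypothetical helper, not part of my actual script):

```python
import csv
import io


def sort_index(rows, descending=False):
    """Return index rows sorted by the integer length in the second column."""
    # sorted() is stable, so records of equal length keep their input order.
    return sorted(rows, key=lambda row: int(row[1]), reverse=descending)


# Tiny in-memory example; a real run would read the TSV file from disk.
index_tsv = "id1\t6\t0\nid2\t2\t13\nid3\t4\t19\n"
rows = list(csv.reader(io.StringIO(index_tsv), delimiter='\t'))
for row in sort_index(rows):
    print('\t'.join(row))
```

If the index does not fit in memory, an external sort such as GNU `sort -k2,2n` on the TSV file would do the same job.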