Sort fasta by sequence size

Currently, I want to sort a huge fasta file (more than 10 ** 8 lines and sequences) by sequence size. fasta is a clearly defined format used in biology to store sequences (genetic or protein):

>id1
sequence 1 # sequence can be on several lines
>id2
sequence 2
...

I have run tools that give me an index in tsv format:

the identifier, the length, and the position in bytes of the identifier.

Currently, what I am doing is sorting this file by the length column, then parsing it and using seek to retrieve the corresponding sequence, which I then append to a new file.
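The sorting step itself could be sketched as follows. This is only an illustration: it assumes a tab-separated index whose columns are identifier, length and byte offset, the name `sort_index_by_length` is made up, and it loads the whole index into memory.

```python
import csv


def sort_index_by_length(index_path, sorted_path, reverse=False):
    """Sort a TSV index (identifier, length, byte offset) by the length column."""
    with open(index_path, newline="") as src:
        rows = [row for row in csv.reader(src, delimiter="\t") if row]
    # Compare lengths as integers, otherwise "10" would sort before "9".
    rows.sort(key=lambda row: int(row[1]), reverse=reverse)
    with open(sorted_path, "w", newline="") as dst:
        csv.writer(dst, delimiter="\t", lineterminator="\n").writerows(rows)
    return rows
```

Passing `reverse=True` would put the longest sequences first.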

# this function gets the sequence using seek
def get_seq(file, bites):

    with open(file) as f_:
        f_.seek(bites, 0)  # go to the line of interest
        line = f_.readline()  # this line is the beginning of the
                              # sequence
        to_return = ""  # init the string which will contain the sequence

        # stop at the next identifier or at end of file (empty string)
        while line and not line.startswith('>'):
            to_return += line.strip()
            line = f_.readline()

    return to_return
# simply append to a file the id and the sequence
def write_seq(out_file, id_, sequence):

    with open(out_file, 'a') as out_file:
        out_file.write('>{}\n{}\n'.format(id_.strip(), sequence))

# main loop: parse the index file and call the functions defined above
with open(args.fai) as ref:

    indice = 0

    for line in ref:

        spt = line.split()
        id_ = spt[0]
        seq = get_seq(args.i, int(spt[2]))
        write_seq(out_file=args.out, id_=id_, sequence=seq)



Rather than writing your own get/write functions, you could use an existing fasta parser/indexer such as biopython or samtools. An example with samtools:

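import subprocess

# build the index (samtools writes it next to the input as <input>.fai)
subprocess.call(["samtools", "faidx", args.i])
with open(args.fai) as ref, open(args.out, "a") as out:

    for line in ref:

        spt = line.split()
        id_ = spt[0]
        # ">>" is shell syntax and is not interpreted inside an argument
        # list, so send the output to the file through stdout instead
        subprocess.call(["samtools", "faidx", args.i, id_], stdout=out)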
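If you would rather stay in pure Python, the same index-then-seek idea can be sketched with a single open file handle instead of one `open` per sequence. This is a minimal sketch under assumptions, not a tuned solution: the names `index_fasta`, `_with_newline` and `write_sorted_by_length` are hypothetical, and the whole index is kept in memory, which is a large list for 10 ** 8 records.

```python
def index_fasta(path):
    """Return a list of (sequence_length, header_offset) pairs, one per record."""
    index = []
    offset = 0    # byte offset of the line being read
    start = None  # byte offset of the current record's header line
    length = 0    # residues counted so far in the current record
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                if start is not None:
                    index.append((length, start))
                start, length = offset, 0
            else:
                length += len(line.strip())
            offset += len(line)
        if start is not None:
            index.append((length, start))
    return index


def _with_newline(line):
    """Make sure a line ends with a newline so records never run together."""
    return line if line.endswith(b"\n") else line + b"\n"


def write_sorted_by_length(in_path, out_path, reverse=False):
    """Copy records from in_path to out_path ordered by sequence length."""
    index = index_fasta(in_path)
    index.sort(reverse=reverse)  # equal lengths are ordered by file position
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for _, start in index:
            src.seek(start)
            dst.write(_with_newline(src.readline()))  # header line
            for line in src:
                if line.startswith(b">"):
                    break  # reached the next record
                dst.write(_with_newline(line))
```

Something like `write_sorted_by_length(args.i, args.out)` would then replace the loop over the index file.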

What about using bash and Unix tools (mainly csplit)? Here is a script that does it:

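#!/bin/bash
# split the input into one temporary file per record
csplit -z -f tmp_fasta_file_ "$1" '/>/' '{*}'

# record the line count of each piece (used as a proxy for sequence length)
for file in tmp_fasta_file_*
do
  TMP_FASTA_WC=$(wc -l < "$file" | tr -d ' ')
  FASTA_WC+=$(echo "$file $TMP_FASTA_WC\n")
done

# concatenate the pieces, largest line count first
for filename in $(echo -e $FASTA_WC | sort -k2 -r -n | awk -F" " '{print $1}')
do
  cat "$filename" >> "$2"
done

rm tmp_fasta_file_*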

The first argument is the input fasta file and the second is the output file, i.e. ./script.sh input.fasta output.fasta


You can use fastq-sort (see https://github.com/blaiseli/fastq-tools) with its -L option: convert the fasta to fastq with bioawk, sort, and then convert back to fasta:

cat test.fasta \
    | tee >(wc -l > nb_lines_fasta.txt) \
    | bioawk -c fastx '{l = length($seq); printf "@"$name"\n"$seq"\n+\n%.*s\n", l, "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII"}' \
    | tee >(wc -l > nb_lines_fastq.txt) \
    | fastq-sort -L \
    | tee >(wc -l > nb_lines_fastq_sorted.txt) \
    | bioawk -c fastx '{print ">"$name"\n"$seq}' \
    | tee >(wc -l > nb_lines_fasta_sorted.txt) \
    > test_sorted.fasta

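The fasta → fastq conversion step is the ugly part. A fastq record needs a quality string of the same length as the sequence, so dummy qualities have to be generated. With (bio)awk this is done here with a hack based on the "dynamic width" modifier described at https://www.gnu.org/software/gawk/manual/html_node/Format-Modifiers.html#Format-Modifiers.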

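The IIIII... string has to be at least as long as the longest sequence in the input: printf only truncates it to the length of each sequence, so for a longer sequence the quality line would come out shorter than the sequence and the record would no longer be valid fastq for the second bioawk step.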
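To make the `%.*s` trick in the printf call above more concrete, here is the same truncation written in Python, whose printf-style formatting also accepts `*` as a precision taken from the arguments. The function names and the 100-character filler are assumptions made for this sketch only.

```python
def dummy_quality(sequence, filler="I" * 100):
    """Return the filler truncated to len(sequence) via a dynamic precision."""
    # "%.*s" reads the precision from the argument list, as awk's printf does.
    return "%.*s" % (len(sequence), filler)


def fasta_record_to_fastq(name, sequence):
    """Format one record as four fastq lines with placeholder qualities."""
    return "@%s\n%s\n+\n%s\n" % (name, sequence, dummy_quality(sequence))
```

The sketch has the same limitation as the pipeline: a sequence longer than the filler gets a quality string that is too short.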

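In this example the fasta ends up sorted from the shortest sequence to the longest. To get the opposite order, use the -r option of fastq-sort.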

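Note that fastq-sort writes temporary files to /tmp. If it is interrupted for some reason, you may have to clean up /tmp by hand.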

Edit: there is a simpler way, which is to use the sequence itself as the dummy quality string:

cat test.fasta \
    | bioawk -c fastx '{print "@"$name"\n"$seq"\n+\n"$seq}' \
    | fastq-sort -L \
    | bioawk -c fastx '{print ">"$name"\n"$seq}' \
    > test_sorted.fasta

(This version no longer needs the "dynamic width" printf hack, and the tee steps have been dropped.)


Source: https://habr.com/ru/post/1670402/

