sed: convert a multi-line block to a single line (example: FASTA to PHYLIP format)

In short:

How can I convert from FASTA to a PHYLIP-like format (without the sequence and residue counts at the top of the file) using sed?

The fasta format is as follows:

>sequence1
AATCG
GG-AT
>sequence2
AGTCG
GGGAT

The number of lines in a sequence may vary.

I want to convert it to this:

sequence1 AATCG GG-AT
sequence2 AGTCG GGGAT

My question seems simple, but I lack a real understanding of the more advanced sed commands: the multiline commands and the commands that use the hold buffer.

Here is the implementation idea I had: fill the pattern space with a sequence and print it only when a new sequence label is encountered. For this, I would:

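  • for a line that does not match ^>: append it to the pattern space;
  • for a line that matches ^>: print the pattern space, then start over with the new label.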

After reading the manual, one point is still unclear to me:

  • what is the difference between P and p?

I know this can be done with python, perl or awk, but I would like to do it with sed.


Edit: the script below works, but only for the example above, because the line numbers are hard-coded:

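#!/bin/sed -nf
1h                    # line 1 (first label): copy it to the hold space
2,3H                  # lines 2-3: append them to the hold space
4{x; s/\n/ /g; p}     # line 4 (second label): swap buffers, join the first record, print it
5H                    # line 5: append to the hold space
6{H;x; s/\n/ /g; p}   # line 6 (last line): append, swap, join the second record, print it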

Run it with: sed -nf fa2phy.sed my.fasta


With sed:

sed '/>/N;:A;/\n>/!{s/\n/ /;N;bA};h;s/\(.*\)\n.*/\1/p;x;s/.*\n//;bA' infile
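For readers who find the one-liner hard to follow, here is a sketch of the same commands laid out one per line with comments. It assumes GNU sed, because it relies on N printing the pattern space when there is no more input. The function name fasta_join_gnu and the final s/^>// filter (which drops the leading '>' so the output matches the format asked for in the question) are additions for illustration and are not part of the original answer.

```shell
# fasta_join_gnu: hypothetical helper name; reads FASTA on stdin.
# Assumes GNU sed (N at end of input prints the pattern space and exits).
fasta_join_gnu() {
  sed '
    # On a label line, pull in the line that follows it.
    />/N
    :A
    # While the buffer holds no second label: join the last appended
    # line with a space, append the next line, and check again.
    /\n>/!{
      s/\n/ /
      N
      bA
    }
    # Buffer is now: finished record, newline, next label.
    # Save a copy, then print only the finished record.
    h
    s/\(.*\)\n.*/\1/p
    # Restore the copy, keep only the new label, and continue.
    x
    s/.*\n//
    bA
  ' | sed 's/^>//'   # added step: strip the leading ">" from each label
}
```

For example, fasta_join_gnu < infile prints "sequence1 AATCG GG-AT" and "sequence2 AGTCG GGGAT" for the sample input from the question.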

With awk, here are three ways to do it.

1st:

awk '/^>/{sub(/>/,"");if(val){print val, val2};val=$0;val2="";next} {val2=val2?val2 FS $0:$0} END{print val, val2}'  Input_file

2nd:

awk -v RS=">" -v FS="\n" '{for(i=1;i<=NF;i++){printf("%s%s",$i,i==NF?"\n":" ")}}'   Input_file

3rd:

awk -v RS=">" '{gsub(/\n/," ");} NF'   Input_file
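The 1st variant above can also be spread over several lines, which makes the bookkeeping easier to see. This is only a sketch of the same idea; the function name fasta_join_awk and the variable names label and seq are chosen here for illustration. Unlike the 2nd and 3rd variants, which can leave a trailing space at the end of each line, this form only puts a space between fields.

```shell
# fasta_join_awk: hypothetical helper name; reads FASTA on stdin.
fasta_join_awk() {
  awk '
    # Label line: flush the previous record (if any), then start a new one.
    /^>/ {
      if (label != "") print label, seq
      label = substr($0, 2)   # drop the leading ">"
      seq = ""
      next
    }
    # Sequence line: append it to the current record, separated by a space.
    { seq = (seq == "" ? $0 : seq " " $0) }
    # End of input: flush the last record.
    END { if (label != "") print label, seq }
  '
}
```

Called as fasta_join_awk < Input_file, it prints "sequence1 AATCG GG-AT" and "sequence2 AGTCG GGGAT" for the sample input from the question.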

The script, fa2phy.sed:

#!/bin/sed -nf

:readseq
${H;b out}              # if last line, append to hold, and goto 'out'
1{h;n;b readseq}        # if first, overwrite hold, and start again at 'readseq'
/^>/!{H; n; b readseq}  # if not a sequence label, append to hold, read next line, start again at 'readseq'. Else, it continues to 'out'

:out
x;         # exchange hold content with pattern content
s/^>//;    # substitute the starting '>'
s/\n/  /g; # substitute each newline with 2 spaces
p;         # print pattern buffer
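The same hold-buffer idea can also be written without the explicit readseq loop, letting sed's normal line-by-line cycle do the reading. The following is a sketch under the assumption that the input is well formed (it starts with a label line and every label is followed by at least one sequence line). The function name fasta_join_hold is hypothetical, and this variant joins lines with one space rather than two.

```shell
# fasta_join_hold: hypothetical helper name; reads FASTA on stdin.
fasta_join_hold() {
  sed -n '
    # Label line: swap, so the pattern space gets the previous record
    # and the hold space gets the new label.
    /^>/{
      x
      # On line 1 there is no previous record, so start the next cycle.
      1d
      b out
    }
    # Sequence line: append it to the record in the hold space.
    H
    # Unless this is the last line, start the next cycle.
    $!d
    # Last line: fetch the final record from the hold space.
    x
    :out
    # Drop the leading ">", join the lines with single spaces, print.
    s/^>//
    s/\n/ /g
    p
  '
}
```

For the sample input from the question, fasta_join_hold < my.fasta prints "sequence1 AATCG GG-AT" and "sequence2 AGTCG GGGAT".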



Source: https://habr.com/ru/post/1688316/

