How to convert the PHYLIP format to FASTA

I am just starting to work with Perl and I have a question. I have a PHYLIP file and I need to convert it to FASTA. I have started writing a script. First, I removed the spaces in the lines. Now I need to reflow the sequences so that each line contains 60 amino acids and each sequence identifier is printed on its own line. Can someone give me advice?

2 answers

BioPerl's Bio::AlignIO module can help. It supports the PHYLIP format:

phylip2fasta.pl

use strict;
use warnings;
use Bio::AlignIO;

# http://doc.bioperl.org/bioperl-live/Bio/AlignIO.html
# http://doc.bioperl.org/bioperl-live/Bio/AlignIO/phylip.html
# http://www.bioperl.org/wiki/PHYLIP_multiple_alignment_format

my ($inputfilename) = @ARGV;
die "must provide phylip file as 1st parameter...\n" unless $inputfilename;

my $in = Bio::AlignIO->new(
    -file        => $inputfilename,
    -format      => 'phylip',
    -interleaved => 1,
);
my $out = Bio::AlignIO->new(
    -fh     => \*STDOUT,
    -format => 'fasta',
);

while ( my $aln = $in->next_aln() ) {
    $out->write_aln($aln);
}

$ perl phylip2fasta.pl test.phylip

>Turkey/1-42
AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT
>Salmo_gair/1-42
AAGCCTTGGCAGTGCAGGGTGAGCCGTGGCCGGGCACGGTAT
>H._Sapiens/1-42
ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA
>Chimp/1-42
AAACCCTTGCCGTTACGCTTAAACCGAGGCCGGGACACTCAT
>Gorilla/1-42
AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA

test.phylip (see http://evolution.genetics.washington.edu/phylip/doc/sequence.html):

  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT

GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA
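
In the sample above, names such as "Salmo gair" run straight into the sequence because strict PHYLIP reserves the first 10 columns of each name line for the name. As a rough illustration of what a parser has to do with such a line, here is a minimal sketch. It is not BioPerl's implementation, the helper name split_phylip_line is hypothetical, and it assumes names are exactly 10 columns wide.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): split one strict-PHYLIP
# name line into (name, residues). Assumes the name occupies exactly
# the first 10 columns of the line.
sub split_phylip_line {
    my ($line) = @_;
    $line =~ s/[\r\n]+\z//;                       # drop the line ending
    my $name = substr($line, 0, 10);              # fixed-width name field
    my $seq  = length($line) > 10 ? substr($line, 10) : '';
    $name =~ s/\s+\z//;                           # trim padding after the name
    $seq  =~ s/\s+//g;                            # remove spacing between blocks
    return ($name, $seq);
}

my ($name, $seq) = split_phylip_line("Salmo gairAAGCCTTGGC AGTGCAGGGT\n");
print ">$name\n$seq\n";
```

Splitting on whitespace instead would cut "Salmo gair" in two, which is why the fixed-width field matters for names that contain spaces.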

If you have access to BioPerl, I suggest using it (see the other answer). If not, here is a quick script that I used in an old homework assignment a few years ago. It might work for you.

One note: it prints each FASTA sequence on a single line, so you will have to edit the print statement at the end to wrap the output, e.g. at 70 AA per line.

#!/usr/bin/perl
use warnings;
use strict;

<DATA> =~ /(\d+)/;    # first number is number of species
my $num_species = $1;
my $i = 0;
my @species;
my @acids;

# first $num_species rows have the species name
for ($i = 0; $i < $num_species; $i++) {
    my @line = split /\s+/, <DATA>;
    chomp @line;
    push @species, shift(@line);
    push @acids, join("", @line);
}

# Get the rest of the AAs
$i = 0;
while (<DATA>) {
    chomp;
    s/\r//g;             # remove \r
    s/\s+//g;            # remove spaces
    next if $_ eq '';    # skip blank lines between blocks
    $acids[$i] .= $_;
    $i = ($i + 1) % $num_species;
}

# Print them
for ($i = 0; $i < $num_species; $i++) {
    print "> ", $species[$i], "\n";
    # remove the gaps ("-"); comment out the next line to keep them
    $acids[$i] =~ s/-//g;
    print $acids[$i], "\n\n";
}

# Simple PHYLIP Amino Acid file
__DATA__
10 234
Cow       MAYPMQLGFQ DATSPIMEEL LHFHDHTLMI VFLISSLVLY IISLMLTTKL
Carp      MAHPTQLGFK DAAMPVMEEL LHFHDHALMI VLLISTLVLY IITAMVSTKL
Chicken   MANHSQLGFQ DASSPIMEEL VEFHDHALMV ALAICSLVLY LLTLMLMEKL
Human     MAHAAQVGLQ DATSPIMEEL ITFHDHALMI IFLICFLVLY ALFLTLTTKL
Loach     MAHPTQLGFQ DAASPVMEEL LHFHDHALMI VFLISALVLY VIITTVSTKL
Mouse     MAYPFQLGLQ DATSPIMEEL MNFHDHTLMI VFLISSLVLY IISLMLTTKL
Rat       MAYPFQLGLQ DATSPIMEEL TNFHDHTLMI VFLISSLVLY IISLMLTTKL
Seal      MAYPLQMGLQ DATSPIMEEL LHFHDHTLMI VFLISSLVLY IISLMLTTKL
Whale     MAYPFQLGFQ DAASPIMEEL LHFHDHTLMI VFLISSLVLY IITLMLTTKL
Frog      MAHPSQLGFQ DAASPIMEEL LHFHDHTLMA VFLISTLVLY IITIMMTTKL

THTSTMDAQE VETIWTILPA IILILIALPS LRILYMMDEI NNPSLTVKTM
TNKYILDSQE IEIVWTILPA VILVLIALPS LRILYLMDEI NDPHLTIKAM
S-SNTVDAQE VELIWTILPA IVLVLLALPS LQILYMMDEI DEPDLTLKAI
TNTNISDAQE METVWTILPA IILVLIALPS LRILYMTDEV NDPSLTIKSI
TNMYILDSQE IEIVWTVLPA LILILIALPS LRILYLMDEI NDPHLTIKAM
THTSTMDAQE VETIWTILPA VILIMIALPS LRILYMMDEI NNPVLTVKTM
THTSTMDAQE VETIWTILPA VILILIALPS LRILYMMDEI NNPVLTVKTM
THTSTMDAQE VETVWTILPA IILILIALPS LRILYMMDEI NNPSLTVKTM
THTSTMDAQE VETVWTILPA IILILIALPS LRILYMMDEV NNPSLTVKTM
TNTNLMDAQE IEMVWTIMPA ISLIMIALPS LRILYLMDEV NDPHLTIKAI

GHQWYWSYEY TDYEDLSFDS YMIPTSELKP GELRLLEVDN RVVLPMEMTI
GHQWYWSYEY TDYENLGFDS YMVPTQDLAP GQFRLLETDH RMVVPMESPV
GHQWYWTYEY TDFKDLSFDS YMTPTTDLPL GHFRLLEVDH RIVIPMESPI
GHQWYWTYEY TDYGGLIFNS YMLPPLFLEP GDLRLLDVDN RVVLPIEAPI
GHQWYWSYEY TDYENLSFDS YMIPTQDLTP GQFRLLETDH RMVVPMESPI
GHQWYWSYEY TDYEDLCFDS YMIPTNDLKP GELRLLEVDN RVVLPMELPI
GHQWYWSYEY TDYEDLCFDS YMIPTNDLKP GELRLLEVDN RVVLPMELPI
GHQWYWSYEY TDYEDLNFDS YMIPTQELKP GELRLLEVDN RVVLPMEMTI
GHQWYWSYEY TDYEDLSFDS YMIPTSDLKP GELRLLEVDN RVVLPMEMTI
GHQWYWSYEY TNYEDLSFDS YMIPTNDLTP GQFRLLEVDN RMVVPMESPT

RMLVSSEDVL HSWAVPSLGL KTDAIPGRLN QTTLMSSRPG LYYGQCSEIC
RVLVSAEDVL HSWAVPSLGV KMDAVPGRLN QAAFIASRPG VFYGQCSEIC
RVIITADDVL HSWAVPALGV KTDAIPGRLN QTSFITTRPG VFYGQCSEIC
RMMITSQDVL HSWAVPTLGL KTDAIPGRLN QTTFTATRPG VYYGQCSEIC
RILVSAEDVL HSWALPAMGV KMDAVPGRLN QTAFIASRPG VFYGQCSEIC
RMLISSEDVL HSWAVPSLGL KTDAIPGRLN QATVTSNRPG LFYGQCSEIC
RMLISSEDVL HSWAIPSLGL KTDAIPGRLN QATVTSNRPG LFYGQCSEIC
RMLISSEDVL HSWAVPSLGL KTDAIPGRLN QTTLMTMRPG LYYGQCSEIC
RMLVSSEDVL HSWAVPSLGL KTDAIPGRLN QTTLMSTRPG LFYGQCSEIC
RLLVTAEDVL HSWAVPSLGV KTDAIPGRLH QTSFIATRPG VFYGQCSEIC

GSNHSFMPIV LELVPLKYFE KWSASML--- ----
GANHSFMPIV VEAVPLEHFE NWSSLMLEDA SLGS
GANHSYMPIV VESTPLKHFE AWSSL----- -LSS
GANHSFMPIV LELIPLKIFE M-------GP VFTL
GANHSFMPIV VEAVPLSHFE NWSTLMLKDA SLGS
GSNHSFMPIV LEMVPLKYFE NWSASMI--- ----
GSNHSFMPIV LEMVPLKYFE NWSASMI--- ----
GSNHSFMPIV LELVPLSHFE KWSTSML--- ----
GSNHSFMPIV LELVPLEVFE KWSVSML--- ----
GANHSFMPIV VEAVPLTDFE NWSSSML-EA SL--

Output:

> Cow
MAYPMQLGFQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETIWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLSFDSYMIPTSELKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLNQTTLMSSRPGLYYGQCSEICGSNHSFMPIVLELVPLKYFEKWSASML

> Carp
MAHPTQLGFKDAAMPVMEELLHFHDHALMIVLLISTLVLYIITAMVSTKLTNKYILDSQEIEIVWTILPAVILVLIALPSLRILYLMDEINDPHLTIKAMGHQWYWSYEYTDYENLGFDSYMVPTQDLAPGQFRLLETDHRMVVPMESPVRVLVSAEDVLHSWAVPSLGVKMDAVPGRLNQAAFIASRPGVFYGQCSEICGANHSFMPIVVEAVPLEHFENWSSLMLEDASLGS

> Chicken
MANHSQLGFQDASSPIMEELVEFHDHALMVALAICSLVLYLLTLMLMEKLSSNTVDAQEVELIWTILPAIVLVLLALPSLQILYMMDEIDEPDLTLKAIGHQWYWTYEYTDFKDLSFDSYMTPTTDLPLGHFRLLEVDHRIVIPMESPIRVIITADDVLHSWAVPALGVKTDAIPGRLNQTSFITTRPGVFYGQCSEICGANHSYMPIVVESTPLKHFEAWSSLLSS

> Human
MAHAAQVGLQDATSPIMEELITFHDHALMIIFLICFLVLYALFLTLTTKLTNTNISDAQEMETVWTILPAIILVLIALPSLRILYMTDEVNDPSLTIKSIGHQWYWTYEYTDYGGLIFNSYMLPPLFLEPGDLRLLDVDNRVVLPIEAPIRMMITSQDVLHSWAVPTLGLKTDAIPGRLNQTTFTATRPGVYYGQCSEICGANHSFMPIVLELIPLKIFEMGPVFTL

> Loach
MAHPTQLGFQDAASPVMEELLHFHDHALMIVFLISALVLYVIITTVSTKLTNMYILDSQEIEIVWTVLPALILILIALPSLRILYLMDEINDPHLTIKAMGHQWYWSYEYTDYENLSFDSYMIPTQDLTPGQFRLLETDHRMVVPMESPIRILVSAEDVLHSWALPAMGVKMDAVPGRLNQTAFIASRPGVFYGQCSEICGANHSFMPIVVEAVPLSHFENWSTLMLKDASLGS

> Mouse
MAYPFQLGLQDATSPIMEELMNFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETIWTILPAVILIMIALPSLRILYMMDEINNPVLTVKTMGHQWYWSYEYTDYEDLCFDSYMIPTNDLKPGELRLLEVDNRVVLPMELPIRMLISSEDVLHSWAVPSLGLKTDAIPGRLNQATVTSNRPGLFYGQCSEICGSNHSFMPIVLEMVPLKYFENWSASMI

> Rat
MAYPFQLGLQDATSPIMEELTNFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETIWTILPAVILILIALPSLRILYMMDEINNPVLTVKTMGHQWYWSYEYTDYEDLCFDSYMIPTNDLKPGELRLLEVDNRVVLPMELPIRMLISSEDVLHSWAIPSLGLKTDAIPGRLNQATVTSNRPGLFYGQCSEICGSNHSFMPIVLEMVPLKYFENWSASMI

> Seal
MAYPLQMGLQDATSPIMEELLHFHDHTLMIVFLISSLVLYIISLMLTTKLTHTSTMDAQEVETVWTILPAIILILIALPSLRILYMMDEINNPSLTVKTMGHQWYWSYEYTDYEDLNFDSYMIPTQELKPGELRLLEVDNRVVLPMEMTIRMLISSEDVLHSWAVPSLGLKTDAIPGRLNQTTLMTMRPGLYYGQCSEICGSNHSFMPIVLELVPLSHFEKWSTSML

> Whale
MAYPFQLGFQDAASPIMEELLHFHDHTLMIVFLISSLVLYIITLMLTTKLTHTSTMDAQEVETVWTILPAIILILIALPSLRILYMMDEVNNPSLTVKTMGHQWYWSYEYTDYEDLSFDSYMIPTSDLKPGELRLLEVDNRVVLPMEMTIRMLVSSEDVLHSWAVPSLGLKTDAIPGRLNQTTLMSTRPGLFYGQCSEICGSNHSFMPIVLELVPLEVFEKWSVSML

> Frog
MAHPSQLGFQDAASPIMEELLHFHDHTLMAVFLISTLVLYIITIMMTTKLTNTNLMDAQEIEMVWTIMPAISLIMIALPSLRILYLMDEVNDPHLTIKAIGHQWYWSYEYTNYEDLSFDSYMIPTNDLTPGQFRLLEVDNRMVVPMESPTRLLVTAEDVLHSWAVPSLGVKTDAIPGRLHQTSFIATRPGVFYGQCSEICGANHSFMPIVVEAVPLTDFENWSSSMLEASL
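
As the note in this answer says, the script prints each sequence on a single line. One possible way to wrap the output at a fixed width is sketched below. The helper name wrap_sequence and the default width of 60 (the width asked for in the question) are assumptions for illustration, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: break a sequence into newline-terminated lines of
# at most $width characters. The default of 60 is an assumption taken
# from the question; pass 70 or any other width as needed.
sub wrap_sequence {
    my ($seq, $width) = @_;
    $width ||= 60;
    my $wrapped = '';
    for (my $pos = 0; $pos < length($seq); $pos += $width) {
        $wrapped .= substr($seq, $pos, $width) . "\n";
    }
    return $wrapped;
}

# Small demonstration with a width of 8 so the wrapping is visible.
print "> Example\n", wrap_sequence("MAYPMQLGFQDATSPIMEEL", 8);
```

In the script above, the final print statement could then become something like print wrap_sequence($acids[$i], 60), "\n"; so that each record is wrapped and still followed by a blank line.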

Source: https://habr.com/ru/post/1469092/

