How to merge two FASTA files (one file with line break) in Perl?

I have the following two Fasta files:

file1.fasta p>

>0
GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT
>1
GTTAAGTTATATCAAACTAAATATACATACTATAAA
>2
GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC

file2.qual

>0
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 15 40 40
>1
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 40 40 40
40 40 40 40 40 40 40 40 40 40 40
>2
40 40 40 40 7 40 40 5 40 40 40 40 40 40 40 40 37 13 31 20 15 40 10 11 4
40 8 3 29 10 19 18 40 19 15 5

Note the line break in the qual file for each fasta header marked with a>. The number of file headers ('>') is the same for both files. Number of numeric qualities = sequence length.

What I want to do is add these two files:

GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT  40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 15 40 40
GTTAAGTTATATCAAACTAAATATACATACTATAAA  40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40
GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC  40 40 40 40 7 40 40 5 40 40 40 40 40 40 40 40 37 13 31 20 15 40 10 11 4 40 8 3 29 10 19 18 40 19 15 5

But for some reason, my code below cannot do it right? Especially the second line of each entry in the qual file is not printed.

use strict;
use Data::Dumper;        
use Carp;
use File::Basename;      

my $fastafile = $ARGV[0] || "reads/2039F.2.fasta"; 
my $base      = basename( $fastafile, ".fasta" );
my $qualfile  = "reads/" . $base . ".qual";
print "$qualfile\n";

open SEQ, '<', $fastafile or die $!; #Seq
open PRB, '<', $qualfile or die $!; #quality


while (my $seq = <SEQ>) {
     my $qual = <PRB>;
     chomp($seq);
     chomp($qual);

     if ($seq =~ /^>/ || $qual =~ /^>/) {
         next;
     }
     else {
         print "$seq\t$qual\n";      
     }

}

What is the right way to do this?

+3
source share
3 answers

( ) , . FASTA - /:

local $/ = "\n>";
while (my $seq = <SEQ>) {
     my $qual = <PRB>;
     chomp($seq);  $seq =~ s/^>*.+\n//;  $seq =~ s/\n//g;
     chomp($qual);  $qual =~ s/^>*.+\n//;  $qual =~ s/\n/ /g;

     print "$seq\t$qual\n";      

}

FASTA .

+3

, , , " > " ​​ , " > " ​​.

, , , :

1: >0 
2: >0
1: GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT
2: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
1: >1
2: 40 40 40 40 40 40 40 40 15 40 40
1: GTTAAGTTATATCAAACTAAATATACATACTATAAA
2: >1
1: >2
2: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 40 40 40
1: GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC
2: 40 40 40 40 40 40 40 40 40 40 40
1: EOF
2: >2
1: EOF
2: 40 40 40 40 7 40 40 5 40 40 40 40 40 40 40 40 37 13 31 20 15 40 10 11 4
1: EOF
2: 40 8 3 29 10 19 18 40 19 15 5

, , :

1: GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT
2: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
1: GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC
2: 40 40 40 40 40 40 40 40 40 40 40

, , .

, .

fileIO: {
  while( 1 ){ 
   my $seq; 
   my $qual  = q{};
   while( 1 ){ 
     $seq = <SEQ>; 
     last fileIO if not $seq;  # stop at end of file
     last if $seq !~ /^>/; 
  }
  while( 1 ){ 
     my $qual_in = <PRB>;
     last fileIO if not $qual_in; # stop at end of file 
     last if $qual_in =~ /^>/ and $qual ne q{}; 
     next if $qual_in =~ /^>/ and $qual eq q{}; 
     $qual .= $qual_in;
  }
  print "$seq \n $qual \n";

 }
}

Update

, , -, . , , , - .

use strict;
use warnings;

# 
#  readUntilNext( $fileHandle, \$scalar_ref ); 
#
#  returns 0 when nothing could be read from the fileHandle. 
#  otherwise returns 1; 
#

sub readUntilNext {
    my ($fh)            = shift;
    my ($output)        = shift;
    my ($output_buffer) = '';
    while (1) {
        my $line = <$fh>;
        if ( !$line ) { # No more data
            # No data to flush to user, return false.
            return 0 if $output_buffer eq q{};
            last;  # data to  flush to user, loop exit. 
        }
        if ( $line =~ /^>/ ) {
            # Didn't get anything, keep looking. 
            next if $output_buffer eq q{};
            # Got something, flush data to user. 
            last;
        }
        chomp($line);
        $output_buffer .= $line;
    }
    # Data to flush to user 
    # Write to the scalar-reference 
    $$output .= $output_buffer;
    return 1;
}

open my $m, '<', 'a.txt';
open my $n , '<', 'b.txt';
# Creates 2 scalar references every loop, and only loops as long 
# as both files have data. 
while ( readUntilNext( $m, \my $seq ) && readUntilNext( $n, \my $qual ) ) {
    print "$seq\t$qual\n";
}

, , , .

, \my stuff

while( readUntilNext( $m, \my $seq ) ) { 
}

my $seq; 
while( readUntilNext( $m, \$seq ) ) { 
}

, , , ;

:

while( 1 ){ 
 my $seq; 
 last if not readUntilNext($m, \$seq);
 do { 
    # loop body here
 }
}
+8

, perl, :

prompt>grep -v '^>[0-9]' file1.fasta > tmp1
prompt>(tr '\012' ' ' < file2.qual; echo) | sed 's/>[0-9]* /\n/g' | sed 1d > tmp2
prompt>paste tmp1 tmp2
GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT    40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 15 40 40
GTTAAGTTATATCAAACTAAATATACATACTATAAA    40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40
GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC    40 40 40 40 7 40 40 5 40 40 40 40 40 40 40 40 37 13 31 20 15 40 10 11 4 40 8 3 29 10 19 18 40 19 15 5
prompt>

paste (, , - - ").

The second command line first converts all newline characters to spaces, and an echo command is added to add the final newline to the input (because sed will ignore lines that do not have EOL), thereby combining all input lines into one line, which is then a command sed breaks up again (note on portability: not all sed programs will work with an arbitrary string length, but GNU sed does).

+4
source

Source: https://habr.com/ru/post/1706223/


All Articles