Perl program to simulate restriction enzymes using links, hash tables and subtypes

Question

Perl program to simulate restriction enzymes using links, hash tables and subtypes

I participate in the Perl introclass. I am looking for suggestions on how to approach the assignment. My professor encourages forums. Appointment:

Write a Perl program that takes two files from the command line, an enzyme file and a DNA file. Read the restriction enzyme file and apply the restriction enzymes to the DNA file.
The output will be DNA fragments in the order they appear in the dna file. The name of the output files should be constructed by adding the name of the restriction enzyme to the DNA file name with an underscore between them.
For example, if the enzyme is EcoRI, and the DNA file is named BC161026, the output file must be named BC161026_EcoRI.

My approach is to create a main program and two subsystems as follows:

Main: Not sure how to tie my submarines together?

$ DNA routine: Take the DNA file and delete all new lines to create one line

Enzyme routine: Read and save the lines from the enzyme file, which is located on the command line. Parse the file so that it separates the acronym of the enzyme from the position of the cut. Store the cut position as a regular expression in the hash table Save the abbreviation name in the hash table

Note about the enzyme file format: The enzyme file follows the format known as Staden. Examples:
AatI/AGG'CCT//
   AatII/GACGT'C//
   AbsI/CC'TCGAGG//
(AatI, . - (AGG'CCT, , ). dna :
R = G    B = A (C G T)     ....

, - , ? , , , ?

: TryII/RRR'TTT//

: CCCCCCGGGTTTCCCCCCCCCCCCAAATTTCCCCCCCCCCCCAGATTTCCCCCCCCCCGAGTTTCCCCC

:

CCCCCCGGG
TTTCCCCCCCCCCCCAAA
TTTCCCCCCCCCCCCAGA
TTTCCCCCCCCCCGAG
TTTCCCCC

+3

reference regex perl hash bioinformatics

Koala 28 . '10 5:28

4

, , , , ( ).

; . - ( ) , . (, , , .)

+3

Beta 28 . '10 6:24

( ).
1) filehandles.
2) ,
3) " "
4) , 3.

#!/usr/bin/perl
use strict;
use warnings;
my $file_enzyme=$ARGV[0];
my $file_dna=$ARGV[1];

open DNASEQ, ">$file_dna"."_"."$file_enzyme";
open ENZYME, "$file_enzyme";
open DNA, "$file_dna";
while (<ENZYME>) {
 chomp;
  if( /'(.*)\/\//) { # Extracts the cut point between ' & // in the enzyme file
    my $pattern=$1;
    while (<DNA>) {
     chomp;
     #print $pattern;
     my @output=split/$pattern/,;
     print DNASEQ shift @output,"\n"; #first recognized sequence being output
     foreach (@output) {
        print DNASEQ "$pattern$_\n"; #prefixing the remaining sequences with the cut point pattern
     }
   }
 }
}
close DNA;
close ENZYME;
close DNASEQ;

+3

Philar 28 . '10 6:50

, , ... , :

#!/usr/bin/perl

use warnings;
use strict;
use Getopt::Long;

my ($enz_file, $dna_file);

GetOptions( "e=s" => \$enz_file,
            "d=s" => \$dna_file,
          );

if (! $enz_file || ! $dna_file) {
   # some help text 
   print STDERR<<EOF; 

   Usage: restriction.pl -e enzyme_file -d DNA_file

   The enzyme_file should contain one enzyme entry per line.
   The DNA_file may contain the sequence on one single or on
   several lines; all lines will be concatenated to yield a
   single string.
EOF      
   exit();
}

my %enz_and_patterns; # stores enzyme name and corresponding pattern

open ENZ, "<$enz_file" or die "Could not open file $enz_file: $!";
while (<ENZ>) {
   if (m#^(\w+)/([\w']+)//$#) {
      my $enzyme  = $1; # could also remove those two lines and use 
      my $pattern = $2; # the match variables directly, but this is clearer

      $enz_and_patterns{$enzyme} = $pattern;
   }
}
close ENZ;

my $dna_sequence;

open DNA, "<$dna_file" or die "Could not open file $dna_file: $!";
while (my $line = <DNA>) {
   chomp $line;
   $dna_sequence .= $line; # append the current bit to the sequence
                           # that has been read so far
}
close DNA;

foreach my $enzyme (keys %enz_and_patterns) {
   my $dna_seq_processed = $dna_sequence; # local copy so that we retain the original

   # now translate the restriction pattern to a regular expression pattern:
   my $pattern = $enz_and_patterns{$enzyme};
   $pattern    =~ s/R/[GA]/g; # use character classes
   $pattern    =~ s/B/[^A]/g;
   $pattern    =~ s/(.+)'(.+)/($1)($2)/; # remove the ', but due to the grouping
                                         # we "remember" its position

   $dna_seq_processed =~ s/$pattern/$1\n$2/g; # in effect we are simply replacing
                                              # each ' with a newline character
   my $outfile = "${dna_file}_$enzyme";
   open OUT, ">$outfile" or die "Could not open file $outfile: $!";
   print OUT $dna_seq_processed , "\n";
   close OUT;
}

TryII, .

Since this is a task that can be handled by writing a few lines of non-repeating code, I did not feel that creating separate routines would be justified. I hope forgive me ... :)

+2

canavanin Nov 28 '10 at 20:29

source share

Joel Berger · Accepted Answer · 2010-11-28T17:36:50+0000

, , , , . , . , , - , . , , , , , .

#!/usr/bin/env perl

use strict;
use warnings;

use Getopt::Long;

my ($enzyme_file, $dna_file);
my $write_output = 0;
my $verbose = 0;
my $help = 0;
GetOptions(
  'enzyme=s' => \$enzyme_file,
  'dna=s' => \$dna_file,
  'output' => \$write_output,
  'verbose' => \$verbose,
  'help' => \$help
);

$help = 1 unless ($dna_file && $enzyme_file);
help() if $help; # exits

# 'Main'
my $dna = getDNA($dna_file);
my %enzymes = %{ getEnzymes($enzyme_file) }; # A function cannot return a hash, so return a hashref and then store the referenced hash
foreach my $enzyme (keys %enzymes) {
  print "Applying enzyme " . $enzyme . " gives:\n";
  my $dna_holder = $dna;
  my ($precut, $postcut) = ($enzymes{$enzyme}{'precut'}, $enzymes{$enzyme}{'postcut'});

  my $R = qr/[GA]/;
  my $B = qr/[CGT]/;

  $precut =~ s/R/${R}/g;
  $precut =~ s/B/${B}/g;
  $postcut =~ s/R/${R}/g;
  $postcut =~ s/B/${B}/g;
  print "\tPre-Cut pattern: " . $precut . "\n" if $verbose;
  print "\tPost-Cut pattern: " . $postcut . "\n" if $verbose;

  #while(1){
  #  if ($dna_holder =~ s/(.*${precut})(${postcut}.*)/$1/ ) {
  #    print "\tFound section:" . $2 . "\n" if $verbose;
  #    print "\tRemaining DNA: " . $1 . "\n" if $verbose;
  #    unshift @{ $enzymes{$enzyme}{'cut_dna'} }, $2;
  #  } else {
  #    unshift @{ $enzymes{$enzyme}{'cut_dna'} }, $dna_holder;
  #    print "\tNo more cuts.\n" if $verbose;
  #    print "\t" . $_ . "\n" for @{ $enzymes{$enzyme}{'cut_dna'} };
  #    last;
  #  }
  #}
  unless ($dna_holder =~ s/(${precut})(${postcut})/$1'$2/g) {
    print "\tHas no effect on given strand\n" if $verbose;
  }
  @{ $enzymes{$enzyme}{'cut_dna'} } = split(/'/, $dna_holder);
  print "\t$_\n" for @{ $enzymes{$enzyme}{'cut_dna'} };

  writeOutput($dna_file, $enzyme, $enzymes{$enzyme}{'cut_dna'}) if $write_output; #Note that $enzymes{$enzyme}{'cut_dna'} is an arrayref already
  print "\n";
}

sub getDNA {
  my ($dna_file) = @_;

  open(my $dna_handle, '<', $dna_file) or die "Cannot open file $dna_file";
  my @dna_array = <$dna_handle>;
  chomp(@dna_array);

  my $dna = join('', @dna_array);

  print "Using DNA:\n" . $dna . "\n\n" if $verbose;
  return $dna;
}

sub getEnzymes {
  my ($enzyme_file) = @_;
  my %enzymes;

  open(my $enzyme_handle, '<', $enzyme_file) or die "Cannot open file $enzyme_file";
  while(<$enzyme_handle>) {
    chomp;
    if(m{([^/]*)/([^']*)'([^/]*)//}) {
      print "Found Enzyme " . $1 . ":\n\tPre-cut: " . $2 . "\n\tPost-cut: " . $3 . "\n" if $verbose;
      $enzymes{$1} = {
        precut => $2,
        postcut => $3,
        cut_dna => [] #Added to show the empty array that will hold the cut DNA sections
      };
    }
  }

  print "\n" if $verbose;
  return \%enzymes;
}

sub writeOutput {

  my ($dna_file, $enzyme, $cut_dna_ref) = @_;

  my $outfile = $dna_file . '_' . $enzyme;
  print "\tSaving data to $outfile\n" if $verbose; 
  open(my $outfile_handle, '>', $outfile) or die "Cannot open $outfile for writing";

  print $outfile_handle $_ . "\n" for @{ $cut_dna_ref };
}

sub help {

  my $filename = (split('/', $0))[-1];

  my $enzyme_text = <<'END';
AatI/AGG'CCT//
AatII/GACGT'C//
AbsI/CC'TCGAGG//
TryII/RRR'TTT//
Test/AAA'TTT//
END

  my $dna_text = <<'END';
CCCCCCGGGTTTCCCCCCC
CCCCCAAATTTCCCCCCCCCCCCAGATTTC
CCCCCCCCCGAGTTTCCCCC
END

  print <<END;
Usage: 
    $filename --enzyme (-e) <enzyme-filename> --dna (-d) <dna-filename> [options] (files may come in either order)
    $filename -h    (shows this help)

Options: 
    --verbose (-v)  print additional (debugging) information
    --output (-o)   output final data to files


Files:
The DNA file contains one DNA string which may be broken over many lines. E.G.:

$dna_text

The enzymes file constains enzyme definitions, one per line. E.G.:

$enzyme_text
END

exit;
}

Edit: cut_dna , , , .

2: , , .

3: , canavanin . , ('), . , - 5 .

4: . ()

my @names = ('cat','dog','sheep'); 
foreach my $name (@names) { #$name is lexical, ie dies after each loop
  open(my $handle, '>', $name); #open a lexical handle for the file, also dies each loop
  print $handle $name; #write to the handle
  #$handles closes automatically when it "goes out of scope"
}

Perl program to simulate restriction enzymes using links, hash tables and subtypes

More articles: