Perl: removing duplicates from a large dataset

I use Perl to create a list of unique exons (which are units of genes).

I created a file in this format (with hundreds of thousands of lines):

    chr1 1000 2000 gene1
    chr1 3000 4000 gene2
    chr1 5000 6000 gene3
    chr1 1000 2000 gene4

Column 1 is the chromosome, column 2 is the start coordinate of the exon, column 3 is the end coordinate of the exon, and column 4 is the gene name.

Since genes are often built from different arrangements of the same exons, the same exon can appear in several genes (compare the first and fourth lines). I want to remove these "duplicates", i.e. drop either gene1 or gene4 (it doesn't matter which one goes).

I've been banging my head against the wall for hours trying to do what I think is a simple task. Can someone point me in the right direction? I know people often use hashes to remove duplicate elements, but these lines are not exact duplicates (the gene names differ). It is also important that I do not lose the gene name; otherwise this would be easier.

Here is the complete, non-working loop I tried. In the exon array each line is stored as a scalar, hence the subroutine that splits it into fields. Don't laugh. I know this doesn't work, but at least you can see (I hope) what I'm trying to do:

    for (my $i = 0; $i < scalar @exons; $i++) {
        my @temp_line = line_splitter($exons[$i]);               # runs subroutine turning scalar into array
        for (my $j = 0; $j < scalar @exons_dup; $j++) {
            my @inner_temp_line = line_splitter($exons_dup[$j]); # runs subroutine turning scalar into array
            unless (($temp_line[1] == $inner_temp_line[1]) &&    # this check is meant to make the
                    ($temp_line[3] eq $inner_temp_line[3])) {    # block below skip identical lines
                if (($temp_line[1] == $inner_temp_line[1]) &&    # if the coordinates are the same
                    ($temp_line[2] == $inner_temp_line[2])) {    # between the comparisons
                    splice(@exons, $i, 1);                       # delete the first one
                }
            }
        }
    }

4 answers
    my @exons = (
        'chr1 1000 2000 gene1',
        'chr1 3000 4000 gene2',
        'chr1 5000 6000 gene3',
        'chr1 1000 2000 gene4',
    );

    my %unique_exons = map {
        my ($chro, $scoor, $ecoor, $gene) = split(/\s+/, $_);
        "$chro $scoor $ecoor" => $gene;
    } @exons;

    print "$_ $unique_exons{$_} \n" for keys %unique_exons;

This gives you uniqueness, and the last gene name seen for each exon is the one kept. It outputs:

    chr1 1000 2000 gene4
    chr1 5000 6000 gene3
    chr1 3000 4000 gene2
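If you would rather keep the first gene seen for each exon instead of the last, a minimal variation on the same idea (a sketch, reusing the @exons array from above) is to skip keys that already exist:

    my %unique_exons;
    for my $exon (@exons) {
        my ($chro, $scoor, $ecoor, $gene) = split /\s+/, $exon;
        my $key = "$chro $scoor $ecoor";
        # only the first gene for each exon is recorded; later duplicates are ignored
        $unique_exons{$key} = $gene unless exists $unique_exons{$key};
    }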

You can use a hash to deduplicate in passing, but you need a way to extract and join the parts of each line that you want to use for duplicate detection.

    sub extract_dup_check_string {
        my $exon  = shift;
        my @parts = line_splitter($exon);
        # modify to suit:
        my $dup_check_string = join(';', @parts[0..2]);
        return $dup_check_string;
    }

    my %seen;
    my @deduped_exons = grep !$seen{ extract_dup_check_string($_) }++, @exons;
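For example, with a stand-in for the asker's line_splitter (an assumption here; it just splits on whitespace), the grep keeps the first occurrence of each exon and drops the rest:

    # stand-in for the asker's line_splitter, assumed to split a line on whitespace
    sub line_splitter { return split /\s+/, shift }

    my @exons = (
        'chr1 1000 2000 gene1',
        'chr1 3000 4000 gene2',
        'chr1 1000 2000 gene4',
    );

    my %seen;
    my @deduped = grep !$seen{ extract_dup_check_string($_) }++, @exons;
    print "$_\n" for @deduped;   # gene1's and gene2's lines survive; gene4's duplicate is dropped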

You can use a hash to keep track of the duplicates you have already seen and skip them. This example assumes the fields in your input file are whitespace-separated:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my %seen;

    while (my $line = <>) {
        my ($chromosome, $exon_start, $exon_end, $gene) = split /\s+/, $line;
        my $key = join ':', $chromosome, $exon_start, $exon_end;

        if ($seen{$key}) {
            next;
        }
        else {
            $seen{$key}++;
            print $line;
        }
    }
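Because this processes the file a line at a time, only the keys stay in memory, which scales well to hundreds of thousands of lines; run it as e.g. perl dedup.pl exons.txt > unique.txt (dedup.pl being whatever you name the script). The seen-test and the increment can also be fused into one postfix expression, a sketch of the same logic:

    use strict;
    use warnings;

    # the postfix ++ records the key as seen only after the first (false)
    # test, so the first occurrence of each exon prints and later ones don't
    my %seen;
    while (my $line = <>) {
        my ($chr, $start, $end) = split /\s+/, $line;
        print $line unless $seen{"$chr:$start:$end"}++;
    }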

Simple enough. I tried to use as little magic as possible.

    my %exoms = ();
    my $input;
    open( $input, '<', 'lines.in' ) or die $!;

    while ( <$input> ) {
        # ignore lines that are not in the expected format
        if ( $_ =~ /^(\w+\s+){3}(\w+)$/ ) {
            my @splits = split( /\s+/, $_ );    # split the line in $_ on whitespace
            my $key = $splits[1] . '_' . $splits[2];

            if ( !exists( $exoms{$key} ) ) {
                # could output or write to a new file here instead,
                # probably better for large sets
                $exoms{$key} = \@splits;
            }
        }
    }

    # demo to show what was parsed from the demo input
    while ( my ($key, $value) = each(%exoms) ) {
        my @splits = @{$value};
        foreach my $position (@splits) {
            print("$position ");
        }
        print("\n");
    }
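One caveat: the key here is built only from the two coordinates, so exons with identical coordinates on different chromosomes would wrongly collide. A safer key (a small tweak to the line above, assuming the same @splits layout) also includes column 0:

    # include the chromosome in the key so the same coordinates on
    # different chromosomes are not treated as duplicates
    my $key = join '_', @splits[0 .. 2];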
