I use Perl to create a list of unique exons (which are units of genes).
I created a file in this format (with hundreds of thousands of lines):
chr1 1000 2000 gene1
chr1 3000 4000 gene2
chr1 5000 6000 gene3
chr1 1000 2000 gene4
Column 1 is the chromosome, column 2 is the start coordinate of the exon, column 3 is the end coordinate of the exon, and column 4 is the gene name.
Since genes are often built from different exon arrangements, the same exon can appear in several genes (see the first and fourth lines). I want to remove these "duplicates", i.e. remove gene1 or gene4 (it doesn't matter which one).
I've been banging my head against the wall for hours trying to do what (I think) is a simple task. Can someone point me in the right direction? I know that people often use hashes to remove duplicate elements, but these are not exact duplicates (since the gene names differ). It is also important that I don't lose the gene name. Otherwise, it would be easier.
Here is the complete non-working loop I tried. Each line is stored as a scalar in the @exons array, hence the subroutine. Don't laugh. I know this doesn't work, but at least you can see (I hope) what I'm trying to do:
for (my $i = 0; $i < scalar @exons; $i++) {
    my @temp_line = line_splitter($exons[$i]);
}
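For reference, the hash idea mentioned above could be sketched roughly like this, assuming the fields are whitespace-separated and that two lines count as the same exon when chromosome, start, and end all match. The key is built from the first three columns only, so the gene name never takes part in the comparison but stays attached to whichever line is kept. The `unique_exons` helper and the `chr:start:end` key format are made up for illustration, not taken from the loop above.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the first line seen for each (chr, start, end).
sub unique_exons {
    my @lines = @_;
    my %seen;      # key: "chr:start:end", value: how many times it was seen
    my @unique;
    for my $line (@lines) {
        my ($chr, $start, $end, $gene) = split ' ', $line;
        next unless defined $gene;     # skip blank or malformed lines
        my $key = join ':', $chr, $start, $end;
        next if $seen{$key}++;         # coordinates already seen: drop this line
        push @unique, $line;           # first occurrence: keep it, gene name intact
    }
    return @unique;
}

my @exons = (
    'chr1 1000 2000 gene1',
    'chr1 3000 4000 gene2',
    'chr1 5000 6000 gene3',
    'chr1 1000 2000 gene4',
);
print "$_\n" for unique_exons(@exons);
```

With the four sample lines this keeps gene1, gene2, and gene3 and drops gene4, since gene4 shares its coordinates with gene1. The original line order is preserved, and for a few hundred thousand lines a single pass with one hash should be fine in memory.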