Delete rows with duplicate values in the last column

I have a tab-delimited file that looks like this:

 chr1 12226559 12227059 TNFRSF1B
 chr1 17051560 17052060
 chr1 17053279 17053779
 chr1 17338423 17338923 ATP13A2
                        ATP13A2
                        ATP13A2
 chr1 19577574 19578074 EMC1
                        MRTO4
 chr1 19578046 19578546 EMC1
                        MRTO4
 chr1 19638239 19638739 AKR7A2
                        PQLC2
                        PQLC2
                        PQLC2
                        AKR7A2
                        PQLC2

I want the rows where the value in column 4 is repeated to be deleted.

The first three columns are the coordinates, and the names found within those coordinates are listed in column 4; for each coordinate I want to have only the unique names, not duplicates of the names.

I need output like this:

 chr1 12226559 12227059 TNFRSF1B
 chr1 17051560 17052060
 chr1 17053279 17053779
 chr1 17338423 17338923 ATP13A2
 chr1 19577574 19578074 EMC1
                        MRTO4
 chr1 19578046 19578546 EMC1
                        MRTO4
 chr1 19638239 19638739 AKR7A2
                        PQLC2

Things I tried:

 sort -k 4 -u file
 awk '{if($4==temp1){next;}else{print}temp1=$4}' file

Nothing works :(

Please, help

thanks

+4
source share
7 answers

You just need:

 awk '$NF != prev {print} {prev=$NF}' 
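This prints a line only when its last field differs from the previous line's last field (the unconditional { prev = $NF } block runs for every line), so only adjacent repeats are removed; non-adjacent duplicates survive.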

EDIT: to handle the new input:

 awk '{
     if (NF == 1)
         value = $1
     else {
         key = $1 SUBSEP $2 SUBSEP $3
         value = $4
     }
     if ((key SUBSEP value) in val) next
     print
     val[key, value] = 1
 }' input
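As a sanity check (assuming the cleaned, tab-delimited file is named input, as in the command above): because the key is built from the first three columns, repeats are only suppressed within one coordinate block, so both EMC1/MRTO4 blocks survive and the script should reproduce the desired output:

 chr1 12226559 12227059 TNFRSF1B
 chr1 17051560 17052060
 chr1 17053279 17053779
 chr1 17338423 17338923 ATP13A2
 chr1 19577574 19578074 EMC1
                        MRTO4
 chr1 19578046 19578546 EMC1
                        MRTO4
 chr1 19638239 19638739 AKR7A2
                        PQLC2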
+4
source
 sed '1{x;d};H;x;s/\([ ][^\n ]*\)[ ]*\n[ ]*\1[ ]*\n/\1\n/;$p;x;d' FILE 

If your file contains tabs in addition to spaces, you can replace every [ ] with [[:space:]] .
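A sketch of the fully substituted command, with the negated class widened the same way (untested against tab-separated input):

 sed '1{x;d};H;x;s/\([[:space:]][^\n[:space:]]*\)[[:space:]]*\n[[:space:]]*\1[[:space:]]*\n/\1\n/;$p;x;d' FILE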

+2
source

Using a small perl script:

 perl -e 'my $col4 = ""; while (<>) { chomp; my @f = split(/\t/, $_); if ($f[3] eq "" || $f[3] ne $col4) { print $_, "\n"; } $col4 = $f[3]; }' input.txt 
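In other words: split each line on tabs, remember the previous column-4 value in $col4, and print a row whenever column 4 is empty or differs from the previous value — the same adjacent-duplicates logic as the awk answers.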

result:

 chr1 12226559 12227059 TNFRSF1B
 chr1 17051560 17052060
 chr1 17053279 17053779
 chr1 17338423 17338923 ATP13A2
 chr1 19577574 19578074 EMC1
                        MRTO4
 chr1 19578046 19578546 EMC1
                        MRTO4
 chr1 19638239 19638739 AKR7A2
                        PQLC2
+1
source

simple awk script

 awk -F'\t' '{OFS="\t"; if ($4=="" || $4!=old) print; old=$4}' input.txt 

Result

 chr1 12226559 12227059 TNFRSF1B
 chr1 17051560 17052060
 chr1 17053279 17053779
 chr1 17338423 17338923 ATP13A2
 chr1 19577574 19578074 EMC1
                        MRTO4
 chr1 19578046 19578546 EMC1
                        MRTO4
 chr1 19638239 19638739 AKR7A2
                        PQLC2

cleaning

To prepare the input.txt file, I copied the text from the question, but I had to replace the spaces with tabs, so I used a sed command for that. I also noticed some trailing spaces (at the ends of lines). In the end, I used the following sed command to clean up the input file:

 sed 's/ *$//;/^[^ ]/s/  */\t/g;/^ /s/  */\t\t\t/g;' copy-from-so.txt > input.txt 
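Broken out expression by expression (the same script; note that `  *`, a space followed by `space*`, matches one or more spaces):

 # s/ *$//               strip trailing spaces
 # /^[^ ]/s/  */\t/g     full rows: squeeze each run of spaces to a single tab
 # /^ /s/  */\t\t\t/g    indented, name-only rows: the leading spaces become
 #                       three tabs (three empty fields), so the name lands in column 4
 sed 's/ *$//;/^[^ ]/s/  */\t/g;/^ /s/  */\t\t\t/g;' copy-from-so.txt > input.txt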

input file from @dogbane comment

 chr1 12226559 12227059 TNFRSF1B
 chr1 17051560 17052060
 chr1 17053279 17053779
 chr1 17338423 17338923 ATP13A2
                        ATP13A2
                        ATP13A2
 chr1 19577574 19578074 EMC1
                        MRTO4
 chr1 19578046 19578546 EMC1
                        MRTO4
 chr1 19638239 19638739 AKR7A2
                        PQLC2
                        PQLC2
                        PQLC2
                        AKR7A2

(last line added)

cleaning and processing

 $> sed 's/ *$//;/^[^ ]/s/  */\t/g;/^ /s/  */\t\t\t/g;' copypaste.txt > input.txt
 $> awk -F'\t' '{OFS="\t"; if ($4=="" || $4!=old) print; old=$4}' input.txt
 chr1 12226559 12227059 TNFRSF1B
 chr1 17051560 17052060
 chr1 17053279 17053779
 chr1 17338423 17338923 ATP13A2
 chr1 19577574 19578074 EMC1
                        MRTO4
 chr1 19578046 19578546 EMC1
                        MRTO4
 chr1 19638239 19638739 AKR7A2
                        PQLC2
                        AKR7A2

change of requirements

The last line with AKR7A2 should not be printed, so we need to sort the input.txt file by column 4. Caution: the argument of the -t option is a literal tab; in bash or vi, press [CTRL-V] and then [TAB] (and put quotation marks around the tab).

 $> LANG=C sort -k 4 -s -t ' ' input.txt > sorted.txt
 $> awk -F'\t' '{OFS="\t"; if ($4=="" || $4!=old) print; old=$4}' sorted.txt
 chr1 17051560 17052060
 chr1 17053279 17053779
 chr1 19638239 19638739 AKR7A2
 chr1 17338423 17338923 ATP13A2
 chr1 19577574 19578074 EMC1
                        MRTO4
                        PQLC2
 chr1 12226559 12227059 TNFRSF1B

Note that there is now a single line ending in MRTO4!
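If you are in bash, ANSI-C quoting avoids the [CTRL-V] dance: $'\t' expands to a literal tab before sort sees it:

 $> LANG=C sort -k 4 -s -t $'\t' input.txt > sorted.txt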

+1
source

Perhaps the following will help:

 use strict;
 use warnings;

 my %seen;

 while (<DATA>) {
     my ($col3) = (split)[-1];
     print if !$seen{$col3}++ or !$col3;
 }

 __DATA__
 chr1 12226559 12227059 TNFRSF1B
 chr1 17051560 17052060
 chr1 17053279 17053779
 chr1 17338423 17338923 ATP13A2
                        ATP13A2
                        ATP13A2
 chr1 19577574 19578074 EMC1
                        MRTO4
 chr1 19578046 19578546 EMC1
                        MRTO4
 chr1 19638239 19638739 AKR7A2
                        PQLC2
                        PQLC2
                        PQLC2

Output:

 chr1 12226559 12227059 TNFRSF1B
 chr1 17051560 17052060
 chr1 17053279 17053779
 chr1 17338423 17338923 ATP13A2
 chr1 19577574 19578074 EMC1
                        MRTO4
 chr1 19638239 19638739 AKR7A2
                        PQLC2

This output can also be achieved with the following one-liner:

 perl -ane 'print if !$X{$F[-1]}++ or !$F[-1]' data.txt 
+1
source

Given the new input, I would use:

 gawk -F'\t' '!/^\t/{delete a} !a[$4]++' file 

I use gawk so that I can delete the entire array in one clear statement, while in other awks you have to use the less obvious:

 awk -F'\t' '!/^\t/{split("",a)} !a[$4]++' file 
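The split("",a) form works because splitting the empty string produces no fields, which leaves a empty; it is the traditional way to clear an array in awks that lack the whole-array delete extension.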
+1
source

If duplicate rows are duplicated in all columns, not just the fourth, uniq(1) might be appropriate. Try running just uniq file and see if the result is what you expect.
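A minimal sketch of that idea (uniq only collapses adjacent duplicates, so sort first when repeats are scattered):

 uniq file
 # or, if the duplicate rows are not adjacent:
 sort file | uniq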

0
source

Source: https://habr.com/ru/post/1446979/

