Delete rows with duplicate values in the last column

I have a tab-delimited file that looks like this:

 chr1 12226559 12227059 TNFRSF1B
 chr1 17051560 17052060
 chr1 17053279 17053779
 chr1 17338423 17338923 ATP13A2
                        ATP13A2
                        ATP13A2
 chr1 19577574 19578074 EMC1
                        MRTO4
 chr1 19578046 19578546 EMC1
                        MRTO4
 chr1 19638239 19638739 AKR7A2
                        PQLC2
                        PQLC2
                        PQLC2
                        AKR7A2
                        PQLC2

I want the rows where the value in column 4 is repeated to be deleted.

The first three columns are the coordinates, and the names found within those coordinates are listed in column 4; for each coordinate I want to have only the unique names, not duplicates of the names.

I need output like this:

 chr1 12226559 12227059 TNFRSF1B
 chr1 17051560 17052060
 chr1 17053279 17053779
 chr1 17338423 17338923 ATP13A2
 chr1 19577574 19578074 EMC1
                        MRTO4
 chr1 19578046 19578546 EMC1
                        MRTO4
 chr1 19638239 19638739 AKR7A2
                        PQLC2

Things I tried:

 sort -k 4 -u file
 awk '{if($4==temp1){next;}else{print}temp1=$4}' file

Nothing works :(

Please, help

thanks

+4
source share
7 answers

You just need:

 awk '$NF != prev {print} {prev=$NF}' 
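This prints a line only when its last field differs from the previous line's last field (the unconditional { prev = $NF } block runs for every line), so only adjacent repeats are removed; non-adjacent duplicates survive.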

EDIT: to handle the new input:

 awk '{
     if (NF == 1)
         value = $1
     else {
         key = $1 SUBSEP $2 SUBSEP $3
         value = $4
     }
     if ((key SUBSEP value) in val) next
     print
     val[key, value] = 1
 }' input
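As a sanity check (assuming the cleaned, tab-delimited file is named input, as in the command above): because the key is built from the first three columns, repeats are only suppressed within one coordinate block, so both EMC1/MRTO4 blocks survive and the script should reproduce the desired output:

 chr1 12226559 12227059 TNFRSF1B
 chr1 17051560 17052060
 chr1 17053279 17053779
 chr1 17338423 17338923 ATP13A2
 chr1 19577574 19578074 EMC1
                        MRTO4
 chr1 19578046 19578546 EMC1
                        MRTO4
 chr1 19638239 19638739 AKR7A2
                        PQLC2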
+4
source
 sed '1{x;d};H;x;s/\([ ][^\n ]*\)[ ]*\n[ ]*\1[ ]*\n/\1\n/;$p;x;d' FILE 

If your file contains tabs in addition to spaces, you can replace every [ ] with [[:space:]] .
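A sketch of the fully substituted command, with the negated class widened the same way (untested against tab-separated input):

 sed '1{x;d};H;x;s/\([[:space:]][^\n[:space:]]*\)[[:space:]]*\n[[:space:]]*\1[[:space:]]*\n/\1\n/;$p;x;d' FILE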

+2
source

Using a small perl script:

 perl -e 'my $col4 = ""; while (<>) { chomp; my @f = split(/\t/, $_); if ($f[3] eq "" || $f[3] ne $col4) { print $_, "\n"; } $col4 = $f[3]; }' input.txt 
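In other words: split each line on tabs, remember the previous column-4 value in $col4, and print a row whenever column 4 is empty or differs from the previous value — the same adjacent-duplicates logic as the awk answers.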

result:

 chr1 12226559 12227059 TNFRSF1B
 chr1 17051560 17052060
 chr1 17053279 17053779
 chr1 17338423 17338923 ATP13A2
 chr1 19577574 19578074 EMC1
                        MRTO4
 chr1 19578046 19578546 EMC1
                        MRTO4
 chr1 19638239 19638739 AKR7A2
                        PQLC2
+1
source

simple awk script

 awk -F'\t' '{OFS="\t"; if ($4=="" || $4!=old) print; old=$4}' input.txt 

Result

 chr1 12226559 12227059 TNFRSF1B
 chr1 17051560 17052060
 chr1 17053279 17053779
 chr1 17338423 17338923 ATP13A2
 chr1 19577574 19578074 EMC1
                        MRTO4
 chr1 19578046 19578546 EMC1
                        MRTO4
 chr1 19638239 19638739 AKR7A2
                        PQLC2

cleaning

To prepare the input.txt file, I copied the text from the question, but I had to replace the spaces with tabs, so I used a sed command for that. I also noticed some trailing spaces (at the ends of lines). In the end, I used the following sed command to clean up the input file:

 sed 's/ *$//;/^[^ ]/s/  */\t/g;/^ /s/  */\t\t\t/g;' copy-from-so.txt > input.txt 
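Broken out expression by expression (the same script; note that `  *`, a space followed by `space*`, matches one or more spaces):

 # s/ *$//               strip trailing spaces
 # /^[^ ]/s/  */\t/g     full rows: squeeze each run of spaces to a single tab
 # /^ /s/  */\t\t\t/g    indented, name-only rows: the leading spaces become
 #                       three tabs (three empty fields), so the name lands in column 4
 sed 's/ *$//;/^[^ ]/s/  */\t/g;/^ /s/  */\t\t\t/g;' copy-from-so.txt > input.txt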

input file from @dogbane comment

 chr1 12226559 12227059 TNFRSF1B
 chr1 17051560 17052060
 chr1 17053279 17053779
 chr1 17338423 17338923 ATP13A2
                        ATP13A2
                        ATP13A2
 chr1 19577574 19578074 EMC1
                        MRTO4
 chr1 19578046 19578546 EMC1
                        MRTO4
 chr1 19638239 19638739 AKR7A2
                        PQLC2
                        PQLC2
                        PQLC2
                        AKR7A2

(last line added)

cleaning and processing

 $> sed 's/ *$//;/^[^ ]/s/  */\t/g;/^ /s/  */\t\t\t/g;' copypaste.txt > input.txt
 $> awk -F'\t' '{OFS="\t"; if ($4=="" || $4!=old) print; old=$4}' input.txt
 chr1 12226559 12227059 TNFRSF1B
 chr1 17051560 17052060
 chr1 17053279 17053779
 chr1 17338423 17338923 ATP13A2
 chr1 19577574 19578074 EMC1
                        MRTO4
 chr1 19578046 19578546 EMC1
                        MRTO4
 chr1 19638239 19638739 AKR7A2
                        PQLC2
                        AKR7A2

change of requirements

The last line with AKR7A2 should not be printed, so we need to sort the input.txt file by column 4. Caution: the argument of the -t option is a literal tab; in bash or vi, press [CTRL-V] and then [TAB] (and put quotation marks around the tab).

 $> LANG=C sort -k 4 -s -t ' ' input.txt > sorted.txt
 $> awk -F'\t' '{OFS="\t"; if ($4=="" || $4!=old) print; old=$4}' sorted.txt
 chr1 17051560 17052060
 chr1 17053279 17053779
 chr1 19638239 19638739 AKR7A2
 chr1 17338423 17338923 ATP13A2
 chr1 19577574 19578074 EMC1
                        MRTO4
                        PQLC2
 chr1 12226559 12227059 TNFRSF1B

Note that there is now a single line ending in MRTO4!
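If you are in bash, ANSI-C quoting avoids the [CTRL-V] dance: $'\t' expands to a literal tab before sort sees it:

 $> LANG=C sort -k 4 -s -t $'\t' input.txt > sorted.txt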

+1
source

Perhaps the following will help:

 use strict;
 use warnings;

 my %seen;

 while (<DATA>) {
     my ($col3) = (split)[-1];
     print if !$seen{$col3}++ or !$col3;
 }

 __DATA__
 chr1 12226559 12227059 TNFRSF1B
 chr1 17051560 17052060
 chr1 17053279 17053779
 chr1 17338423 17338923 ATP13A2
                        ATP13A2
                        ATP13A2
 chr1 19577574 19578074 EMC1
                        MRTO4
 chr1 19578046 19578546 EMC1
                        MRTO4
 chr1 19638239 19638739 AKR7A2
                        PQLC2
                        PQLC2
                        PQLC2

Output:

 chr1 12226559 12227059 TNFRSF1B
 chr1 17051560 17052060
 chr1 17053279 17053779
 chr1 17338423 17338923 ATP13A2
 chr1 19577574 19578074 EMC1
                        MRTO4
 chr1 19638239 19638739 AKR7A2
                        PQLC2

This output can also be achieved with the following one-liner:

 perl -ane 'print if !$X{$F[-1]}++ or !$F[-1]' data.txt 
+1
source

Given the new input, I would use:

 gawk -F'\t' '!/^\t/{delete a} !a[$4]++' file 

I use gawk so that I can delete the entire array in one clear statement, while in other awks you have to use the less obvious:

 awk -F'\t' '!/^\t/{split("",a)} !a[$4]++' file 
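The split("",a) form works because splitting the empty string produces no fields, which leaves a empty; it is the traditional way to clear an array in awks that lack the whole-array delete extension.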
+1
source

If duplicate rows are duplicated in all columns, not just the fourth, uniq(1) might be appropriate. Try running just uniq file and see if the result is what you expect.
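A minimal sketch of that idea (uniq only collapses adjacent duplicates, so sort first when repeats are scattered):

 uniq file
 # or, if the duplicate rows are not adjacent:
 sort file | uniq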

0
source

Source: https://habr.com/ru/post/1446979/

