Duplicates in a column: randomly keep one

I have a file (input.txt) with a structure like this:

 abc    1
 bcd    a
 cde    1
 def    4
 efg    a
 fgh    3

I want to remove duplicates in column 2 so that every value in that column appears only once (no matter what is in column 1). But the surviving row for each value must be chosen at random. The output could be, for example:

 bcd    a
 cde    1
 def    4
 fgh    3

I tried creating a file listing the duplicated keys (using awk '{print $2}' input.txt | sort | uniq -D | uniq), but with awk '!A[$2]++' I only ever keep the first occurrence of each value, not a randomly chosen one.
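For what it's worth, awk '!A[$2]++' on its own is deterministic: it prints a row only the first time its second field is seen, which is why it can never yield a random survivor. A minimal demonstration (sample data inlined):

```shell
# '!A[$2]++' keeps only the FIRST row for each column-2 value
printf 'abc 1\nbcd a\ncde 1\n' | awk '!A[$2]++'
# abc 1
# bcd a
```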

3 answers

Preprocess the input to randomize it:

shuf input.txt | awk '!A[$2]++'
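A quick sanity check of this approach (a sketch that recreates the sample input inline; shuf assumes GNU coreutils): whatever the shuffle order, the output should have exactly one row per distinct column-2 value.

```shell
# Recreate the sample input
printf 'abc    1\nbcd    a\ncde    1\ndef    4\nefg    a\nfgh    3\n' > input.txt

# Shuffle first so the row kept for each key is random, then dedupe on column 2
shuf input.txt | awk '!A[$2]++' > out.txt

wc -l < out.txt                             # 4 rows survive
awk '{print $2}' out.txt | sort -u | wc -l  # 4 distinct keys
```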

GNU awk:

$ awk '{a[$2][++cnt[$2]]=$0} END{srand(); for (k in a) print a[k][int(rand()*cnt[k])+1]}' file
 efg    a
 cde    1
 fgh    3
 def    4

Any awk:

$ awk '{keys[$2]; a[$2,++cnt[$2]]=$0} END{srand(); for (k in keys) print a[k,int(rand()*cnt[k])+1]}' file
 bcd    a
 abc    1
 fgh    3
 def    4
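One caveat worth knowing about both awk variants: srand() with no argument seeds from the current time of day, so two runs within the same second pick the same rows. For a reproducible selection you can pass an explicit seed (the seed variable and the value 42 here are illustrative, not part of the original answer):

```shell
# Portable awk version with a fixed seed: repeated runs give the same pick
awk -v seed=42 '
  {keys[$2]; a[$2,++cnt[$2]]=$0}
  END{srand(seed); for (k in keys) print a[k,int(rand()*cnt[k])+1]}
' input.txt
```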

Perl:

$ perl -MList::Util=shuffle -e 'print grep { !$seen{(split)[1]}++ } shuffle <>' input.txt
 def    4
 fgh    3
 bcd    a
 abc    1
  • -MList::Util=shuffle imports the shuffle function from the List::Util module
  • shuffle <> shuffles all input lines read from <>
  • grep { !$seen{(split)[1]}++ } keeps only the first line seen for each value of the second whitespace-separated field


Ruby:

$ ruby -e 'puts readlines.shuffle.uniq {|s| s.split[1]}' input.txt
 abc    1
 bcd    a
 fgh    3
 def    4
  • readlines reads all lines of the input file into an array
  • shuffle randomizes the order of the elements
  • uniq keeps one element per key
    • {|s| s.split[1]} keys on the second field, using whitespace as the separator
  • puts prints the array elements

Source: https://habr.com/ru/post/1694972/
