Duplicates in a column: randomly keep one

I have a file (input.txt) with a structure like this:

 abc    1
 bcd    a
 cde    1
 def    4
 efg    a
 fgh    3

I want to remove duplicates in column 2 so that every value in that column appears only once (no matter what is in column 1). But the surviving row for each value must be chosen at random. The output could be, for example:

 bcd    a
 cde    1
 def    4
 fgh    3

I tried creating a file listing the duplicated keys (using awk '{print $2}' input.txt | sort | uniq -D | uniq), but with awk '!A[$2]++' I only ever keep the first occurrence of each value, not a randomly chosen one.
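For what it's worth, awk '!A[$2]++' on its own is deterministic: it prints a row only the first time its second field is seen, which is why it can never yield a random survivor. A minimal demonstration (sample data inlined):

```shell
# '!A[$2]++' keeps only the FIRST row for each column-2 value
printf 'abc 1\nbcd a\ncde 1\n' | awk '!A[$2]++'
# abc 1
# bcd a
```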

3 answers

Preprocess the input to randomize it:

shuf input.txt | awk '!A[$2]++'
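A quick sanity check of this approach (a sketch that recreates the sample input inline; shuf assumes GNU coreutils): whatever the shuffle order, the output should have exactly one row per distinct column-2 value.

```shell
# Recreate the sample input
printf 'abc    1\nbcd    a\ncde    1\ndef    4\nefg    a\nfgh    3\n' > input.txt

# Shuffle first so the row kept for each key is random, then dedupe on column 2
shuf input.txt | awk '!A[$2]++' > out.txt

wc -l < out.txt                             # 4 rows survive
awk '{print $2}' out.txt | sort -u | wc -l  # 4 distinct keys
```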

GNU awk:

$ awk '{a[$2][++cnt[$2]]=$0} END{srand(); for (k in a) print a[k][int(rand()*cnt[k])+1]}' file
 efg    a
 cde    1
 fgh    3
 def    4

Any awk:

$ awk '{keys[$2]; a[$2,++cnt[$2]]=$0} END{srand(); for (k in keys) print a[k,int(rand()*cnt[k])+1]}' file
 bcd    a
 abc    1
 fgh    3
 def    4
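One caveat worth knowing about both awk variants: srand() with no argument seeds from the current time of day, so two runs within the same second pick the same rows. For a reproducible selection you can pass an explicit seed (the seed variable and the value 42 here are illustrative, not part of the original answer):

```shell
# Portable awk version with a fixed seed: repeated runs give the same pick
awk -v seed=42 '
  {keys[$2]; a[$2,++cnt[$2]]=$0}
  END{srand(seed); for (k in keys) print a[k,int(rand()*cnt[k])+1]}
' input.txt
```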

Perl:

$ perl -MList::Util=shuffle -e 'print grep { !$seen{(split)[1]}++ } shuffle <>' input.txt
 def    4
 fgh    3
 bcd    a
 abc    1
  • -MList::Util=shuffle imports the shuffle function from the List::Util module
  • shuffle <> shuffles all input lines read from <>
  • grep { !$seen{(split)[1]}++ } keeps only the first line seen for each value of the second whitespace-separated field


Ruby:

$ ruby -e 'puts readlines.shuffle.uniq {|s| s.split[1]}' input.txt
 abc    1
 bcd    a
 fgh    3
 def    4
  • readlines reads all lines of the input file into an array
  • shuffle randomizes the order of the elements
  • uniq keeps one element per key
    • {|s| s.split[1]} keys on the second field, using whitespace as the separator
  • puts prints the array elements

Source: https://habr.com/ru/post/1694972/
