Randomly delete duplicate rows with dplyr

As a follow-up to this question: Remove duplicate rows with dplyr, I have the following:

How do you randomly delete duplicate rows using dplyr (among other packages)?

My current command is:

data.uniques <- distinct(data, KEYVARIABLE, .keep_all = TRUE)

But it returns the first occurrence of each KEYVARIABLE. I want the kept row to be chosen at random, i.e. any one of the 1 to n cases of that KEYVARIABLE.

For instance:

KEYVARIABLE BMI
1 24.2
2 25.3
2 23.2
3 18.9
4 19
4 20.1
5 23.0

My command currently returns:

KEYVARIABLE BMI
1 24.2
2 25.3
3 18.9
4 19
5 23.0

I want it to randomly return one of the n duplicate rows for each KEYVARIABLE, for example:

KEYVARIABLE BMI
1 24.2
2 23.2
3 18.9
4 19
5 23.0
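
For reference, a reproducible construction of the example data; the answers below refer to it as df1 or df (my assumption, matching the table shown above):

df1 <- df <- data.frame(
   KEYVARIABLE = c(1, 2, 2, 3, 4, 4, 5),
   BMI = c(24.2, 25.3, 23.2, 18.9, 19.0, 20.1, 23.0)
)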
3 answers

, "KEYVARIABLE", sample

library(data.table)
setDT(df1)[, .SD[sample(.N)[1]], KEYVARIABLE]
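
An equivalent data.table form samples a single row index per group directly (a sketch, assuming df1 holds the example data above):

library(data.table)
# sample(.N, 1L) picks one random row index from 1:.N within each group
setDT(df1)[, .SD[sample(.N, 1L)], by = KEYVARIABLE]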

Or with dplyr:

library(dplyr)
df1 %>% 
   group_by(KEYVARIABLE) %>%
   sample_n(1)
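
In dplyr 1.0.0 and later, slice_sample() supersedes sample_n(); a minimal sketch of the same idea:

library(dplyr)
# one randomly chosen row per KEYVARIABLE group
df1 %>% 
   group_by(KEYVARIABLE) %>%
   slice_sample(n = 1)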

Shuffle the rows first, then keep the first occurrence of each KEYVARIABLE with distinct():

library(dplyr)
distinct(df[sample(1:nrow(df)), ], 
         KEYVARIABLE, 
         .keep_all = TRUE)
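
Because the rows are shuffled before distinct() keeps the first occurrence, the pick is random; setting a seed makes it reproducible (a sketch, assuming df holds the example data):

library(dplyr)
set.seed(123)  # any seed; fixes which duplicate row is kept
distinct(df[sample(nrow(df)), ], 
         KEYVARIABLE, 
         .keep_all = TRUE)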

Using dplyr: assign a random index A to each row, then keep the row with the smallest A within each KEYVARIABLE group (drop the helper column A afterwards if it is not wanted in the result).

df %>%
   dplyr::mutate(A = sample(1:nrow(df))) %>%
   group_by(KEYVARIABLE) %>%
   dplyr::slice(which.min(A))

Source: https://habr.com/ru/post/1684156/

