Dplyr: The difference between unique and individual

Question

Dplyr: The difference between unique and individual

It seems the number of resulting rows is different when using unique unique ones. The dataset I'm working with is huge. Hope the code is ok to understand.

dt2a <- select(dt, mutation.genome.position, 
  mutation.cds, primary.site, sample.name, mutation.id) %>%
  group_by(mutation.genome.position, mutation.cds, primary.site) %>% 
  mutate(occ = nrow(.)) %>%
  select(-sample.name) %>% distinct()
dim(dt2a)
[1] 2316382       5

## Using unique instead
dt2b <- select(dt, mutation.genome.position, mutation.cds, 
   primary.site, sample.name, mutation.id) %>%
  group_by(mutation.genome.position, mutation.cds, primary.site) %>%
  mutate(occ = nrow(.)) %>%
  select(-sample.name) %>% unique()
dim(dt2b)
[1] 2837982       5

This is the file I'm working with:

SFTP: //sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v72/CosmicMutantExport.tsv.gz

     dt = fread(fl)

+2

r data.table dplyr

Sahil seth May 22, '15 at 16:01

source share

1 answer

Mrflick · Accepted Answer · 2015-05-22T18:27:24+0000

This, apparently, is the result. group_byConsider this case.

dt<-data.frame(g=rep(c("a","b"), each=3),
    v=c(2,2,5,2,7,7))

dt %>% group_by(g) %>% unique()
# Source: local data frame [4 x 2]
# Groups: g
# 
#   g v
# 1 a 2
# 2 a 5
# 3 b 2
# 4 b 7

dt %>% group_by(g) %>% distinct()
# Source: local data frame [2 x 2]
# Groups: g
# 
#   g v
# 1 a 2
# 2 b 2

dt %>% group_by(g) %>% distinct(v)
# Source: local data frame [4 x 2]
# Groups: g
# 
#   g v
# 1 a 2
# 2 a 5
# 3 b 2
# 4 b 7

When you use distinct()without specifying which variables should be allocated, it appears to use a grouping variable.

Dplyr: The difference between unique and individual

More articles: