Dplyr: The difference between unique and individual

It seems the number of resulting rows is different when using unique unique ones. The dataset I'm working with is huge. Hope the code is ok to understand.

dt2a <- select(dt, mutation.genome.position, 
  mutation.cds, primary.site, sample.name, mutation.id) %>%
  group_by(mutation.genome.position, mutation.cds, primary.site) %>% 
  mutate(occ = nrow(.)) %>%
  select(-sample.name) %>% distinct()
dim(dt2a)
[1] 2316382       5

## Using unique instead
dt2b <- select(dt, mutation.genome.position, mutation.cds, 
   primary.site, sample.name, mutation.id) %>%
  group_by(mutation.genome.position, mutation.cds, primary.site) %>%
  mutate(occ = nrow(.)) %>%
  select(-sample.name) %>% unique()
dim(dt2b)
[1] 2837982       5

This is the file I'm working with:

SFTP: //sftp-cancer.sanger.ac.uk/files/grch38/cosmic/v72/CosmicMutantExport.tsv.gz

     dt = fread(fl)
+2
source share
1 answer

This, apparently, is the result. group_byConsider this case.

dt<-data.frame(g=rep(c("a","b"), each=3),
    v=c(2,2,5,2,7,7))

dt %>% group_by(g) %>% unique()
# Source: local data frame [4 x 2]
# Groups: g
# 
#   g v
# 1 a 2
# 2 a 5
# 3 b 2
# 4 b 7

dt %>% group_by(g) %>% distinct()
# Source: local data frame [2 x 2]
# Groups: g
# 
#   g v
# 1 a 2
# 2 b 2

dt %>% group_by(g) %>% distinct(v)
# Source: local data frame [4 x 2]
# Groups: g
# 
#   g v
# 1 a 2
# 2 a 5
# 3 b 2
# 4 b 7

When you use distinct()without specifying which variables should be allocated, it appears to use a grouping variable.

+8
source

Source: https://habr.com/ru/post/1681815/


All Articles