data.table is still the fastest choice for this:
z <- data.frame(a = rep(1:50000,100), b = sample(LETTERS, 5000000, replace = TRUE))
Benchmarking:
library(data.table)
library(dplyr)

# dplyr (note: this computes the most frequent b within each group of a)
system.time({
  y <- z %>%
    group_by(a) %>%
    summarise(c = names(which(table(b) == max(table(b)))[1]))
})
   user  system elapsed
  14.52    0.01   14.70

# data.table
system.time(
  setDT(z)[, .N, by = b][order(N), ][.N, ]
)
   user  system elapsed
   0.05    0.02    0.06

# @zx8754 way - base R
system.time(
  names(sort(table(z$b), decreasing = TRUE)[1])
)
   user  system elapsed
   0.73    0.06    0.81
As the timings show, using data.table like this:
setDT(z)[, .N, by=b][order(N),][.N,]
or
# just to get the name
setDT(z)[, .N, by=b][order(N),][.N, b]
seems to be the fastest.
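To see what each step of that chain does, here is a minimal sketch on a tiny table where the answer is obvious. The names `small`, `counts`, `sorted` and `top` are made up for illustration and are not part of the benchmark above.

```r
library(data.table)

# Hypothetical toy data: "B" occurs three times, more than any other value
small <- data.table(b = c("A", "B", "B", "C", "B", "A"))

counts <- small[, .N, by = b]  # one row per distinct value of b, with its count N
sorted <- counts[order(N)]     # ascending by count, so the most frequent value is last
top    <- sorted[.N, b]        # .N in i selects the last row; j = b returns just the value

top
# [1] "B"
```

If two values tie for the highest count, this chain returns whichever tied value ends up in the last row, which may not be the same one the `sort(table(...))` approach picks.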
Update for all columns:
Using @zx8754's data:
set.seed(123)
z2 <- data.frame(a = rep(1:50000, 100),
                 b = sample(LETTERS, 5000000, replace = TRUE),
                 c = sample(LETTERS, 5000000, replace = TRUE),
                 d = sample(LETTERS, 5000000, replace = TRUE))
You can do:
# with data.table
system.time(
  sapply(c('b','c','d'), function(x) {
    data.table(x = z2[[x]])[, .N, by=x][order(N),][.N, x]
  }))
   user  system elapsed
   0.34    0.00    0.34

# with base R
system.time(
  sapply(c("b","c","d"), function(i)
    names(sort(table(z2[,i]), decreasing = TRUE)[1]))
)
   user  system elapsed
   4.14    0.11    4.26
And just to confirm that both approaches return the same results:
sapply(c('b','c','d'), function(x) {
  data.table(x = z2[[x]])[, .N, by=x][order(N),][.N, x]
})
b c d
S N G

sapply(c("b","c","d"), function(i)
  names(sort(table(z2[,i]), decreasing = TRUE)[1]))
  b   c   d
"S" "N" "G"
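Rather than comparing the printed output by eye, the check can also be done programmatically. The following is a minimal sketch on a small, tie-free data frame; `toy`, `cols`, `dt_res` and `base_res` are hypothetical names, and `stringsAsFactors = FALSE` is set so that both approaches return plain character values.

```r
library(data.table)

# Hypothetical toy data with a unique most frequent value in each column
toy <- data.frame(
  b = rep(c("A", "B", "C"), times = c(1, 3, 2)),
  c = rep(c("X", "Y", "Z"), times = c(2, 1, 3)),
  stringsAsFactors = FALSE
)
cols <- c("b", "c")

# data.table approach, one column at a time
dt_res <- sapply(cols, function(x) {
  data.table(x = toy[[x]])[, .N, by = x][order(N)][.N, x]
})

# base R approach, one column at a time
base_res <- sapply(cols, function(i) {
  names(sort(table(toy[[i]]), decreasing = TRUE)[1])
})

identical(dt_res, base_res)
# [1] TRUE
```

On data with ties for the top count, `identical()` could return `FALSE` even though both answers are valid modes.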