The fastest way to find the most common factor level in a grouped data frame in dplyr

Question


I am trying to find the most frequent value within each group for several factor variables when summarising a data frame in dplyr. I need a formula that does the following:

Find the most frequent factor level for one variable within a group (so basically "max()" applied to the counts of the factor levels).
If several factor levels are tied for most frequent, pick any one of them.
Return the name of the factor level (not its count).
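The three requirements above can be sketched as a small base R helper. This is only an illustration under stated assumptions: the function name most_common_level is made up here and does not come from dplyr or any other package.

```r
# Hypothetical helper (name invented for illustration):
# returns the name of the most frequent level of a factor.
most_common_level <- function(x) {
  f <- as.factor(x)                          # also accept character vectors
  counts <- tabulate(f, nbins = nlevels(f))  # one count per level, in level order
  # which.max() returns the first maximum, so a tie is resolved
  # in favour of the level that comes first in levels(f).
  levels(f)[which.max(counts)]
}

clear_winner <- most_common_level(factor(c("B", "A", "B", "C")))
tie_case     <- most_common_level(factor(c("B", "A")))
print(clear_winner)
print(tie_case)
```

Because it returns a single character string, a helper like this could be called inside summarise() for each factor column, although it has not been benchmarked against the approaches on this page.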

Several formulas work. However, the ones I could think of are all slow, and the fast ones are awkward to apply to several variables in a data frame at once. Does anyone know a fast method that integrates well with dplyr?

I tried the following:

Sample data generation (50,000 groups with 100 random letters each)

    z <- data.frame(a = rep(1:50000,100), b = sample(LETTERS, 5000000, replace = TRUE))
    str(z)
    # 'data.frame': 5000000 obs. of 2 variables:
    #  $ a: int 1 2 3 4 5 6 7 8 9 10 ...
    #  $ b: Factor w/ 26 levels "A","B","C","D",..: 6 4 14 12 3 19 17 19 15 20 ...

Clean - Slow Approach 1

    y <- z %>%
      group_by(a) %>%
      summarise(c = names(table(b))[which.max(table(b))])
    #   user  system elapsed
    # 26.772   2.011  29.568

Clean - Slow Approach 2

    y <- z %>%
      group_by(a) %>%
      summarise(c = names(which(table(b) == max(table(b)))[1]))
    #   user  system elapsed
    # 29.329   2.029  32.361

Clean - Slow Approach 3

    y <- z %>%
      group_by(a) %>%
      summarise(c = names(sort(table(b), decreasing = TRUE)[1]))
    #   user  system elapsed
    # 35.086   6.905  42.485

Messy - A Quick Approach

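    # count each (a, b) pair, then keep the rows with the highest count per group
    y <- z %>%
      group_by(a, b) %>%
      summarise(counter = n()) %>%
      group_by(a) %>%
      filter(counter == max(counter))
    # on ties, keep only the first row per group
    y <- y[!duplicated(y$a), ]
    # drop the helper column
    y$counter <- NULL
    #  user  system elapsed
    # 7.061   0.330   7.664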


performance r dplyr

Phil 24 sept '15 at 16:20


4 answers

zx8754 · Answer 1 · 2015-09-24T16:24:43+0000

Why dplyr?

    # dummy data
    set.seed(123)
    z <- data.frame(a = rep(1:50000,100),
                    b = sample(LETTERS, 5000000, replace = TRUE))

    # result
    names(sort(table(z$b), decreasing = TRUE)[1])
    # [1] "S"

    # time it
    system.time(
      names(sort(table(z$b), decreasing = TRUE)[1])
    )
    # user  system elapsed
    # 0.36    0.00    0.36

EDIT: multiple columns

    # dummy data
    set.seed(123)
    z <- data.frame(a = rep(1:50000,100),
                    b = sample(LETTERS, 5000000, replace = TRUE),
                    c = sample(LETTERS, 5000000, replace = TRUE),
                    d = sample(LETTERS, 5000000, replace = TRUE))

    # check for multiple columns
    sapply(c("b","c","d"), function(i)
      names(sort(table(z[,i]), decreasing = TRUE)[1])
    )
    #   b   c   d
    # "S" "N" "G"

    # time it
    system.time(
      sapply(c("b","c","d"), function(i)
        names(sort(table(z[,i]), decreasing = TRUE)[1]))
    )
    # user  system elapsed
    # 0.61    0.17    0.78

LyzandeR · Answer 2 · 2015-09-24T16:37:09+0000

data.table is still the fastest choice for this:

 z <- data.frame(a = rep(1:50000,100), b = sample(LETTERS, 5000000, replace = TRUE))

Benchmarking:

    library(data.table)
    library(dplyr)

    # dplyr
    system.time({
      y <- z %>%
        group_by(a) %>%
        summarise(c = names(which(table(b) == max(table(b)))[1]))
    })
    #  user  system elapsed
    # 14.52    0.01   14.70

    # data.table
    system.time(
      setDT(z)[, .N, by=b][order(N),][.N,]
    )
    # user  system elapsed
    # 0.05    0.02    0.06

    # @zx8754 way - base R
    system.time(
      names(sort(table(z$b), decreasing = TRUE)[1])
    )
    # user  system elapsed
    # 0.73    0.06    0.81

As can be seen, using data.table like this:

  setDT(z)[, .N, by=b][order(N),][.N,]

or

    # just to get the name
    setDT(z)[, .N, by=b][order(N),][.N, b]

seems to be the fastest.

Update for all columns:

Using @zx8754's data:

    set.seed(123)
    z2 <- data.frame(a = rep(1:50000,100),
                     b = sample(LETTERS, 5000000, replace = TRUE),
                     c = sample(LETTERS, 5000000, replace = TRUE),
                     d = sample(LETTERS, 5000000, replace = TRUE))

You can do:

    # with data.table
    system.time(
      sapply(c('b','c','d'), function(x) {
        data.table(x = z2[[x]])[, .N, by=x][order(N),][.N, x]
      })
    )
    # user  system elapsed
    # 0.34    0.00    0.34

    # with base R
    system.time(
      sapply(c("b","c","d"), function(i)
        names(sort(table(z2[,i]), decreasing = TRUE)[1]))
    )
    # user  system elapsed
    # 4.14    0.11    4.26

And just to confirm that both give the same results:

    sapply(c('b','c','d'), function(x) {
      data.table(x = z2[[x]])[, .N, by=x][order(N),][.N, x]
    })
    # b c d
    # S N G

    sapply(c("b","c","d"), function(i)
      names(sort(table(z2[,i]), decreasing = TRUE)[1]))
    #   b   c   d
    # "S" "N" "G"

Steven Beaupré · Answer 3 · 2015-09-24T17:38:50+0000

Here is another option with dplyr :

    set.seed(123)
    z <- data.frame(a = rep(1:50000,100),
                    b = sample(LETTERS, 5000000, replace = TRUE),
                    stringsAsFactors = FALSE)

    a <- z %>%
      group_by(a, b) %>%
      summarise(c = n()) %>%
      filter(row_number(desc(c)) == 1) %>%
      .$b

    b <- z %>%
      group_by(a) %>%
      summarise(c = names(which(table(b) == max(table(b)))[1])) %>%
      .$c

We can check that the two approaches give identical results:

    > identical(a, b)
    # [1] TRUE

Update

As pointed out by @docendodiscimus, you can also do:

 count(z, a, b) %>% slice(which.max(n))
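A note on the one-liner above: it relies on the result of count() still being grouped by a, so that slice() picks one row per group. As far as I know, more recent dplyr versions keep only the grouping of the input in count(), which for an ungrouped z would leave a single row overall. A minimal sketch that makes the grouping explicit, using a tiny made-up data frame (small is not part of the original question):

```r
library(dplyr)

# Tiny made-up data: level "x" wins in group 1, level "z" wins in group 2
small <- data.frame(
  a = c(1, 1, 1, 2, 2, 2),
  b = c("x", "x", "y", "z", "y", "z"),
  stringsAsFactors = FALSE
)

top <- small %>%
  count(a, b) %>%          # one row per (a, b) pair, with the count in column n
  group_by(a) %>%          # group explicitly instead of relying on count()
  slice(which.max(n)) %>%  # first row with the highest count in each group
  ungroup()

print(top)
```

The explicit group_by(a) is cheap here because count() has already reduced the data to one row per (a, b) pair.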

Here are the benchmark results:

    library(microbenchmark)
    mbm <- microbenchmark(
      steven = z %>%
        group_by(a, b) %>%
        summarise(c = n()) %>%
        filter(row_number(desc(c)) == 1),
      phil = z %>%
        group_by(a) %>%
        summarise(c = names(which(table(b) == max(table(b)))[1])),
      docendo = count(z, a, b) %>%
        slice(which.max(n)),
      times = 10
    )

    # Unit: seconds
    #     expr       min        lq      mean    median        uq       max neval cld
    #   steven  4.752168  4.789564  4.815986  4.813686  4.847964  4.875109    10   b
    #     phil 15.356051 15.378914 15.467534 15.458844 15.533385 15.606690    10   c
    #  docendo  4.586096  4.611401  4.669375  4.688420  4.702352  4.753583    10   a

Arun · Answer 4 · 2015-09-24T23:08:31+0000

Following LyzandeR's suggestion, I will add another answer:

    require(data.table)
    setDT(z)[, .N, by=.(a,b)][order(-N), .(b=b[1L]), keyby=a]
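To make the chained expression above easier to follow, here is the same idea split into two steps on a tiny made-up table. The names small, counts and result are only for illustration; this is a sketch of how the expression reads, not a benchmark.

```r
library(data.table)

# Tiny made-up data: level "x" wins in group 1, level "z" wins in group 2
small <- data.table(
  a = c(1, 1, 1, 2, 2, 2),
  b = c("x", "x", "y", "z", "y", "z")
)

# Step 1: count the rows for every (a, b) pair; .N is the group size
counts <- small[, .N, by = .(a, b)]

# Step 2: sort by descending count, then take the first b within each a.
# keyby = a also sorts the result by a.
result <- counts[order(-N), .(b = b[1L]), keyby = a]

print(result)
```

On ties, whichever of the tied levels comes first after sorting is returned, which matches the "pick any one of them" requirement in the question.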
