The fastest way to determine the most common factor in a grouped data frame in dplyr

I am trying to find the most frequent value within a group for several factor variables by summing a data frame in dplyr. I need a formula that does the following:

  • Find the most frequently used factor coefficient among all factors for one variable in the group (so basically "max ()" for calculating the coefficients).
  • If there is a connection between several of the most commonly used factors, select any of these levels of factors.
  • Returns the name of the factor level (not the number of samples).

There are several formulas. However, those that I could think of were slow. Quick methods are not always applicable to several variables in a data frame at the same time. I was wondering if anyone knows a quick method that goes well with dplyr.

I tried the following:

sample data generation (50,000 groups with 100 random letters)

z <- data.frame(a = rep(1:50000,100), b = sample(LETTERS, 5000000, replace = TRUE)) str(z) 'data.frame': 5000000 obs. of 2 variables: $ a: int 1 2 3 4 5 6 7 8 9 10 ... $ b: Factor w/ 26 levels "A","B","C","D",..: 6 4 14 12 3 19 17 19 15 20 ... 

Clean - Slow Approach 1

  y <- z %>% group_by(a) %>% summarise(c = names(table(b))[which.max(table(b))]) user system elapsed 26.772 2.011 29.568 

Clean - Slow Approach 2

 y <- z %>% group_by(a) %>% summarise(c = names(which(table(b) == max(table(b)))[1])) user system elapsed 29.329 2.029 32.361 

Clean - Slow Approach 3

 y <- z %>% group_by(a) %>% summarise(c = names(sort(table(b),decreasing = TRUE)[1])) user system elapsed 35.086 6.905 42.485 

Messy - A Quick Approach

 y <- z %>% group_by(a,b) %>% summarise(counter = n()) %>% group_by(a) %>% filter(counter == max(counter)) y <- y[!duplicated(y$a),] y <- y$counter <- NULL user system elapsed 7.061 0.330 7.664 
+5
source share
4 answers

Why dplyr?

 #dummy data set.seed(123) z <- data.frame(a = rep(1:50000,100), b = sample(LETTERS, 5000000, replace = TRUE)) #result names(sort(table(z$b),decreasing = TRUE)[1]) # [1] "S" #time it system.time( names(sort(table(z$b),decreasing = TRUE)[1]) ) # user system elapsed # 0.36 0.00 0.36 

EDIT: multiple columns

 #dummy data set.seed(123) z <- data.frame(a = rep(1:50000,100), b = sample(LETTERS, 5000000, replace = TRUE), c = sample(LETTERS, 5000000, replace = TRUE), d = sample(LETTERS, 5000000, replace = TRUE)) # check for multiple columns sapply(c("b","c","d"), function(i) names(sort(table(z[,i]),decreasing = TRUE)[1]) ) # bcd #"S" "N" "G" #time it system.time( sapply(c("b","c","d"), function(i) names(sort(table(z[,i]),decreasing = TRUE)[1])) ) # user system elapsed # 0.61 0.17 0.78 
+6
source

data.table is still the fastest choice for this:

 z <- data.frame(a = rep(1:50000,100), b = sample(LETTERS, 5000000, replace = TRUE)) 

Benchmarking:

 library(data.table) library(dplyr) #dplyr system.time({ y <- z %>% group_by(a) %>% summarise(c = names(which(table(b) == max(table(b)))[1])) }) user system elapsed 14.52 0.01 14.70 #data.table system.time( setDT(z)[, .N, by=b][order(N),][.N,] ) user system elapsed 0.05 0.02 0.06 #@zx8754 way - base R system.time( names(sort(table(z$b),decreasing = TRUE)[1]) ) user system elapsed 0.73 0.06 0.81 

As can be seen using data.table with this:

  setDT(z)[, .N, by=b][order(N),][.N,] 

or

  #just to get the name setDT(z)[, .N, by=b][order(N),][.N, b] 

seems the fastest

Update for all columns:

Using data @ zx8754

 set.seed(123) z2 <- data.frame(a = rep(1:50000,100), b = sample(LETTERS, 5000000, replace = TRUE), c = sample(LETTERS, 5000000, replace = TRUE), d = sample(LETTERS, 5000000, replace = TRUE)) 

You can do:

 #with data.table system.time( sapply(c('b','c','d'), function(x) { data.table(x = z2[[x]])[, .N, by=x][order(N),][.N, x] })) user system elapsed 0.34 0.00 0.34 #with base-R system.time( sapply(c("b","c","d"), function(i) names(sort(table(z2[,i]),decreasing = TRUE)[1])) ) user system elapsed 4.14 0.11 4.26 

And just to confirm the results, the same thing:

 sapply(c('b','c','d'), function(x) { data.table(x = z2[[x]])[, .N, by=x][order(N),][.N, x] }) bcd SNG sapply(c("b","c","d"), function(i) names(sort(table(z2[,i]),decreasing = TRUE)[1])) bcd "S" "N" "G" 
+6
source

Here is another option with dplyr :

 set.seed(123) z <- data.frame(a = rep(1:50000,100), b = sample(LETTERS, 5000000, replace = TRUE), stringsAsFactors = FALSE) a <- z %>% group_by(a, b) %>% summarise(c=n()) %>% filter(row_number(desc(c))==1) %>% .$bb <- z %>% group_by(a) %>% summarise(c=names(which(table(b) == max(table(b)))[1])) %>% .$c 

We make sure that these are equivalent approaches:

 > identical(a, b) #[1] TRUE 

Update

As pointed out by @docendodiscimus, you can also:

 count(z, a, b) %>% slice(which.max(n)) 

Here are the benchmark results:

 library(microbenchmark) mbm <- microbenchmark( steven = z %>% group_by(a, b) %>% summarise(c = n()) %>% filter(row_number(desc(c))==1), phil = z %>% group_by(a) %>% summarise(c = names(which(table(b) == max(table(b)))[1])), docendo = count(z, a, b) %>% slice(which.max(n)), times = 10 ) 

enter image description here

 #Unit: seconds # expr min lq mean median uq max neval cld # steven 4.752168 4.789564 4.815986 4.813686 4.847964 4.875109 10 b # phil 15.356051 15.378914 15.467534 15.458844 15.533385 15.606690 10 c # docendo 4.586096 4.611401 4.669375 4.688420 4.702352 4.753583 10 a 
+6
source

Following LyzandeR's suggestion, I will add another answer:

 require(data.table) setDT(z)[, .N, by=.(a,b)][order(-N), .(b=b[1L]), keyby=a] 
+4
source

Source: https://habr.com/ru/post/1232205/


All Articles