Efficient way to simultaneously get the number of unique values and total values for grouped values in dplyr

I am looking for an efficient way to produce a grouped summary table that contains:

  • The number of unique values for the group
  • A basic set of descriptive statistics for selected variables

For example, in the case of generating descriptive statistics, I use the following code:

    data("mtcars")
    require(dplyr)
    mt_sum <- mtcars %>%
      group_by(cyl) %>%
      summarise_each(funs(min, max), hp, wt, disp)

which will generate the desired result:

    > head(mt_sum)
    Source: local data frame [3 x 7]

        cyl hp_min wt_min disp_min hp_max wt_max disp_max
      (dbl)  (dbl)  (dbl)    (dbl)  (dbl)  (dbl)    (dbl)
    1     4     52  1.513     71.1    113  3.190    146.7
    2     6    105  2.620    145.0    175  3.460    258.0
    3     8    150  3.170    275.8    335  5.424    472.0

I would also like to add a column with the number of observations in each group. On its own, the count is easy to obtain:

    mt_sum2 <- mtcars %>%
      group_by(cyl) %>%
      summarise(countObs = n())

which will generate the required data:

    > head(mt_sum2)
    Source: local data frame [3 x 2]

        cyl countObs
      (dbl)    (int)
    1     4       11
    2     6        7
    3     8       14

Problem

The problem arises when I try to apply both transformations at once.

Attempt 1

For example, the code:

    mt_sum <- mtcars %>%
      group_by(cyl) %>%
      summarise_each(funs(min, max), hp, wt, disp) %>%
      summarise(countObs = n())

will generate:

    Source: local data frame [3 x 2]

        cyl countObs
      (dbl)    (int)
    1     4       11
    2     6        7
    3     8       14

without the descriptive statistics that were generated in the previous step.

Attempt 2

Code:

    mt_sum <- mtcars %>%
      group_by(cyl) %>%
      summarise_each(funs(min, max, n), hp, wt, disp)

fails, as expected:

Error: n does not take arguments

Attempt 3 (working)

Code:

    data("mtcars")
    require(dplyr)
    mt_sum <- mtcars %>%
      group_by(cyl) %>%
      summarise_each(funs(min, max), hp, wt, disp) %>%
      left_join(
        y = data.frame(
          "Var1" = as.numeric(as.character(as.data.frame(table(mtcars$cyl))$Var1)),
          "Count" = as.character(as.data.frame(table(mtcars$cyl))$Freq)
        ),
        by = c("cyl" = "Var1")
      )

will provide the required data:

    > head(mt_sum)
    Source: local data frame [3 x 8]

        cyl hp_min wt_min disp_min hp_max wt_max disp_max  Count
      (dbl)  (dbl)  (dbl)    (dbl)  (dbl)  (dbl)    (dbl) (fctr)
    1     4     52  1.513     71.1    113  3.190    146.7     11
    2     6    105  2.620    145.0    175  3.460    258.0      7
    3     8    150  3.170    275.8    335  5.424    472.0     14

I think this is an extremely inefficient way to build the summary. In particular, creating objects on the fly is wasteful when working with large tables. I would like to get the same result more efficiently, without creating objects solely for the purpose of merging. Ideally, the dplyr workflow would let me derive additional summaries from the grouped table at an earlier stage of the pipeline. For instance:

  • Group
  • Create descriptive statistics
  • Go back to the grouped data
  • Produce additional statistics and add them to the final result
1 answer

Here's another (shorter) option using left_join:

    mtcars %>%
      group_by(cyl) %>%
      summarise_each(funs(min, max), hp, wt, disp) %>%
      left_join(count(mtcars, cyl))
    #Joining by: "cyl"
    #Source: local data frame [3 x 8]
    #
    #    cyl hp_min wt_min disp_min hp_max wt_max disp_max     n
    #  (dbl)  (dbl)  (dbl)    (dbl)  (dbl)  (dbl)    (dbl) (int)
    #1     4     52  1.513     71.1    113  3.190    146.7    11
    #2     6    105  2.620    145.0    175  3.460    258.0     7
    #3     8    150  3.170    275.8    335  5.424    472.0    14
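For readers on a newer dplyr, the join may not be needed at all. The following is a minimal sketch, assuming dplyr 1.0.0 or later (where across() is available and summarise_each() and funs() are deprecated); it is not part of the original answer. The name countObs is taken from the question, and the resulting column order differs from the output above because across() groups the statistics by variable.

```r
library(dplyr)  # assumption: dplyr >= 1.0.0, which provides across()

mt_sum <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    # min and max of each selected column; result names follow "{.col}_{.fn}",
    # e.g. hp_min, hp_max, wt_min, ...
    across(c(hp, wt, disp), list(min = min, max = max)),
    # n() is evaluated once per group, so the count is added in the same pass
    countObs = n()
  )

print(mt_sum)
```

Because n() sits next to across() inside a single summarise() call, the grouped data is summarised once and no intermediate table has to be built for merging. A count of distinct values, for example n_distinct(hp), could be added to the same call in the same way.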

Source: https://habr.com/ru/post/1237575/
