Efficient way to simultaneously get the number of unique values and total values for grouped values in dplyr

Question

I am looking for an efficient way to get a summary of a grouped table that contains:

The number of values in each group
A basic set of descriptive statistics for selected variables

For example, to generate the descriptive statistics I use the following code:

    data("mtcars")
    require(dplyr)
    mt_sum <- mtcars %>%
      group_by(cyl) %>%
      summarise_each(funs(min, max), hp, wt, disp)

which produces the desired result:

    > head(mt_sum)
    Source: local data frame [3 x 7]

        cyl hp_min wt_min disp_min hp_max wt_max disp_max
      (dbl)  (dbl)  (dbl)    (dbl)  (dbl)  (dbl)    (dbl)
    1     4     52  1.513     71.1    113  3.190    146.7
    2     6    105  2.620    145.0    175  3.460    258.0
    3     8    150  3.170    275.8    335  5.424    472.0

I would like to enrich this data with a column holding the number of observations in each group. The count on its own is easy to get:

    mt_sum2 <- mtcars %>%
      group_by(cyl) %>%
      summarise(countObs = n())

which produces the required data:

    > head(mt_sum2)
    Source: local data frame [3 x 2]

        cyl countObs
      (dbl)    (int)
    1     4       11
    2     6        7
    3     8       14

Problem

The problem arises when I try to apply both transformations at once.

Attempt 1

For example, the code:

    mt_sum <- mtcars %>%
      group_by(cyl) %>%
      summarise_each(funs(min, max), hp, wt, disp) %>%
      summarise(countObs = n())

produces:

    Source: local data frame [3 x 2]

        cyl countObs
      (dbl)    (int)
    1     4       11
    2     6        7
    3     8       14

without the descriptive statistics that were generated in the previous step.

Attempt 2

Code:

    mt_sum <- mtcars %>%
      group_by(cyl) %>%
      summarise_each(funs(min, max, n), hp, wt, disp)

fails, as expected:

    Error: n does not take arguments

Attempt 3 (working)

Code:

    data("mtcars")
    require(dplyr)
    mt_sum <- mtcars %>%
      group_by(cyl) %>%
      summarise_each(funs(min, max), hp, wt, disp) %>%
      left_join(
        y = data.frame(
          "Var1" = as.numeric(as.character(as.data.frame(table(mtcars$cyl))$Var1)),
          "Count" = as.character(as.data.frame(table(mtcars$cyl))$Freq)
        ),
        by = c("cyl" = "Var1")
      )

will provide the required data:

    > head(mt_sum)
    Source: local data frame [3 x 8]

        cyl hp_min wt_min disp_min hp_max wt_max disp_max  Count
      (dbl)  (dbl)  (dbl)    (dbl)  (dbl)  (dbl)    (dbl) (fctr)
    1     4     52  1.513     71.1    113  3.190    146.7     11
    2     6    105  2.620    145.0    175  3.460    258.0      7
    3     8    150  3.170    275.8    335  5.424    472.0     14

I think this is an extremely inefficient way to create this summary. In particular, creating objects on the fly is wasteful when working with large tables. I would like to achieve the same result more efficiently, without creating objects that exist only to be merged. Ideally, the dplyr workflow would let me derive additional summaries from the grouped table before it was summarised. For instance:

Group the data
Create the descriptive statistics
Go back to the grouped data
Produce additional statistics and add them to the final data

r aggregate dataframe dplyr

Konrad Dec 7 '15 at 12:46

1 answer

docendo discimus · Accepted Answer · 2015-12-07T13:03:48+0000

Here's another (shorter) option using left_join:

    mtcars %>%
      group_by(cyl) %>%
      summarise_each(funs(min, max), hp, wt, disp) %>%
      left_join(count(mtcars, cyl))
    #Joining by: "cyl"
    #Source: local data frame [3 x 8]
    #
    #    cyl hp_min wt_min disp_min hp_max wt_max disp_max     n
    #  (dbl)  (dbl)  (dbl)    (dbl)  (dbl)  (dbl)    (dbl) (int)
    #1     4     52  1.513     71.1    113  3.190    146.7    11
    #2     6    105  2.620    145.0    175  3.460    258.0     7
    #3     8    150  3.170    275.8    335  5.424    472.0    14
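For readers on a newer dplyr: summarise_each() and funs() were deprecated in later releases, and the join can be avoided altogether. The following is only a sketch, assuming dplyr 1.0.0 or later (where across() is available); it is not part of the original answer. The column names n and hp_unique are illustrative choices, and the column order differs from the summarise_each() output above.

```r
library(dplyr)

mt_sum <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    # min and max for each selected column; names become hp_min, hp_max, ...
    across(c(hp, wt, disp), list(min = min, max = max)),
    # number of rows in each group, computed in the same call (no join needed)
    n = n(),
    # illustrative extra: number of distinct hp values per group
    hp_unique = n_distinct(hp)
  )

mt_sum
```

Because n() is evaluated inside the same summarise() call, the counts are computed while the data is still grouped, which is what the question's "go back to the grouped data" step was after.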
