I am looking for an efficient way to produce a summary of a grouped table that contains:
- The number of unique values for the group
- A basic set of descriptive statistics for selected variables
For example, to generate the descriptive statistics I use the following code:
    data("mtcars")
    require(dplyr)
    mt_sum <- mtcars %>%
      group_by(cyl) %>%
      summarise_each(funs(min, max), hp, wt, disp)
which will generate the desired result:
    > head(mt_sum)
    Source: local data frame [3 x 7]

        cyl hp_min wt_min disp_min hp_max wt_max disp_max
      (dbl)  (dbl)  (dbl)    (dbl)  (dbl)  (dbl)    (dbl)
    1     4     52  1.513     71.1    113  3.190    146.7
    2     6    105  2.620    145.0    175  3.460    258.0
    3     8    150  3.170    275.8    335  5.424    472.0
I would like to enrich this data with a column that gives the number of values in each group. The count on its own is easy to obtain:
    mt_sum2 <- mtcars %>%
      group_by(cyl) %>%
      summarise(countObs = n())
which will generate the required data:
    > head(mt_sum2)
    Source: local data frame [3 x 2]

        cyl countObs
      (dbl)    (int)
    1     4       11
    2     6        7
    3     8       14
Problem
The problem arises when I try to apply both transformations at once.
Attempt 1
For example, the code:
    mt_sum <- mtcars %>%
      group_by(cyl) %>%
      summarise_each(funs(min, max), hp, wt, disp) %>%
      summarise(countObs = n())
will generate:
    Source: local data frame [3 x 2]

        cyl countObs
      (dbl)    (int)
    1     4       11
    2     6        7
    3     8       14

without the descriptive statistics that were generated previously.
Attempt 2
Code:
    mt_sum <- mtcars %>%
      group_by(cyl) %>%
      summarise_each(funs(min, max, n), hp, wt, disp)

fails, as expected:
Error: n does not take arguments
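One reading of this error is that every function listed is called with the selected column as its argument, while n() accepts none. As a rough sketch only, and assuming dplyr 1.0.0 or later (where across() replaces summarise_each()), length can stand in for n() as a per-column function. The name mt_len is hypothetical and not part of the original code:

```r
library(dplyr)

# Sketch, assuming dplyr >= 1.0.0: length() takes the column as an argument,
# so it can sit alongside min and max where n() cannot.
mt_len <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(hp, wt, disp), list(min = min, max = max, n = length)))

# Columns are named <variable>_<function>, e.g. hp_min, hp_max, hp_n.
print(mt_len)
```

This yields one count column per variable (hp_n, wt_n, disp_n), all carrying the same group size, so it is redundant when a single count column is wanted.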
Attempt 3 (working)
Code:
    data("mtcars")
    require(dplyr)
    mt_sum <- mtcars %>%
      group_by(cyl) %>%
      summarise_each(funs(min, max), hp, wt, disp) %>%
      left_join(y = data.frame(
                  "Var1" = as.numeric(as.character(as.data.frame(table(mtcars$cyl))$Var1)),
                  "Count" = as.character(as.data.frame(table(mtcars$cyl))$Freq)),
                by = c("cyl" = "Var1"))
will provide the required data:
    > head(mt_sum)
    Source: local data frame [3 x 8]

        cyl hp_min wt_min disp_min hp_max wt_max disp_max  Count
      (dbl)  (dbl)  (dbl)    (dbl)  (dbl)  (dbl)    (dbl) (fctr)
    1     4     52  1.513     71.1    113  3.190    146.7     11
    2     6    105  2.620    145.0    175  3.460    258.0      7
    3     8    150  3.170    275.8    335  5.424    472.0     14
I think this is an extremely inefficient way to build this summary. In particular, creating objects on the fly is costly when working with large tables. I would like to get the same result more efficiently, without creating objects solely for the purpose of merging. Ideally, the dplyr workflow would let me derive additional summaries from an earlier stage of the pipeline. For instance:
- Group the data
- Compute the descriptive statistics
- Go back to the grouped data
- Compute additional statistics and add them to the final result
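The steps above might be collapsed into a single summarise() call, so that no return to the grouped data and no join is needed. This is a minimal sketch rather than a definitive answer. It assumes dplyr 1.0.0 or later, where across() can be mixed with other summary expressions, and the name mt_full is hypothetical:

```r
library(dplyr)

# Sketch, assuming dplyr >= 1.0.0: n() and the per-column statistics are
# evaluated against the same grouped data in one summarise() call.
mt_full <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    countObs = n(),                                       # group size
    across(c(hp, wt, disp), list(min = min, max = max))   # hp_min, hp_max, ...
  )

print(mt_full)
```

Because both the count and the statistics come from one pass over the groups, no intermediate data frame is built for merging, and countObs stays an integer rather than becoming a factor.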