Dplyr summarize: create variables from a named vector

I use a function that returns a named vector. Here is a toy example:

    toy_fn <- function(x) {
      y <- c(mean(x), sum(x), median(x), sd(x))
      names(y) <- c("Right", "Wrong", "Unanswered", "Invalid")
      y
    }

I use group_by in dplyr to apply this function to each group (the typical split-apply-combine pattern). Here is my toy data.frame:

    set.seed(1234567)
    toy_df <- data.frame(id = 1:1000,
                         group = sample(letters, 1000, replace = TRUE),
                         value = runif(1000))

And here is the result I was aiming for:

    library(dplyr)

    toy_summary <- toy_df %>%
      group_by(group) %>%
      summarize(Right      = toy_fn(value)["Right"],
                Wrong      = toy_fn(value)["Wrong"],
                Unanswered = toy_fn(value)["Unanswered"],
                Invalid    = toy_fn(value)["Invalid"])

    > toy_summary
    Source: local data frame [26 x 5]

      group     Right    Wrong Unanswered   Invalid
    1     a 0.5038394 20.15358  0.5905526 0.2846468
    2     b 0.5048040 15.64892  0.5163702 0.2994544
    3     c 0.5029442 21.62660  0.5072733 0.2465612
    4     d 0.5124601 14.86134  0.5382463 0.2681955
    5     e 0.4649483 17.66804  0.4426197 0.3075080
    6     f 0.5622644 12.36982  0.6330269 0.2850609
    7     g 0.4675324 14.96104  0.4692404 0.2746589

It works! But calling the same function four times is not great. I would prefer that dplyr take the named vector and create a new variable for each of its elements. Something like this:

    toy_summary <- toy_df %>%
      group_by(group) %>%
      summarize(toy_fn(value))

Unfortunately, this does not work; it fails with "Error: one value expected."

I thought I could simply convert the vector to a data.frame using data.frame(as.list(x)), but that doesn't work either. I tried many things, but I couldn't get dplyr to treat the result as one value (observation) for each of 4 different variables. Is there any way to make dplyr understand this?

5 answers

You can also try this with do():

    toy_df %>%
      group_by(group) %>%
      do(res = toy_fn(.$value))
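The do() call above stores each group's named vector in a list column (res). If one column per element is wanted instead, a minimal sketch is to hand do() an unnamed data frame expression, which it row-binds across groups. The name toy_wide is chosen here only for illustration, and this assumes a dplyr version in which do() is still available:

```r
library(dplyr)

# Toy function and data, repeated from the question so the sketch runs on its own.
toy_fn <- function(x) {
  y <- c(mean(x), sum(x), median(x), sd(x))
  names(y) <- c("Right", "Wrong", "Unanswered", "Invalid")
  y
}

set.seed(1234567)
toy_df <- data.frame(id = 1:1000,
                     group = sample(letters, 1000, replace = TRUE),
                     value = runif(1000))

# as.list() turns the named vector into a named list, and as.data.frame()
# turns that into a one-row data frame; do() binds one such row per group.
toy_wide <- toy_df %>%
  group_by(group) %>%
  do(as.data.frame(as.list(toy_fn(.$value)))) %>%
  ungroup()

head(toy_wide)
```

The function is evaluated only once per group, which is what the question asks for.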

One possible solution is to use dplyr's standard evaluation (SE) interface. For example, define the summary expressions as follows:

    dots <- setNames(list(~ mean(value), ~ sum(value), ~ median(value), ~ sd(value)),
                     c("Right", "Wrong", "Unanswered", "Invalid"))

Then you can use summarize_ (with a trailing _) as follows:

    toy_df %>%
      group_by(group) %>%
      summarize_(.dots = dots)
    # Source: local data table [26 x 5]
    #
    #    group     Right    Wrong Unanswered   Invalid
    # 1      o 0.4490776 17.51403  0.4012057 0.2749956
    # 2      s 0.5079569 15.23871  0.4663852 0.2555774
    # 3      x 0.4620649 14.78608  0.4475117 0.2894502
    # 4      a 0.5038394 20.15358  0.5905526 0.2846468
    # 5      t 0.5041168 24.19761  0.5330790 0.3171022
    # 6      m 0.4806628 21.14917  0.4805273 0.2825026
    # 7      c 0.5029442 21.62660  0.5072733 0.2465612
    # 8      w 0.4932484 17.75694  0.4891746 0.3309680
    # 9      q 0.5350707 22.47297  0.5608505 0.2749941
    # 10     g 0.4675324 14.96104  0.4692404 0.2746589
    # ..   ...       ...      ...        ...       ...

Although it looks good, there is a big catch: you must know a priori which column you are going to operate on (value) when setting up the expressions, so it will not work on any other column name unless you adjust dots accordingly.
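One way around the hard-coded column name, sketched here under the assumption of a recent dplyr release that provides across() (this is not part of the SE interface used above), is to pass the column name as a string. The helper summarise_col, the copy toy_df2 and the column name score are hypothetical names introduced only for this illustration:

```r
library(dplyr)

set.seed(1234567)
toy_df <- data.frame(id = 1:1000,
                     group = sample(letters, 1000, replace = TRUE),
                     value = runif(1000))

# Hypothetical helper: `col` is the name of the column to summarise, as a string.
# .names = "{.fn}" names each output column after the function's list name only.
summarise_col <- function(df, col) {
  df %>%
    group_by(group) %>%
    summarise(across(all_of(col),
                     list(Right = mean, Wrong = sum,
                          Unanswered = median, Invalid = sd),
                     .names = "{.fn}"))
}

by_value <- summarise_col(toy_df, "value")

# The same helper applied to a differently named column.
toy_df2  <- rename(toy_df, score = value)
by_score <- summarise_col(toy_df2, "score")
```

Here the four summaries are listed individually rather than coming from toy_fn, so this mirrors the dots approach rather than the original function.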


As a bonus, here is a simple data.table solution that uses your original function:

    library(data.table)
    setDT(toy_df)[, as.list(toy_fn(value)), by = group]
    #     group     Right    Wrong Unanswered   Invalid
    #  1:     o 0.4490776 17.51403  0.4012057 0.2749956
    #  2:     s 0.5079569 15.23871  0.4663852 0.2555774
    #  3:     x 0.4620649 14.78608  0.4475117 0.2894502
    #  4:     a 0.5038394 20.15358  0.5905526 0.2846468
    #  5:     t 0.5041168 24.19761  0.5330790 0.3171022
    #  6:     m 0.4806628 21.14917  0.4805273 0.2825026
    #  7:     c 0.5029442 21.62660  0.5072733 0.2465612
    #  8:     w 0.4932484 17.75694  0.4891746 0.3309680
    #  9:     q 0.5350707 22.47297  0.5608505 0.2749941
    # 10:     g 0.4675324 14.96104  0.4692404 0.2746589
    # ...

This is not a dplyr solution, but if you like pipes:

    library(magrittr)

    toy_summary <- toy_df %>%
      split(.$group) %>%
      lapply(function(x) toy_fn(x$value)) %>%
      do.call(rbind, .)

    # > head(toy_summary)
    #       Right    Wrong Unanswered   Invalid
    # a 0.5038394 20.15358  0.5905526 0.2846468
    # b 0.5048040 15.64892  0.5163702 0.2994544
    # c 0.5029442 21.62660  0.5072733 0.2465612
    # d 0.5124601 14.86134  0.5382463 0.2681955
    # e 0.4649483 17.66804  0.4426197 0.3075080
    # f 0.5622644 12.36982  0.6330269 0.2850609

There seems to be a problem when using median (I am not sure what is going on there), but apart from that you can usually apply several functions with summarise_each, along the lines of the following. Note that you can set the names of the resulting columns by passing a named vector to funs_():

    x <- c(Right = "mean", Wrong = "sd", Unanswered = "sum")

    toy_df %>%
      group_by(group) %>%
      summarise_each(funs_(x), value)
    # Source: local data frame [26 x 4]
    #
    #    group     Right     Wrong Unanswered
    # 1      a 0.5038394 0.2846468   20.15358
    # 2      b 0.5048040 0.2994544   15.64892
    # 3      c 0.5029442 0.2465612   21.62660
    # 4      d 0.5124601 0.2681955   14.86134
    # 5      e 0.4649483 0.3075080   17.66804
    # 6      f 0.5622644 0.2850609   12.36982
    # 7      g 0.4675324 0.2746589   14.96104
    # 8      h 0.4921506 0.2879830   21.16248
    # 9      i 0.5443600 0.2945428   22.31876
    # 10     j 0.5276048 0.3236814   20.57659
    # ..   ...       ...       ...        ...

Wrapping the result in list(as_tibble(as.list(...))) and then calling unnest() from tidyr does the trick:

    library(tidyr)

    toy_summary2 <- toy_df %>%
      group_by(group) %>%
      summarize(Col = list(as_tibble(as.list(toy_fn(value))))) %>%
      unnest(Col)  # name the list column explicitly
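In more recent dplyr releases (1.0.0 and later, as far as I know), summarise() unpacks an unnamed data frame result into separate columns, so the list-column and unnest steps may not be needed at all. A minimal sketch under that assumption, with toy_summary3 as an illustrative name:

```r
library(dplyr)

# Toy function and data, repeated from the question so the sketch runs on its own.
toy_fn <- function(x) {
  y <- c(mean(x), sum(x), median(x), sd(x))
  names(y) <- c("Right", "Wrong", "Unanswered", "Invalid")
  y
}

set.seed(1234567)
toy_df <- data.frame(id = 1:1000,
                     group = sample(letters, 1000, replace = TRUE),
                     value = runif(1000))

# The summary expression is left unnamed on purpose: an unnamed one-row
# data frame is spread into one output column per element of the vector.
toy_summary3 <- toy_df %>%
  group_by(group) %>%
  summarise(as.data.frame(as.list(toy_fn(value))))

head(toy_summary3)
```

On older dplyr versions this form is expected to fail, in which case the list-and-unnest approach above remains the fallback.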

Source: https://habr.com/ru/post/987867/

