I am having trouble with a procedure using the dplyr package. In short, I have a function that takes a data frame as input and returns a single (numeric) value, and I would like to apply this function to several subsets of the data. It seems to me that I should be able to use group_by() to define subsets of the data frame and then pipe the result into summarize(), but I'm not sure how to pass the (subset) data frame to the function I'd like to apply.
As a simplified example, suppose I use the iris dataset and have a fairly simple function that I would like to apply to several subsets of the data:
    data(iris)
    lm.func = function(.data){
      lm.fit = lm(Petal.Width ~ Petal.Length, data = .data)
      out = summary(lm.fit)$coefficients[2,1]
      return(out)
    }
Now, I would like to apply this function to subsets of iris based on some other variable, for example Species. I can filter the data manually and then pipe it into my function, for example:
iris %>% filter(Species == "setosa") %>% lm.func(.)
But I would like to apply lm.func to every subset of the data defined by Species. My first thought was to try something like the following:
iris %>% group_by(Species) %>% summarize(coef.val = lm.func(.))
I know that this does not work, but the idea is to pass each subset of iris to the lm.func function.
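To make the problem concrete, here is a small check of what that attempt seems to do. This is only a sketch, assuming a reasonably recent dplyr (1.x) together with the magrittr pipe; the names `attempt` and `whole` are mine and purely illustrative. My understanding is that `.` inside the pipe refers to the entire (grouped) data frame rather than to the current group, so every group receives the same value.

```r
library(dplyr)

# Same function as above: slope of Petal.Width on Petal.Length
lm.func <- function(.data) {
  lm.fit <- lm(Petal.Width ~ Petal.Length, data = .data)
  summary(lm.fit)$coefficients[2, 1]
}

# The attempt from above: `.` is the whole grouped data frame, not one group
attempt <- iris %>%
  group_by(Species) %>%
  summarize(coef.val = lm.func(.))

# Slope fitted on all of iris, for comparison
whole <- lm.func(iris)

# Assumption being checked: every row of `attempt` repeats the whole-data slope
print(attempt)
print(whole)
```

If the assumption holds, all three rows of `attempt` show the same number as `whole`, which is not the per-species result I am after.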
To clarify, I would eventually like to create a data frame with two columns: the first with each level of the grouping variable, and the second with the output of lm.func when the data is restricted to the subset specified by that level.
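To show the shape of the result I am after, here is a minimal base R sketch that builds it by hand with split() and sapply(). It is not the dplyr solution I am looking for, only an illustration of the target output; the names `pieces`, `coefs`, and `target` are hypothetical.

```r
# Same function as above: slope of Petal.Width on Petal.Length
lm.func <- function(.data) {
  lm.fit <- lm(Petal.Width ~ Petal.Length, data = .data)
  summary(lm.fit)$coefficients[2, 1]
}

# Split iris into one data frame per species, then apply lm.func to each piece
pieces <- split(iris, iris$Species)
coefs <- sapply(pieces, lm.func)

# Two columns: the grouping level and the value lm.func returns for that subset
target <- data.frame(Species = names(coefs), coef.val = unname(coefs))
print(target)
```

Each row of `target` should match what the manual filter(Species == ...) approach returns for that species.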
Is it possible to use summarize() in this way?