Creating models and augmenting data without losing additional columns in dplyr / broom

Question

Consider the following data / example. Each data set contains several samples with one observation and one estimate:

    library(tidyverse)
    library(broom)

    data = read.table(text = '
    dataset sample_id observation estimate
    A       A1        4.8         4.7
    A       A2        4.3         4.5
    A       A3        3.1         2.9
    A       A4        2.1         2
    A       A5        1.1         1
    B       B1        4.5         4.3
    B       B2        3.9         4.1
    B       B3        2.9         3
    B       B4        1.8         2
    B       B5        1           1.2
    ', header = TRUE)

I want to fit a linear model for each data set in order to remove any linear bias between observation and estimate, and get the fitted values next to the original ones:

    data %>%
      group_by(dataset) %>%
      do(lm(observation ~ estimate, data = .) %>% augment)

However, this drops the sample_id column, which I need to keep for further calculations that rely on this unique identifier:

    # A tibble: 10 x 10
    # Groups:   dataset [2]
       dataset observation estimate .fitted .se.fit   .resid  .hat .sigma  .cooksd .std.resid
       <fct>         <dbl>    <dbl>   <dbl>   <dbl>    <dbl> <dbl>  <dbl>    <dbl>      <dbl>
     1 A              4.80     4.70    4.68   0.107    0.115 0.478  0.152    0.491       1.04
     2 A              4.30     4.50    4.49  0.0996   -0.193 0.416 0.0609    0.957      -1.64
     3 A              3.10     2.90    2.97  0.0693    0.135 0.201  0.156    0.120      0.976
     4 A              2.10     2.00    2.11  0.0849 -0.00583 0.303  0.189 0.000444    -0.0452
     5 A              1.10     1.00    1.15   0.120  -0.0508 0.602  0.180    0.206     -0.521
     6 B              4.50     4.30    4.31   0.109    0.191 0.468 0.0597     1.20       1.65
     7 B              3.90     4.10    4.09   0.100   -0.193 0.396 0.0844    0.798      -1.56
     8 B              2.90     3.00    2.91  0.0713 -0.00630 0.201  0.195 0.000247    -0.0443
     9 B              1.80     2.00    1.83  0.0898  -0.0275 0.319  0.193   0.0103     -0.210
    10 B              1.00     1.20   0.964   0.125   0.0355 0.616  0.191    0.104      0.360

How can I keep an extra column from my source dataset?

I saw this answer, which uses nest() to collapse the data first, but with that approach I still only get the model parameters. I think I could extract the parameters for each data set:

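    # linear_adj_model is not defined in the question; it is assumed here
    # to be a thin wrapper around lm()
    linear_adj_model <- function(df) lm(observation ~ estimate, data = df)

    data %>%
      group_by(dataset) %>%
      nest() %>%
      mutate(
        mod = map(data, linear_adj_model),
        pars = map(mod, tidy)
      ) %>%
      unnest(pars) %>%
      select(dataset, term, estimate) %>%
      spread(term, estimate)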

... which gives me this:

    # A tibble: 2 x 3
      dataset `(Intercept)` estimate
    * <fct>           <dbl>    <dbl>
    1 A               0.196    0.955
    2 B              -0.330     1.08

... and then left join with the source data, and then mutate each estimate to get a linearly corrected one, but that seems too complicated.
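For comparison, the join-based route described above might look roughly like the following. This is only a sketch under assumptions: the names coefs, intercept, slope and corrected are made up for illustration, and group_modify() and pivot_wider() (both from reasonably recent dplyr/tidyr versions) stand in for the nest()/spread() calls used in the question.

```r
library(dplyr)
library(tidyr)
library(broom)

data <- read.table(text = "
dataset sample_id observation estimate
A A1 4.8 4.7
A A2 4.3 4.5
A A3 3.1 2.9
A A4 2.1 2
A A5 1.1 1
B B1 4.5 4.3
B B2 3.9 4.1
B B3 2.9 3
B B4 1.8 2
B B5 1 1.2
", header = TRUE)

# One row per data set holding the fitted intercept and slope.
# The names intercept/slope are illustrative; they avoid a clash with
# the existing `estimate` column of the source data.
coefs <- data %>%
  group_by(dataset) %>%
  group_modify(~ tidy(lm(observation ~ estimate, data = .x))) %>%
  ungroup() %>%
  select(dataset, term, value = estimate) %>%
  pivot_wider(names_from = term, values_from = value) %>%
  rename(intercept = `(Intercept)`, slope = estimate)

# Join the coefficients back onto the source data and apply the
# linear correction by hand; sample_id is never touched.
corrected <- data %>%
  left_join(coefs, by = "dataset") %>%
  mutate(corrected = intercept + slope * estimate) %>%
  select(-intercept, -slope)
```

All original columns survive because the model output is only ever joined onto the source data, at the cost of recomputing the fitted values manually.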

Another ugly hack I found is to add a column as a dummy variable to the model:

    data %>%
      group_by(dataset) %>%
      do(lm(observation ~ estimate + 0 * sample_id, data = .) %>% augment)

Is there a simpler (tidy) solution that does not involve manually specifying the variables that I want to keep?

+5

r dplyr tidyverse broom

slhck Feb 05 '18 at 16:23

4 answers

    numbered_data <- data %>% mutate(row = row_number())

    numbered_data %>%
      group_by(dataset) %>%
      do(augment(lm(observation ~ estimate + 0*row, data = .))) %>%
      left_join(numbered_data %>% select(-observation, -estimate),
                by = c('dataset', 'row')) %>%
      select(-row)

This approach also relies on using a dummy variable in the model, but it does so with a column that is agnostic to the data. Having defined a new dummy column row, you can left_join() your original data back onto the results of augment(), restoring an arbitrary number of columns without specifying them manually.

I find this more readable than the other solutions, but it's still a bit hacky. Getting rid of the duplicate columns from left_join() is a bit tedious. You probably don't want columns like observation.x and estimate.y in your output, which is what you get if you leave out the select(-observation, -estimate).

+1

Curt F. Feb 05 '18 at 23:51

This is essentially the same as markus's answer, but perhaps a little cleaner.

    library(tidyverse)
    library(broom)

    data = read.table(text = '
    dataset sample_id observation estimate
    A       A1        4.8         4.7
    A       A2        4.3         4.5
    A       A3        3.1         2.9
    A       A4        2.1         2
    A       A5        1.1         1
    B       B1        4.5         4.3
    B       B2        3.9         4.1
    B       B3        2.9         3
    B       B4        1.8         2
    B       B5        1           1.2
    ', header = TRUE)

    data %>%
      group_by(dataset) %>%
      nest() %>%
      mutate(mod = map(data, ~lm(observation ~ estimate, data = .)),
             aug = map2(mod, data, ~augment_columns(.x, .y))) %>%
      unnest(aug)

+1

Sam abbott Feb 07 '18 at 16:07

How about this:

    # `data` is the data frame from the question
    data %>%
      group_by(dataset) %>%
      do(cbind(sample_id = .$sample_id,
               lm(observation ~ estimate, data = .) %>% augment))

Too ugly?

0

Rollingandc Feb 05 '18 at 17:16

markus · Accepted Answer · 2018-02-05T18:34:17+0000

You can use broom::augment_columns instead of augment. The two arguments of the function that we need are x (the "model") and data (the "source data to which the columns should be added").

    library(tidyverse)
    library(broom)

    split(data, data$dataset) %>%
      map(., ~lm(formula = observation ~ estimate, data = .)) %>%
      map2(.x = .,
           .y = split(data, f = data$dataset),
           .f = ~augment_columns(x = .x, data = .y)) %>%
      bind_rows() %>%
      select(-.rownames)
    #    dataset sample_id observation estimate   .fitted    .se.fit       .resid      .hat     .sigma      .cooksd  .std.resid
    # 1        A        A1         4.8      4.7 4.6845093 0.10675590  0.115490737 0.4781238 0.15157780 0.4911635931  1.03547990
    # 2        A        A2         4.3      4.5 4.4934963 0.09956065 -0.193496255 0.4158455 0.06089193 0.9570799385 -1.63978525
    # 3        A        A3         3.1      2.9 2.9653922 0.06929022  0.134607804 0.2014190 0.15623754 0.1200409795  0.97563873
    # 4        A        A4         2.1      2.0 2.1058337 0.08491818 -0.005833662 0.3025227 0.18902495 0.0004439221 -0.04524332
    # 5        A        A5         1.1      1.0 1.1507686 0.11979870 -0.050768624 0.6020891 0.18032220 0.2055920869 -0.52129162
    # 6        B        B1         4.5      4.3 4.3087226 0.10879087  0.191277434 0.4679235 0.05965705 1.1954021471  1.64881395
    # 7        B        B2         3.9      4.1 4.0929657 0.10006757 -0.192965672 0.3958920 0.08438937 0.7984863377 -1.56105324
    # 8        B        B3         2.9      3.0 2.9063028 0.07128455 -0.006302757 0.2009004 0.19471901 0.0002470587 -0.04433279
    # 9        B        B4         1.8      2.0 1.8275183 0.08983650 -0.027518289 0.3190771 0.19335019 0.0103015495 -0.20968503
    # 10       B        B5         1.0      1.2 0.9644907 0.12484420  0.035509285 0.6162071 0.19051943 0.1042741368  0.36040302

The idea is to split data by dataset, fit a model to each element of the resulting list, and then use map2 to iterate in parallel over the models and the (complete) data used to build them, i.e. split(data, f = data$dataset).
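To make the split / fit / recombine idea concrete without any package dependencies, here is a minimal base R sketch. It is an illustration of the same pattern, not the code from this answer: Map() plays the role of map2, only fitted values and residuals are attached (augment_columns adds more), and the names pieces, models and result are invented for the example.

```r
data <- read.table(text = "
dataset sample_id observation estimate
A A1 4.8 4.7
A A2 4.3 4.5
A A3 3.1 2.9
A A4 2.1 2
A A5 1.1 1
B B1 4.5 4.3
B B2 3.9 4.1
B B3 2.9 3
B B4 1.8 2
B B5 1 1.2
", header = TRUE)

# 1. Split the full data (all columns) into one data frame per data set.
pieces <- split(data, data$dataset)

# 2. Fit one linear model per piece.
models <- lapply(pieces, function(d) lm(observation ~ estimate, data = d))

# 3. Walk over models and pieces in parallel and bind the model output
#    onto the complete piece, so sample_id is carried along untouched.
augmented <- Map(
  function(m, d) cbind(d, .fitted = fitted(m), .resid = resid(m)),
  models, pieces
)

# 4. Stack the pieces back into a single data frame.
result <- do.call(rbind, augmented)
```

Because each model is paired with the complete piece it was fitted on, no column has to be named explicitly to survive.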

augment_columns adds a .rownames column, hence the select(-.rownames) in the last line.
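A possibly simpler variant, given as a sketch rather than a definitive recipe: augment() for lm objects also accepts a data argument, so handing it the full group should carry columns such as sample_id through. The use of group_modify() (which requires a reasonably recent dplyr) and the name result are assumptions of this example and not part of the answer above.

```r
library(dplyr)
library(broom)

data <- read.table(text = "
dataset sample_id observation estimate
A A1 4.8 4.7
A A2 4.3 4.5
A A3 3.1 2.9
A A4 2.1 2
A A5 1.1 1
B B1 4.5 4.3
B B2 3.9 4.1
B B3 2.9 3
B B4 1.8 2
B B5 1 1.2
", header = TRUE)

result <- data %>%
  group_by(dataset) %>%
  # .x is the current group without the grouping column; passing it as
  # `data` asks augment() to append its columns to the full group
  # instead of only the variables used in the model.
  group_modify(~ augment(lm(observation ~ estimate, data = .x), data = .x)) %>%
  ungroup()
```

The exact set of diagnostic columns returned may differ between broom versions, so it is worth checking names(result) before relying on a particular one.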

Edit

The same solution, but hopefully easier to read.

    data_split <- split(data, data$dataset)

    models <- map(data_split, ~lm(formula = observation ~ estimate, data = .))

    map2(.x = models, .y = data_split, .f = ~augment_columns(x = .x, data = .y)) %>%
      bind_rows() %>%
      select(-.rownames)

The first code block, written as a function with four arguments: df, split_var, dependend_var and explanatory_var.

    augment_df <- function(df, split_var, dependend_var, explanatory_var) {
      require(tidyverse)
      require(broom)
      split(df, df[split_var]) %>%
        map(., ~lm(formula = as.formula(paste0(dependend_var, " ~ ", explanatory_var)),
                   data = .)) %>%
        map2(.x = .,
             .y = split(df, df[split_var]),
             .f = ~augment_columns(x = .x, data = .y)) %>%
        bind_rows() %>%
        select(-.rownames)
    }

    augment_df(df = data,
               split_var = "dataset",
               dependend_var = "observation",
               explanatory_var = "estimate")
