model.frame and update

In R, you can fit a model with a log-transformed dependent variable:

mfit <- lm( formula = log(salary) ~ yrs.service + yrs.since.phd, data = Salaries ) 

Then you might resample the rows of the model frame and call update to refit the model:

 n <- nrow(Salaries)
 mfr <- model.frame(mfit)[sample(1:n, size=n, replace=TRUE),]
 mfit2 <- update(mfit, data = mfr)

This will result in an error:

 Error in eval(expr, envir, enclos) : object 'salary' not found 

The reason is that the formula still has log(salary) as its dependent variable, while the corresponding column in the model frame is literally named log(salary). R looks for a variable named salary so that it can call log on it, and the model frame has no such column. The same error would occur without resampling; the example just shows why refitting on the model frame might be needed.
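To see the mismatch directly, here is a minimal sketch. It assumes the built-in mtcars data set as a stand-in for Salaries, so the column names (mpg, wt, hp) are illustrative and not part of the original example.

```r
# Assumption: mtcars (bundled with R) stands in for Salaries.
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)
mfr <- model.frame(fit)

# The response is stored under the literal name "log(mpg)";
# the model frame has no column called "mpg".
print(names(mfr))
print("mpg" %in% names(mfr))

# Refitting on the model frame re-evaluates log(mpg) and fails,
# so the error message is captured here instead of stopping the script.
res <- tryCatch(
  update(fit, data = mfr),
  error = function(e) conditionMessage(e)
)
print(res)
```

The captured message should name the missing raw variable, mirroring the error shown above.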

The procedure above is used in bootstrapping, where rows are resampled. Is this behavior expected, or is it a bug? I know you can work around this by transforming the variables in the data frame passed as the data argument, but it seems annoying and missing ...

2 answers

Instead of sampling from the result of model.frame, you can sample from na.omit(get_all_vars(myformula, Salaries)). Your example then becomes:

 myformula <- log(salary) ~ yrs.service + yrs.since.phd
 mfit <- lm(formula = myformula, data = Salaries)
 n <- nrow(Salaries)
 newdata <- na.omit(get_all_vars(myformula, Salaries))[sample(1:n, size=n, replace=TRUE),]
 mfit2 <- update(mfit, data = newdata)

We can use the following simple example to confirm that model.frame(myformula, df) and na.omit(get_all_vars(myformula, df)) select the same raw (non-transformed) data from the data frame:

 df <- data.frame(w = rnorm(10), x = rnorm(10), y = rnorm(10), z = rnorm(10))
 df[1, 1] <- NA
 df[2, 2] <- NA
 df[3, 3] <- NA
 df[4, 4] <- NA
 identical(data.frame(na.omit(get_all_vars(z ~ w + x, df))),
           data.frame(model.frame(z ~ w + x, df)))
 # [1] TRUE

Note that I wrapped the results of na.omit(get_all_vars(...)) and model.frame(...) in data.frame to strip extra attributes before comparing. Of course, model.frame does additional work, such as applying the log transformation to salary in your example. But if all you need is to resample the original data, then na.omit(get_all_vars(...)) works fine, and you can pass the new data frame to lm or update.
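As a rough sketch of how this fits into a bootstrap, the loop below resamples the raw variables and refits the model on each replicate. It assumes the built-in mtcars data set in place of Salaries. The 200 replicates and the seed are arbitrary illustrative choices.

```r
# Assumption: mtcars stands in for Salaries; 200 replicates is arbitrary.
set.seed(42)
form <- log(mpg) ~ wt + hp

# Raw (untransformed) variables used by the formula, complete cases only.
raw <- na.omit(get_all_vars(form, mtcars))
n <- nrow(raw)

# Refit on each resample; the formula applies log() to the raw mpg column.
boot_coefs <- replicate(200, {
  idx <- sample.int(n, size = n, replace = TRUE)
  coef(lm(form, data = raw[idx, ]))
})

# One row per coefficient, one column per replicate.
print(dim(boot_coefs))
print(apply(boot_coefs, 1, sd))
```

Because raw holds mpg rather than log(mpg), the transformation in the formula is evaluated afresh on every resample.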


I do not think this is a bug. A formula can contain functions and operators, for example:

 log(foo)*3 ~ abs(fooller) + fooz 

R cannot distinguish an object named abs(fooller) from the result of calling the abs() function on fooller.

In my view, this is a naming-convention problem. You should not give a variable or column a name that could be mistaken for a function call. You can use salary.log instead.
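A minimal sketch of that suggestion, again assuming mtcars as a stand-in for Salaries. The column name mpg.log is a hypothetical counterpart of salary.log.

```r
# Assumption: mtcars stands in for Salaries; mpg.log plays the role of salary.log.
dat <- transform(mtcars, mpg.log = log(mpg))
fit <- lm(mpg.log ~ wt + hp, data = dat)

# The model frame now has a plain column name that the formula can find again.
mfr <- model.frame(fit)
print(names(mfr))

# Resampling the model frame and refitting works without error.
set.seed(1)
n <- nrow(mfr)
fit2 <- update(fit, data = mfr[sample.int(n, size = n, replace = TRUE), ])
print(coef(fit2))
```

The trade-off is that the transformation is no longer recorded in the formula, so predictions come back on the log scale unless you convert them yourself.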


Source: https://habr.com/ru/post/1400769/

