Does R always return NA as a coefficient as a result of linear regression with unnecessary variables?

My question is about unnecessary predictors, namely variables that do not provide any new linear information or variables that are linear combinations of other predictors. As you can see, the swiss dataset has six variables.

    # swiss ships with base R's datasets package, so no library() call is needed
    names(swiss)
    # "Fertility"   "Agriculture"   "Examination"   "Education"
    # "Catholic"    "Infant.Mortality"

Now I introduce a new variable, ec. It is a linear combination of Examination and Catholic.

 ec <- swiss$Examination + swiss$Catholic 

When we run linear regression with unnecessary variables, R discards terms that are linear combinations of other terms and returns NA as their coefficients. The command below illustrates the point perfectly.

    lm(Fertility ~ . + ec, swiss)

    Coefficients:
         (Intercept)       Agriculture       Examination         Education
             66.9152           -0.1721           -0.2580           -0.8709
            Catholic  Infant.Mortality                ec
              0.1041            1.0770                NA

However, when we put ec first in the formula, followed by all the other regressors, as shown below,

    lm(Fertility ~ ec + ., swiss)

    Coefficients:
         (Intercept)                ec       Agriculture       Examination
             66.9152            0.1041           -0.1721           -0.3621
           Education          Catholic  Infant.Mortality
             -0.8709                NA            1.0770

I would expect the coefficients of both Catholic and Examination to be NA. The variable ec is a linear combination of both of them, but in the end the Examination coefficient is not NA, while the Catholic coefficient is.

Can anyone explain the reason for this?


Will there be NA?

Yes. Adding such columns does not enlarge the column space, so the resulting model matrix is rank-deficient.

How many NA will there be?

It depends on the numerical rank.

 number of NA = number of coefficients - rank of model matrix 

In your example, after the introduction of ec, there will be one NA. Reordering the covariates in the model formula essentially shuffles the columns of the model matrix. This does not change the rank of the matrix, so you will always get exactly one NA regardless of the order you specify.
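The counting rule can be checked directly on the example from the question. This is only a minimal sketch: `dat`, `n_coef` and `n_na` are illustrative names that do not appear in the original post, and it relies on the `rank` component that `lm` stores in the fitted object.

```r
# Copy of swiss with the redundant column added (dat is an illustrative name)
dat <- swiss
dat$ec <- dat$Examination + dat$Catholic

fit <- lm(Fertility ~ ., data = dat)

n_coef <- length(coef(fit))       # intercept + 5 original predictors + ec
n_na   <- sum(is.na(coef(fit)))   # coefficients that could not be estimated

# number of NA = number of coefficients - rank of model matrix
c(coefficients = n_coef, rank = fit$rank, n_na = n_na)
```

With a single redundant column the rank is one less than the number of coefficients, so exactly one coefficient is reported as NA.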

OK, but which one is NA ?

lm performs a LINPACK QR factorization with limited column pivoting. The order of the covariates affects which coefficient is NA. As a rule, it is "first come, first served", so the position of NA is quite predictable. Take your examples. In the first specification, the collinear terms appear in the order Examination, Catholic, ec, so the third one, ec, gets the NA coefficient. In your second specification they appear in the order ec, Examination, Catholic, so the third one, Catholic, gets the NA coefficient. Note that the coefficient estimates are not invariant to the specification order, although the fitted values are.
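The two fits from the question can be compared side by side to see this. The sketch below only reuses the question's own formulas; the object names (`fit1`, `fit2`, `na1`, `na2`, `exam_expected`) are made up for illustration.

```r
# Redundant predictor from the question
ec <- swiss$Examination + swiss$Catholic

fit1 <- lm(Fertility ~ . + ec, data = swiss)   # ec comes last
fit2 <- lm(Fertility ~ ec + ., data = swiss)   # ec comes first

# Which coefficient is dropped depends on the order of the collinear terms
na1 <- names(which(is.na(coef(fit1))))   # "ec", as in the question's output
na2 <- names(which(is.na(coef(fit2))))   # "Catholic", as in the question's output

# The fitted values agree even though the coefficients differ
all.equal(fitted(fit1), fitted(fit2))

# In fit2, ec carries the effect Catholic had in fit1, and since
# ec = Examination + Catholic, Examination absorbs the difference:
# -0.2580 - 0.1041 = -0.3621
exam_expected <- unname(coef(fit1)["Examination"] - coef(fit1)["Catholic"])
all.equal(unname(coef(fit2)["Examination"]), exam_expected)
```

This is why Examination keeps a (different) coefficient in the second fit: both fits describe the same fitted surface, just parameterised with a different pair out of the three collinear terms.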

If a LAPACK QR factorization with full column pivoting is performed instead, the coefficient estimates will be invariant to the specification order. However, the position of NA is not as predictable as in the LINPACK case, because it is decided purely numerically.
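The two pivoting strategies can be inspected with base R's qr(), which has a LAPACK argument. This is a rough sketch rather than a reproduction of what lm does internally: `qr_linpack`, `qr_lapack`, `dropped` and `d` are illustrative names, and because R's documentation says the rank reported in the LAPACK case is always full, the deficiency is read off the diagonal of R instead.

```r
ec <- swiss$Examination + swiss$Catholic
X  <- model.matrix(Fertility ~ ec + ., data = swiss)

# Default LINPACK routine: a column is moved to the end only when it is
# found to be (numerically) dependent on the columns before it
qr_linpack <- qr(X)
qr_linpack$rank                                # numerical rank
dropped <- colnames(X)[qr_linpack$pivot[ncol(X)]]
dropped                                        # the column pushed to the end

# LAPACK routine with full column pivoting: columns are reordered by norm,
# so the pivot order is decided numerically rather than by formula order
qr_lapack <- qr(X, LAPACK = TRUE)
colnames(X)[qr_lapack$pivot]

# |diag(R)| is non-increasing; a tiny last entry reveals the rank deficiency
d <- abs(diag(qr_lapack$qr))
d / d[1]
```

With the LINPACK routine the column that ends up last is the one whose coefficient lm reports as NA; with full pivoting the last pivot need not follow the formula order at all.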


Numerical examples

LAPACK-based QR factorization is implemented in the mgcv package. Numerical rank deficiency is detected when REML estimation is used, and unidentifiable coefficients are reported as 0 (not NA). Therefore, we can compare lm with gam / bam when estimating a linear model. First, build a toy dataset.

    set.seed(0)
    # an initial full-rank matrix
    X <- matrix(runif(500 * 10), 500)
    # make the last column a random linear combination of the previous 9 columns
    X[, 10] <- X[, -10] %*% runif(9)
    # a random response
    Y <- rnorm(500)

Now we shuffle the columns of X to see whether NA changes its position when fitting with lm, or whether 0 changes its position when fitting with gam and bam.

    test <- function (fun = lm, seed = 0, ...) {
      shuffleFit <- function (fun) {
        shuffle <- sample.int(ncol(X))
        Xs <- X[, shuffle]
        b <- unname(coef(fun(Y ~ Xs, ...)))
        back <- order(shuffle)
        c(b[1], b[-1][back])
      }
      set.seed(seed)
      oo <- t(replicate(10, shuffleFit(fun)))
      colnames(oo) <- c("intercept", paste0("X", 1:ncol(X)))
      oo
    }

First we check with lm :

 test(fun = lm) 

We see that NA changes its position as the columns of X are shuffled. The coefficient estimates also vary.
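The "first come, first served" behaviour can also be checked on the toy data. In this dataset every column takes part in the single linear dependency, so whichever column comes last after shuffling should be the one lm drops. The sketch below rebuilds the toy data so that it runs on its own; `na_position` and `res` are hypothetical names, not part of the original answer.

```r
set.seed(0)
X <- matrix(runif(500 * 10), 500)
X[, 10] <- X[, -10] %*% runif(9)   # one exact linear dependency among 10 columns
Y <- rnorm(500)

# For one random shuffle, report which original column was placed last
# and which original column received the NA coefficient
na_position <- function() {
  shuffle <- sample.int(ncol(X))
  Xs <- X[, shuffle]
  b <- coef(lm(Y ~ Xs))[-1]        # drop the intercept
  c(last_column = shuffle[ncol(X)],
    na_column   = shuffle[which(is.na(b))])
}

res <- t(replicate(5, na_position()))
res
```

In each row the two entries should coincide: the column that lm sees last among the dependent ones is the one reported as NA.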


Now we check with gam

 library(mgcv) test(fun = gam, method = "REML") 

We see that the estimates are invariant to shuffling the columns of X, and the coefficient for X5 is always 0.


Finally, we check bam (bam is slow for a small dataset like this one; it is designed for large or very large datasets, so the following is noticeably slower).

 test(fun = bam, gc.level = -1) 

The result is the same as for gam .


ec, Examination and Catholic are three linearly related variables: any two of them determine the third, so only two of the three can receive a coefficient. When you pass them to lm, the first two of the three related variables get coefficients and the third gets NA, so the order of the variables matters. I hope this explains why Examination and Catholic are not both NA: with ec alone you cannot determine both Examination and Catholic.


Source: https://habr.com/ru/post/1269154/

