Will there be NA?
Yes. Adding these columns does not enlarge the column space. The resulting model matrix is rank-deficient.
How many NA?
It depends on the numerical rank.
number of NA = number of coefficients - rank of model matrix
In your example, after introducing ec, there will be one NA. Changing the order in which the covariates are specified in the model formula only shuffles the columns of the model matrix. This does not change the rank of the matrix, so you will always get exactly one NA regardless of the specification order.
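The counting rule can be checked directly on a fitted model. The following is a minimal sketch on made-up data; the variables x1 to x4 and y are hypothetical and are not part of your example.

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2        # exact linear combination of x1 and x2
x4 <- 2 * x1 - x2    # a second exact linear combination
y  <- rnorm(n)

fit <- lm(y ~ x1 + x2 + x3 + x4)

n_coef <- length(coef(fit))      # intercept plus four slopes
n_na   <- sum(is.na(coef(fit)))  # coefficients that lm could not identify

# number of NA = number of coefficients - rank of model matrix
c(n_coef = n_coef, rank = fit$rank, n_na = n_na)
```

Here fit$rank is the numerical rank that lm detected for the model matrix, so the difference n_coef - fit$rank should match the number of NA coefficients.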
OK, but which one is NA?
lm performs a LINPACK QR factorization with limited column pivoting. The order of the covariates determines which coefficient is NA. As a rule, the "first come, first served" principle applies, and the position of NA is quite predictable. Take your examples for illustration. In the first specification, the collinear terms appear in the order Examination, Catholic, ec, so the third one, ec, has an NA coefficient. In your second specification, the terms appear in the order ec, Examination, Catholic, so the third one, Catholic, has an NA coefficient. Note that the coefficient estimates are not invariant to the specification order, although the fitted values are.
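A short sketch of this behaviour on the built-in swiss data follows. The definition of ec is not shown in this answer, so the code assumes ec = Examination + Catholic purely for illustration; any exact linear combination of the two would behave the same way.

```r
# Assumption: ec is the sum of Examination and Catholic (hypothetical definition)
dat <- transform(swiss, ec = Examination + Catholic)

# same three collinear terms, two different specification orders
fit1 <- lm(Fertility ~ Examination + Catholic + ec, data = dat)
fit2 <- lm(Fertility ~ ec + Examination + Catholic, data = dat)

# which coefficient is NA in each fit
na1 <- names(which(is.na(coef(fit1))))
na2 <- names(which(is.na(coef(fit2))))

# fitted values should agree even though the coefficients differ
same_fit <- isTRUE(all.equal(fitted(fit1), fitted(fit2)))

print(na1)
print(na2)
print(same_fit)
print(coef(fit1))
print(coef(fit2))
```

In both fits the last of the three collinear terms is the one that gets dropped, while the fitted values are the same.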
If a LAPACK QR factorization with complete column pivoting is used instead, the coefficient estimates are invariant to the specification order. However, the position of NA is not as predictable as in the LINPACK case; it is decided purely numerically.
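Base R exposes a LAPACK column-pivoted QR through qr(X, LAPACK = TRUE), which can be used to look at the pivoting directly. The sketch below uses a made-up 200 x 4 matrix, and the relative threshold 1e-7 is an assumption chosen for illustration, not a value taken from lm or mgcv. In the LAPACK case the rank component of the result is not informative, so the rank is read off the diagonal of R.

```r
set.seed(2)
X <- matrix(rnorm(200 * 4), nrow = 200, ncol = 4)
X[, 2] <- X[, 1] + X[, 3]   # column 2 is an exact combination of columns 1 and 3

# QR with column pivoting via LAPACK
q <- qr(X, LAPACK = TRUE)

# magnitudes of the diagonal of R, in pivoted order
d <- abs(diag(qr.R(q)))

# numerical rank: diagonal entries that are not negligible relative to the largest
num_rank <- sum(d > 1e-7 * d[1])

print(q$pivot)    # column order chosen by the pivoting
print(d / d[1])   # the last entry should be close to zero
print(num_rank)
```

Which of the collinear columns ends up last in q$pivot depends on the column norms, so it is decided by the data rather than by the position of the column in X.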
Numerical examples
LAPACK-based QR factorization is implemented in the mgcv package. Numerical rank is detected when REML estimation is used, and unidentifiable coefficients are reported as 0 (not NA). Therefore, we can compare lm with gam / bam when fitting a linear model. First, we build a toy dataset.
set.seed(0)
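The lines that construct X and Y are not shown above. The following is a minimal sketch of one way to build a suitable rank-deficient design; the dimensions (500 x 10), the choice of the last column as the dependent one, and the random response are all assumptions, not the original construction.

```r
# Hypothetical toy data (an assumption, not the original construction)
n <- 500
p <- 10

# start from a full-rank matrix of uniform random numbers
X <- matrix(runif(n * p), nrow = n, ncol = p)

# overwrite the last column with a random linear combination of the others,
# so that X has rank p - 1
X[, p] <- X[, -p] %*% runif(p - 1)

# a random response, unrelated to X
Y <- rnorm(n)

# numerical rank as detected by base R's default QR
print(qr(X)$rank)
```

With data built this way, the coefficient that gam and bam report as 0 is not necessarily the X5 quoted in the results further down, since that depends on the actual columns.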
Now we shuffle the columns of X to see whether NA changes its position when fitting with lm, or whether 0 changes its position when fitting with gam and bam.
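test <- function (fun = lm, seed = 0, ...) {
  # fit the model once on a random column permutation of X
  shuffleFit <- function (fun) {
    shuffle <- sample.int(ncol(X))
    Xs <- X[, shuffle]
    b <- unname(coef(fun(Y ~ Xs, ...)))
    # map the slope coefficients back to the original column order
    back <- order(shuffle)
    c(b[1], b[-1][back])
  }
  set.seed(seed)
  # one row per shuffle, one column per coefficient
  oo <- t(replicate(10, shuffleFit(fun)))
  colnames(oo) <- c("intercept", paste0("X", 1:ncol(X)))
  oo
}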
First we check with lm :
test(fun = lm)
We see that NA changes its position as the columns of X are shuffled. The estimated coefficients also vary.
Now we check with gam:
library(mgcv)
test(fun = gam, method = "REML")
We see that the estimates are invariant to shuffling the columns of X, and the coefficient for X5 is always 0.
Finally, we check bam (bam is slow for a small dataset like this one. It is designed for large or very large datasets, so the following is noticeably slower).
test(fun = bam, gc.level = -1)
The result is the same as for gam .