I built the lm model without using the data= parameter:
m1 <- lm( mdldvlp.trim$y ~ gc.pc$scores[,1] + gc.pc$scores[,2] + gc.pc$scores[,3] + gc.pc$scores[,4] + gc.pc$scores[,5] + gc.pc$scores[,6] + predict(gc.tA))
Now I would like to predict from m1 using newdata, and to name the columns of my new data.frame so that they match the variables used in the lm() call above.
With newComps as my new gc.pc scores (which, like the gc.tA prediction, were predicted from the new data.frame without any problems), I tried each of the following naming strategies in turn:
newD <- data.frame( newComps[1:100,1:6] , predict(gc.tA , newdata = mdldvlp[1:100,predKept]))
names(newD) <- names(m1$coefficients)[-1]
names(newD) <- names(m1$model)[-1]
names(newD) <- c( "gc.pc$scores[, 1]" , "gc.pc$scores[, 2]" , "gc.pc$scores[, 3]" , "gc.pc$scores[, 4]" , "gc.pc$scores[, 5]" , "gc.pc$scores[, 6]" , "predict(gc.tA)" )
names(newD) <- c( "gc.pc$scores[,1]" , "gc.pc$scores[,2]" , "gc.pc$scores[,3]" , "gc.pc$scores[,4]" , "gc.pc$scores[,5]" , "gc.pc$scores[,6]" , "predict(gc.tA)" )
Unfortunately, predict.lm does not accept any of the naming strategies above. It returns a dangerous newdata warning along with predictions from the original data that m1 was built on:
Warning message: 'newdata' had 100 rows but variable(s) found have 1414 rows
How should I name the newD columns so that the predict call uses them? Thanks.
The code below recreates the problem:
require(rpart)
set.seed(123)
X <- matrix(runif(200) , 20 , 10)
gc.pc <- princomp(X)
y <- runif(20)
mdldvlp.trim <- data.frame(y,X)
names(mdldvlp.trim) <- c("y",paste("x",1:10,sep=""))
predKept <- paste("x",1:10,sep="")
gc.tA <- rpart( y ~ . , data = mdldvlp.trim)
m1 <- lm( mdldvlp.trim$y ~ gc.pc$scores[,1] + gc.pc$scores[,2] + gc.pc$scores[,3] + gc.pc$scores[,4] + gc.pc$scores[,5] + gc.pc$scores[,6] + predict(gc.tA))
mdldvlp <- data.frame(matrix(runif(2000) , 200 , 10))
names(mdldvlp) <- predKept
newComps <- predict( gc.pc , newdata=mdldvlp )
newD <- data.frame( newComps[1:100,1:6] , predict(gc.tA , newdata = mdldvlp[1:100,predKept]))
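The same symptom shows up in a much smaller case. The following is only a sketch: the objects d, nd, fit and p are hypothetical names that are not part of the model above. It fits lm without data=, renames the new column to match the model frame, and then checks how many predictions come back.

```r
set.seed(1)
d <- data.frame(x = runif(20), y = runif(20))

# Fit without data=, so the formula refers to d$y and d$x directly
fit <- lm(d$y ~ d$x)

# The model frame stores its columns under the deparsed expressions
names(fit$model)

# New data with 5 rows; the column is renamed to match the model frame
nd <- data.frame(runif(5))
names(nd) <- names(fit$model)[-1]

# predict() re-evaluates the expression d$x. There is no object called
# d inside nd, so the lookup falls back to the d in the workspace and
# the result has one prediction per row of d, not per row of nd.
p <- suppressWarnings(predict(fit, newdata = nd))
length(p)
```

The warning is suppressed here only to keep the output short. Without suppressWarnings() the call raises the same kind of 'newdata' row-count warning as above.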
4/20 Follow up:
Thank you all for your answers. I understand that it would be simpler to first create a data.frame with correctly named predictors. My question is different: the model frame does seem to hold variables named gc.pc$scores[,1] etc., so why can't the naming strategies above be used with predict.lm? In other words, does lm really evaluate its model frame with gc.pc$scores[,1] and so on? If so, why didn't the renaming strategies above work in predict.lm?
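For reference, the data.frame-based approach mentioned above could look like the sketch below. It is not the original m1: the objects train, m2, newD2 and p2 and the column name tree are hypothetical names introduced for illustration, and the simulated data only mirror the reproduction code. It relies on princomp labelling its score columns Comp.1, Comp.2, ... both in gc.pc$scores and in the output of predict(gc.pc, newdata = ...), so the training and new data.frames end up with matching column names.

```r
library(rpart)
set.seed(123)

# Simulated training data, similar to the reproduction code
X <- matrix(runif(200), 20, 10)
colnames(X) <- paste0("x", 1:10)
mdldvlp.trim <- data.frame(y = runif(20), X)

gc.pc <- princomp(X)
gc.tA <- rpart(y ~ ., data = mdldvlp.trim)

# Collect the response and predictors in one data.frame with plain
# column names: y, Comp.1 ... Comp.6, tree
train <- data.frame(y = mdldvlp.trim$y,
                    gc.pc$scores[, 1:6],
                    tree = predict(gc.tA))
m2 <- lm(y ~ ., data = train)

# Simulated new data with the same predictor names
mdldvlp <- as.data.frame(matrix(runif(2000), 200, 10))
names(mdldvlp) <- paste0("x", 1:10)
newComps <- predict(gc.pc, newdata = mdldvlp)

# Same column names as in train (without y)
newD2 <- data.frame(newComps[1:100, 1:6],
                    tree = predict(gc.tA, newdata = mdldvlp[1:100, ]))

# One prediction per row of newD2
p2 <- predict(m2, newdata = newD2)
length(p2)
```

Because the formula y ~ . refers only to column names, predict looks them up in newD2 rather than falling back to objects in the workspace.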