predict.lm with newdata

Question

I built the lm model without using the data= parameter:

 m1 <- lm( mdldvlp.trim$y ~ gc.pc$scores[,1] + gc.pc$scores[,2] + gc.pc$scores[,3] + gc.pc$scores[,4] + gc.pc$scores[,5] + gc.pc$scores[,6] + predict(gc.tA))

Now I would like to predict from m1 using newdata, and name the columns of my new data.frame so that they match the variables used in the lm() call above.

With newComps as my new gc.pc (which, like the gc.tA prediction, was predicted using the new data.frame without any problems), I tried

 newD <- data.frame( newComps[1:100,1:6] ,
                     predict(gc.tA , newdata = mdldvlp[1:100,predKept]))
 names(newD) <- names(m1$coefficients)[-1]
 names(newD) <- names(m1$model)[-1]
 names(newD) <- c( "gc.pc$scores[, 1]" , "gc.pc$scores[, 2]" , "gc.pc$scores[, 3]" ,
                   "gc.pc$scores[, 4]" , "gc.pc$scores[, 5]" , "gc.pc$scores[, 6]" ,
                   "predict(gc.tA)" )
 names(newD) <- c( "gc.pc$scores[,1]" , "gc.pc$scores[,2]" , "gc.pc$scores[,3]" ,
                   "gc.pc$scores[,4]" , "gc.pc$scores[,5]" , "gc.pc$scores[,6]" ,
                   "predict(gc.tA)" )

Unfortunately, predict.lm does not accept any of the naming strategies above. It returns the newdata warning below, along with predictions for the original data that m1 was fitted on:

 Warning message: 'newdata' had 100 rows but variable(s) found have 1414 rows

How should I name the columns of newD so that the predict() call works? Thanks.

The code below recreates the problem:

 require(rpart)
 set.seed(123)
 X <- matrix(runif(200) , 20 , 10)
 gc.pc <- princomp(X)
 y <- runif(20)
 mdldvlp.trim <- data.frame(y,X)
 names(mdldvlp.trim) <- c("y",paste("x",1:10,sep=""))
 predKept <- paste("x",1:10,sep="")
 gc.tA <- rpart( y ~ . , data = mdldvlp.trim)
 m1 <- lm( mdldvlp.trim$y ~ gc.pc$scores[,1] + gc.pc$scores[,2] + gc.pc$scores[,3] +
           gc.pc$scores[,4] + gc.pc$scores[,5] + gc.pc$scores[,6] + predict(gc.tA))
 mdldvlp <- data.frame(matrix(runif(2000) , 200 , 10))
 names(mdldvlp) <- predKept
 newComps <- predict( gc.pc , newdata=mdldvlp )
 newD <- data.frame( newComps[1:100,1:6] ,
                     predict(gc.tA , newdata = mdldvlp[1:100,predKept]))
 # enter newD naming strategy here
 predict( m1 , newdata=newD )

4/20 Follow up:

Thank you all for your answers. I realize it would be simpler to first create a data.frame with properly named predictors; I get that. My question is this: if the model frame really does hold variables named gc.pc$scores[,1] etc., why don't the naming strategies above work with predict.lm? In other words, does lm really build its model frame with variables named gc.pc$scores[,1] and so on? If so, shouldn't the renaming strategies above work in predict.lm?

+6

r

M.Dimo Apr 20 '12 at 3:54

3 answers

Gavin Simpson · Answer 1 · 2012-04-20T07:42:29+0000

You are misusing the formula interface, and that is what is causing your problems. Essentially your formula:

 m1 <- lm( mdldvlp.trim$y ~ gc.pc$scores[,1] + gc.pc$scores[,2] + gc.pc$scores[,3] + gc.pc$scores[,4] + gc.pc$scores[,5] + gc.pc$scores[,6] + predict(gc.tA))

will be evaluated into a model frame with variables named gc.pc$scores[, 1], etc. When you use predict(), it will look for those same variables in the object passed to the newdata argument.
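On the follow-up question of why renaming the columns of newD does not help: it appears that predict() does not match those labels as plain column names but re-evaluates each term as an R expression. An expression such as gc.pc$scores[, 1] therefore resolves to the original gc.pc object in the workspace, not to a column of newdata that merely carries that label. A minimal sketch of the effect, using made-up objects M, y and fit that are not part of the question:

```r
set.seed(1)
M <- matrix(runif(40), 20, 2)       # hypothetical 20-row predictor matrix
y <- 2 * M[, 1] + rnorm(20)

# Fit without data=, indexing the matrix inside the formula
fit <- lm(y ~ M[, 1])

# The model frame column is labelled with the deparsed expression
names(model.frame(fit))             # "y" and "M[, 1]"

# A new data frame whose single column carries exactly that label
newD <- data.frame(runif(5))
names(newD) <- "M[, 1]"

# Without suppressWarnings(), predict() warns that 'newdata' had 5 rows
# but the variables found have 20 rows
p <- suppressWarnings(predict(fit, newdata = newD))

length(p)                           # 20, not 5
all.equal(unname(p), unname(fitted(fit)))
```

The label in newD is never consulted, because M[, 1] is evaluated as a call on M, and M is found outside newdata. Giving the predictors ordinary names and passing data= sidesteps this.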

Ideally, you should create a data object containing all the variables you want in the model, with suitable names, for example:

 fitData <- data.frame(mdldvlp.trim$y, gc.pc$scores[, 1:6], predict(gc.tA))
 names(fitData) <- c("trimY", paste("scores", 1:6, sep = ""), "preds")

and then fit the model with:

 m1 <- lm(trimY ~ ., data = fitData)

New predictions can then be made from the model by supplying a data frame with the same names that were used to fit the model. So, using newD:

 newD <- data.frame(newComps[1:100,1:6] ,
                    predict(gc.tA , newdata = mdldvlp[1:100,predKept]))
 names(newD) <- c(paste("scores", 1:6, sep = ""), "preds")

and then predict()

 predict(m1 , newdata=newD)

Full example

 require(rpart)
 set.seed(123)
 X <- matrix(runif(200) , 20 , 10)
 gc.pc <- princomp(X)
 y <- runif(20)
 mdldvlp.trim <- data.frame(y,X)
 names(mdldvlp.trim) <- c("y",paste("x",1:10,sep=""))
 predKept <- paste("x",1:10,sep="")
 gc.tA <- rpart( y ~ . , data = mdldvlp.trim)
 fitData <- data.frame(mdldvlp.trim$y, gc.pc$scores[, 1:6], predict(gc.tA))
 names(fitData) <- c("trimY", paste("scores", 1:6, sep = ""), "preds")
 m1 <- lm(trimY ~ ., data = fitData)
 mdldvlp <- data.frame(matrix(runif(2000) , 200 , 10))
 names(mdldvlp) <- predKept
 newComps <- predict( gc.pc , newdata=mdldvlp )
 newD <- data.frame(newComps[1:100,1:6] ,
                    predict(gc.tA , newdata = mdldvlp[1:100,predKept]))
 names(newD) <- c(paste("scores", 1:6, sep = ""), "preds")
 predict(m1 , newdata=newD)

Marc in the box · Answer 2 · 2012-04-20T07:17:13+0000

I had a similar problem in the past. I solved it by referring to my variables by name instead of by column number. For example, do not use gc.pc[, 1]; instead, convert the gc.pc matrix to a data frame and give the columns names ("PC1", "PC2", etc.). Then make sure your newdata uses the same names (and is also a data frame).
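A rough sketch of that suggestion, assuming principal component scores from princomp() as in the question. The objects X, train, Xnew and fit, and the three-component choice, are made up for illustration:

```r
set.seed(42)
X <- matrix(runif(60), 20, 3)            # hypothetical training predictors
y <- runif(20)
pc <- princomp(X)

# Put the scores in a data frame and give the columns plain names
train <- data.frame(y, pc$scores[, 1:3])
names(train) <- c("y", "PC1", "PC2", "PC3")

# Refer to the predictors by name and pass data=
fit <- lm(y ~ PC1 + PC2 + PC3, data = train)

# Project new observations onto the same components, then reuse the names
Xnew <- matrix(runif(15), 5, 3)
newD <- data.frame(predict(pc, newdata = Xnew)[, 1:3])
names(newD) <- c("PC1", "PC2", "PC3")

p <- predict(fit, newdata = newD)
length(p)                                # one prediction per row of newD
```

Because the formula only mentions PC1, PC2 and PC3, predict() can find all of them in newD and no longer falls back to the training objects.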

Mark Egge · Answer 3 · 2015-11-01T04:11:23+0000

I had a similar problem. When my data frame had three or more variables (one response and two or more predictors), I had no problems accessing the columns by their column number. But when there were only two variables in my data frame (one response, one predictor), R gave me many errors, including: 'newdata' had 1 row but variables found have xx rows

Following Marc in the box's answer, I wrote a special case for data frames with only two variables and assigned explicit variable names. This fixed my problem.

To fix my warning, I rewrote:

 lr <- lm(train[ , ncol(train)] ~ ., data = train[ , -ncol(train)])

as:

 if (ncol(train) == 2) {
   colnames(train) <- c('var1', 'var2')
   colnames(test) <- c('var1', 'var2')
   lr <- lm(var2 ~ var1, data = train)
 } else if (ncol(train) > 2) {
   lr <- lm(train[ , ncol(train)] ~ ., data = train[ , -ncol(train)])
 }
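One likely reason the two-column case behaves differently is that train[ , -ncol(train)] then selects a single column, which R drops to a plain vector by default, so data= no longer receives a data frame. A sketch of an alternative that avoids the special case by building the formula from the response's column name. The data and the column names x1 and y are invented here, and the sketch assumes the response is the last column:

```r
set.seed(7)
# Hypothetical two-column training set: one predictor, response last
train <- data.frame(x1 = runif(30))
train$y <- 3 * train$x1 + rnorm(30, sd = 0.1)
test <- data.frame(x1 = runif(4))

# Selecting a single column drops the data frame structure
is.data.frame(train[ , -ncol(train)])    # FALSE

# Build "y ~ ." from the name of the last column and keep data = train whole
resp <- names(train)[ncol(train)]
lr <- lm(as.formula(paste(resp, "~ .")), data = train)

pred <- predict(lr, newdata = test)
length(pred)                             # one prediction per row of test
```

The same call works unchanged with more predictor columns, provided test uses the same predictor names as train.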
