R-squared on test data

I fit a linear regression model on 75% of my dataset, which contains ~11000 observations and 143 variables:

gl.fit <- lm(y[1:ceiling(length(y)*(3/4))] ~ ., data= x[1:ceiling(length(y)*(3/4)),]) #3/4 for training

and got an R^2 of 0.43. Then I tried to predict on my test data, using the remaining rows:

ytest <- y[(ceiling(length(y)*(3/4))+1):length(y)]
x.test <- cbind(1, x[(ceiling(length(y)*(3/4))+1):length(y),]) # the rest for testing
yhat <- as.matrix(x.test) %*% gl.fit$coefficients # calculate the predicted values

Now I would like to calculate R^2 on my test data. Is there an easy way to do this?

thanks

+9
4 answers

There are a couple of issues here. First, this is not really a good way to use lm(...). lm(...) is intended to be used with a data frame, with the formula referencing columns in that data frame. So, assuming your data is in two vectors x and y:

set.seed(1)   # for a reproducible example
x  <- 1:11000
y  <- 3 + 0.1*x + rnorm(11000, sd=1000)
df <- data.frame(x, y)
train <- sample(1:nrow(df), 0.75*nrow(df))   # training set: random sample of 75% of the data
fit   <- lm(y ~ x, data=df[train,])

fit now contains the model based on the training set. Using lm(...) this way lets you, for example, generate predictions without any matrix multiplication.
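For instance, a minimal sketch (the new x values here are made up for illustration):

# predictions on new data via predict(), no matrix algebra needed
new.x <- data.frame(x = c(100, 5000, 10000))   # hypothetical new x values
predict(fit, newdata = new.x)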

The second issue is the definition of R-squared. The conventional definition is:

1 - SS.residuals / SS.total

For the training set, and the training set ONLY,

SS.total = SS.regression + SS.residual

So

SS.regression = SS.total - SS.residual,

and therefore

R.sq = SS.regression / SS.total

so R.sq is the fraction of variability in the data set that is explained by the model, and it will always be between 0 and 1.

You can see it below.

SS.total      <- with(df[train,], sum((y - mean(y))^2))
SS.residual   <- sum(residuals(fit)^2)
SS.regression <- sum((fitted(fit) - mean(df[train,]$y))^2)

SS.total - (SS.regression + SS.residual)
# [1] 1.907349e-06

SS.regression/SS.total     # fraction of variation explained by the model
# [1] 0.08965502
1 - SS.residual/SS.total   # same thing, for the model frame ONLY!!!
# [1] 0.08965502
summary(fit)$r.squared     # both equal R.squared
# [1] 0.08965502

But this does not hold for the test set (i.e., when you make predictions from the model on new data).

test      <- -train
test.pred <- predict(fit, newdata=df[test,])
test.y    <- df[test,]$y

SS.total      <- sum((test.y - mean(test.y))^2)
SS.residual   <- sum((test.y - test.pred)^2)
SS.regression <- sum((test.pred - mean(test.y))^2)

SS.total - (SS.regression + SS.residual)
# [1] 8958890

# NOT the fraction of variability explained by the model
test.rsq <- 1 - SS.residual/SS.total
test.rsq
# [1] 0.0924713

# fraction of variability explained by the model
SS.regression/SS.total
# [1] 0.08956405

There is not much difference in this contrived example, but it is entirely possible to get an R-sq. value less than 0 (when defined this way).

If, for example, the model is a very poor predictor on the test set, then the residuals can actually be larger than the total variation in the test set. This is equivalent to saying that the test set is modeled better by its own mean than by the model derived from the training set.

I noticed that you use the first three quarters of your data as the training set, rather than taking a random sample (as in this example). If the dependence of y on x is non-linear, and the x values are in order, then you could get a negative R-sq with the test set.
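As a hedged illustration of that scenario (all names and numbers below are made up, not from the OP's data), an ordered split plus a non-linear relationship gives a clearly negative test R-squared:

set.seed(2)
x2 <- 1:1000
y2 <- sin(x2/100) + rnorm(1000, sd=0.1)   # non-linear dependence on x
df2 <- data.frame(x=x2, y=y2)
train2 <- 1:750                           # first 3/4, in order (not random)
fit2 <- lm(y ~ x, data=df2[train2,])
pred2 <- predict(fit2, newdata=df2[-train2,])
y2.test <- df2[-train2,]$y
1 - sum((y2.test - pred2)^2)/sum((y2.test - mean(y2.test))^2)
# well below 0: the test set is modeled better by its own mean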

Regarding the OP's comment below, one way to assess the model with a test set is to compare the in-model mean squared error (MSE) to the out-of-model MSE.

mse.train <- summary(fit)$sigma^2
mse.test  <- sum((test.pred - test.y)^2)/(nrow(df) - length(train) - 2)

If we assume that the training and test sets are both normally distributed with the same variance, and have means following the same model formula, then the ratio of the two MSEs should have an F distribution with (n.train-2) and (n.test-2) degrees of freedom. If the MSEs are significantly different by an F-test, then the model does not fit the test data well.
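A sketch of that F-test under those assumptions (n.train and n.test below are just the sizes of the two subsets in this example):

n.train <- length(train)
n.test  <- nrow(df) - n.train
F.ratio <- mse.train/mse.test
# two-sided p-value for the MSE ratio under H0: equal variances
2*min(pf(F.ratio, n.train - 2, n.test - 2),
      pf(F.ratio, n.train - 2, n.test - 2, lower.tail = FALSE))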

Have you plotted your test.y and pred.y vs. x? That by itself will tell you a lot.
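For example (a sketch, reusing the variables from the example above):

# actual vs. predicted on the test set, against x
plot(df[test,]$x, test.y, pch=20, col="grey",
     xlab="x", ylab="y", main="Test set: actual (grey) vs. predicted (red)")
points(df[test,]$x, test.pred, pch=20, col="red")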

+18

Calculating R-squared on the test data is a little tricky, because you have to remember what your baseline is. Your baseline prediction is the mean of your training data.

So, extending the example provided by @jlhoward above:

SS.test.total      <- sum((test.y - mean(df[train,]$y))^2)
SS.test.residual   <- sum((test.y - test.pred)^2)
SS.test.regression <- sum((test.pred - mean(df[train,]$y))^2)

SS.test.total - (SS.test.regression + SS.test.residual)
# [1] 11617720   (not 8958890)

test.rsq <- 1 - SS.test.residual/SS.test.total
test.rsq
# [1] 0.09284556   (not 0.0924713)

# fraction of variability explained by the model
SS.test.regression/SS.test.total
# [1] 0.08907705   (not 0.08956405)

Update: miscTools::rSquared() assumes that R-squared is calculated on the same dataset on which the model was trained, since it computes

 yy <- y - mean(y) 

behind the scenes at line 184 here: https://github.com/cran/miscTools/blob/master/R/utils.R

+5

If you need a function, the miscTools package has an rSquared function.

require(miscTools)
r2 <- rSquared(ytest, resid = ytest - yhat)
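Per the previous answer's caveat, this centers ytest around its own mean, so it is equivalent to the following hand computation (a sketch, using the ytest and yhat from the question):

r2.byhand <- 1 - sum((ytest - yhat)^2)/sum((ytest - mean(ytest))^2)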
+2

When you measure R2 on an (out-of-sample) test set, you lose some aspects of the interpretation of R2:

  • the equivalence SSR total = SSR explained + SSR error
  • The fact that R2 is equal to the square of the correlation between y and the predicted y
  • The fact that R2 is in [0,1]

If you want to use R, I would recommend the function modelr::rsquare. Note that it uses the total SSR from the test sample, not from the training sample (as some people seem to recommend).

Here is an example where our training data has only 3 points, so there is a high risk of getting a poor model, and hence poor out-of-sample performance. Indeed, you can see that R2 is negative!

library(modelr)

train <- mtcars[c(1,3,4),]
test  <- mtcars[-c(1,3,4),]
mod   <- lm(carb ~ drat, data = train)

Compute on the training data:

## train
y_train <- train$carb
SSR_y_train <- sum((y_train - mean(y_train))^2)

cor(fitted(mod), y_train)^2
#> [1] 0.2985092
rsquare(mod, train)
#> [1] 0.2985092
1 - sum(residuals(mod)^2)/SSR_y_train
#> [1] 0.2985092

Compute on the test data:

## test
pred_test <- predict(mod, newdata = test)
y_test <- test$carb
SSR_y_test <- sum((y_test - mean(y_test))^2)

cor(pred_test, y_test)^2
#> [1] 0.01737236
rsquare(mod, test)
#> [1] -0.6769549
1 - 28*var(pred_test - y_test)/SSR_y_train
#> [1] -19.31621
1 - 28*var(pred_test - y_test)/SSR_y_test
#> [1] -0.6769549
0

Source: https://habr.com/ru/post/974896/

