There are a couple of issues here. First, this is not a very good way to use lm(...). lm(...) is intended to be used with a data frame, with the formula referring to columns of that data frame. So, if your data is in two vectors x and y:
    set.seed(1)    # for reproducible example
    x  <- 1:11000
    y  <- 3 + 0.1*x + rnorm(11000, sd=1000)
    df <- data.frame(x, y)

    # training set
    train <- sample(1:nrow(df), 0.75*nrow(df))   # random sample of 75% of data
    fit   <- lm(y ~ x, data=df[train,])
fit now holds a model based on the training set. Using lm(...) this way also lets you, for example, generate predictions with predict(...) without doing any matrix multiplication yourself.
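As a minimal sketch (the new x values below are made up purely for illustration), prediction then looks like this:

    # sketch: predictions for new data, no explicit matrix algebra needed
    new.x <- data.frame(x = c(100, 5000, 10000))   # hypothetical new x values
    predict(fit, newdata = new.x)                  # predicted y for each row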
The second problem is the definition of R-squared. The conventional definition is:
R.sq = 1 - SS.residual / SS.total
For the training set, and the training set only,
SS.total = SS.regression + SS.residual
So
SS.regression = SS.total - SS.residual,
and therefore
R.sq = SS.regression / SS.total
So R.sq is the fraction of the variability in the data set that is explained by the model, and it will always be between 0 and 1.
You can see this below:
    SS.total      <- with(df[train,], sum((y - mean(y))^2))
    SS.residual   <- sum(residuals(fit)^2)
    SS.regression <- sum((fitted(fit) - mean(df[train,]$y))^2)
    SS.total - (SS.regression + SS.residual)   # essentially zero on the training set
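As a side check (not part of the code above), SS.regression/SS.total computed this way should agree with the R-squared that summary() reports for the training fit:

    SS.regression / SS.total    # R-squared computed by hand on the training set
    summary(fit)$r.squared      # R-squared reported by lm()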
But this identity does not hold for the test set (for example, when you make predictions from the model on data it was not fit to):
    test      <- -train
    test.pred <- predict(fit, newdata=df[test,])
    test.y    <- df[test,]$y

    SS.total      <- sum((test.y - mean(test.y))^2)
    SS.residual   <- sum((test.y - test.pred)^2)
    SS.regression <- sum((test.pred - mean(test.y))^2)
    SS.total - (SS.regression + SS.residual)   # no longer exactly zero
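If you still want an R-squared-like number for the test set, one common convention (an assumption on my part, since there is no single standard out-of-sample definition) is 1 - SS.residual/SS.total; note that it no longer has to equal SS.regression/SS.total:

    1 - SS.residual/SS.total    # out-of-sample "R-squared"; can be negative for a poor model
    SS.regression/SS.total      # always non-negative, but a different quantity on the test set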
In this contrived example the difference is small, but it is entirely possible to get an R-squared value less than 0 (when defined this way).
If, for example, the model is a very poor predictor on the test set, the residuals can actually be larger than the total variation in the test set. This is equivalent to saying that the test set is modeled better by its own mean than by the model derived from the training set.
I noticed that you use the first three quarters of your data as the training set, rather than taking a random sample (as in this example). If the dependence of y on x is non-linear, and the x's are in order, then you can get a negative R-squared on your test set, as the sketch below illustrates.
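Here is a minimal sketch of that failure mode (my own construction, not taken from the question): a quadratic y, x in order, and the first 75% used for training:

    # sketch: ordered x, non-linear y, first 75% used for training
    set.seed(1)
    x2  <- 1:1000
    y2  <- (x2 - 500)^2 + rnorm(1000, sd=5000)
    df2 <- data.frame(x=x2, y=y2)
    train2 <- 1:750                                # first three quarters, not a random sample
    fit2   <- lm(y ~ x, data=df2[train2,])
    pred2  <- predict(fit2, newdata=df2[-train2,])
    y.test <- df2[-train2,]$y
    1 - sum((y.test - pred2)^2) / sum((y.test - mean(y.test))^2)   # well below 0 here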
Regarding the OP's comment below, one way to assess the model with a test set is to compare the in-model (training) mean squared error to the out-of-model (test) mean squared error (MSE):
    mse.train <- summary(fit)$sigma^2
    mse.test  <- sum((test.pred - test.y)^2)/(nrow(df)-length(train)-2)
If we assume that the training and test sets are both normally distributed with the same variance, and have means that follow the same model formula, then the ratio of these MSEs should have an F-distribution with (n.train-2) and (n.test-2) degrees of freedom. If the MSEs are significantly different according to an F-test, then the model does not fit the test data well.
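A rough sketch of that comparison, under the assumptions just stated (the two-sided p-value construction here is my own, not from the original):

    n.train <- length(train)
    n.test  <- nrow(df) - n.train
    F.ratio <- mse.train / mse.test
    # two-sided p-value under H0: both MSEs estimate the same error variance
    2 * min(pf(F.ratio, n.train - 2, n.test - 2),
            pf(F.ratio, n.train - 2, n.test - 2, lower.tail=FALSE))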
Have you plotted test.y and test.pred against x? That alone will tell you a lot.
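For instance, a quick sketch of such a plot:

    plot(df[test,]$x, test.y, pch=20, col="grey", xlab="x", ylab="y")   # actual test values
    points(df[test,]$x, test.pred, pch=20, col="red")                   # model predictions
    legend("topleft", legend=c("test.y","test.pred"), col=c("grey","red"), pch=20)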