Why do the `sklearn` and `statsmodels` implementations of OLS regression give different R^2 values?

By chance, I noticed that OLS models implemented with sklearn and statsmodels give different R^2 values when they are fit without an intercept. Otherwise they agree. The following code:

    import numpy as np
    import sklearn
    import statsmodels
    import sklearn.linear_model as sl
    import statsmodels.api as sm

    np.random.seed(42)

    N = 1000
    X = np.random.normal(loc=1, size=(N, 1))
    Y = 2 * X.flatten() + 4 + np.random.normal(size=N)

    sklearnIntercept = sl.LinearRegression(fit_intercept=True).fit(X, Y)
    sklearnNoIntercept = sl.LinearRegression(fit_intercept=False).fit(X, Y)
    statsmodelsIntercept = sm.OLS(Y, sm.add_constant(X))
    statsmodelsNoIntercept = sm.OLS(Y, X)

    print(sklearnIntercept.score(X, Y), statsmodelsIntercept.fit().rsquared)
    print(sklearnNoIntercept.score(X, Y), statsmodelsNoIntercept.fit().rsquared)
    print(sklearn.__version__, statsmodels.__version__)

prints:

    0.78741906105 0.78741906105
    -0.950825182861 0.783154483028
    0.19.1 0.8.0

Where does the difference come from?

This question is different from Linear Regression Coefficients with statsmodels and sklearn, since there sklearn.linear_model.LinearRegression (with an intercept) is fit to X prepared the same way as for statsmodels.api.OLS.

This question also differs from Statsmodels: Calculate fitted values and R squared, since it is about the difference between the two Python packages (statsmodels and scikit-learn), while that question concerns statsmodels and the general definition of R^2. Both happen to share the same answer; whether that alone means the questions should be closed as duplicates was discussed here: Does having the same answer mean that questions should be closed as duplicates?

1 answer

As @user333700 pointed out in the comments, the definition of R^2 for OLS differs between the statsmodels implementation and scikit-learn's.

From the documentation of the RegressionResults class (emphasis mine):

rsquared

R-squared of a model with an intercept. This is defined here as 1 - ssr / centered_tss if the constant is included in the model and 1 - ssr / uncentered_tss if the constant is omitted.
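For the no-intercept model from the question this can be checked by hand. A minimal sketch, reusing X, Y and statsmodelsNoIntercept from the code above:

    # Residual sum of squares and the two flavours of total sum of squares.
    res = statsmodelsNoIntercept.fit()
    ssr = ((Y - res.fittedvalues) ** 2).sum()    # residual sum of squares
    uncentered_tss = (Y ** 2).sum()              # no mean subtraction
    centered_tss = ((Y - Y.mean()) ** 2).sum()   # mean subtracted

    # With no constant in the model, statsmodels uses the uncentered total
    # sum of squares, giving ~0.7832, matching res.rsquared:
    print(1 - ssr / uncentered_tss)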

From the documentation of LinearRegression.score():

score(X, y, sample_weight=None)

Returns the coefficient of determination R^2 of the prediction.

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
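So sklearn always centers the total sum of squares, whether or not an intercept was fitted, which is why the no-intercept score can go negative. A minimal sketch reproducing that score by hand, again reusing the objects from the question's code:

    # sklearn always compares against the centered total sum of squares,
    # i.e. against a constant model that predicts Y.mean().
    y_pred = sklearnNoIntercept.predict(X)
    u = ((Y - y_pred) ** 2).sum()      # residual sum of squares
    v = ((Y - Y.mean()) ** 2).sum()    # centered total sum of squares

    # ~-0.9508, matching sklearnNoIntercept.score(X, Y):
    print(1 - u / v)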


Source: https://habr.com/ru/post/1275405/

