Indeed, you cannot use cross_val_score directly on statsmodels models, because the interface is different: in statsmodels
- training data is passed directly to the constructor
- a separate object holds the result of model estimation
(see the short sketch of the two call conventions right below).
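To make the contrast concrete, here is a minimal sketch of the two call conventions on made-up toy data (the data and variable names are only for illustration):

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# toy data, purely to illustrate the two interfaces
X = np.random.rand(30, 2)
y = np.random.rand(30)

# statsmodels: the data goes into the constructor, and fit() returns a separate results object
results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.params)

# sklearn: the estimator is configured first, the data only appears in fit()
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)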
However, you can write a simple wrapper to make statsmodels look like sklearn:
import statsmodels.api as sm
from sklearn.base import BaseEstimator, RegressorMixin

class SMWrapper(BaseEstimator, RegressorMixin):
    """A universal sklearn-style wrapper for statsmodels regressors"""
    def __init__(self, model_class, fit_intercept=True):
        self.model_class = model_class
        self.fit_intercept = fit_intercept

    def fit(self, X, y):
        if self.fit_intercept:
            X = sm.add_constant(X)
        # statsmodels takes the data in the constructor and keeps the fit result separately
        self.model_ = self.model_class(y, X)
        self.results_ = self.model_.fit()
        return self  # sklearn convention: fit returns the estimator itself

    def predict(self, X):
        if self.fit_intercept:
            X = sm.add_constant(X)
        return self.results_.predict(X)
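A side benefit of keeping the fitted results in results_ is that the usual statsmodels output stays accessible after fitting through the wrapper. A small sketch, again on made-up data:

import numpy as np
import statsmodels.api as sm

# made-up data, only to show that the statsmodels results object is still there
X = np.random.rand(100, 2)
y = X @ np.array([1.0, 2.0]) + np.random.rand(100)

wrapper = SMWrapper(sm.OLS).fit(X, y)
print(wrapper.results_.summary())  # full statsmodels regression summary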
This class has the correct fit and predict methods, so it can be used with sklearn, e.g. cross-validated or included in a pipeline. Like here:
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

X, y = make_regression(random_state=1, n_samples=300, noise=100)

print(cross_val_score(SMWrapper(sm.OLS), X, y, scoring='r2'))
print(cross_val_score(LinearRegression(), X, y, scoring='r2'))
You can see that the output of the two models is identical, because they are both OLS models, cross-validated in the same way.
[0.28592315 0.37367557 0.47972639]
[0.28592315 0.37367557 0.47972639]
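And since the wrapper is a regular sklearn estimator, it can also be dropped into a pipeline, as mentioned above. A minimal sketch (the StandardScaler step and the step names are arbitrary choices, just to show the idea):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_regression
import statsmodels.api as sm

X, y = make_regression(random_state=1, n_samples=300, noise=100)

# scale the features first, then fit the wrapped statsmodels OLS
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('ols', SMWrapper(sm.OLS)),
])
print(cross_val_score(pipe, X, y, scoring='r2'))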