Saving a model in one version of scikit-learn and loading it in another is generally impossible. The reason is simple: you pickle `Class1` with one definition and then try to unpickle it into `Class2` with a different definition.
You can:

- Try sticking to one version of sklearn.
- Ignore the warnings and hope that what worked for `Class1` will also work for `Class2`.
- Write your own class that can serialize your `GradientBoostingClassifier` and restore it from this serialized form, and hope it works better than a pickle.
I made an example of how you can convert a single `DecisionTreeRegressor` into a pure list-and-dict format that is fully JSON-compatible, and restore it afterwards:
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_classification

### Code to serialize and deserialize trees
LEAF_ATTRIBUTES = ['children_left', 'children_right', 'threshold', 'value',
                   'feature', 'impurity', 'weighted_n_node_samples']
TREE_ATTRIBUTES = ['n_classes_', 'n_features_', 'n_outputs_']

def serialize_tree(tree):
    """ Convert a sklearn.tree.DecisionTreeRegressor into a json-compatible format """
    encoded = {
        'nodes': {},
        'tree': {},
        'n_leaves': len(tree.tree_.threshold),
        'params': tree.get_params()
    }
    for attr in LEAF_ATTRIBUTES:
        encoded['nodes'][attr] = getattr(tree.tree_, attr).tolist()
    for attr in TREE_ATTRIBUTES:
        encoded['tree'][attr] = getattr(tree, attr)
    return encoded

def deserialize_tree(encoded):
    """ Restore a sklearn.tree.DecisionTreeRegressor from a json-compatible format """
    x = np.arange(encoded['n_leaves'])
    tree = DecisionTreeRegressor().fit(x.reshape((-1, 1)), x)
    tree.set_params(**encoded['params'])
    for attr in LEAF_ATTRIBUTES:
        for i in range(encoded['n_leaves']):
            getattr(tree.tree_, attr)[i] = encoded['nodes'][attr][i]
    for attr in TREE_ATTRIBUTES:
        setattr(tree, attr, encoded['tree'][attr])
    return tree

## test the code
X, y = make_classification(n_classes=3, n_informative=10)
tree = DecisionTreeRegressor().fit(X, y)
encoded = serialize_tree(tree)
decoded = deserialize_tree(encoded)
assert (decoded.predict(X) == tree.predict(X)).all()
```
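The point of converting every numpy array with `.tolist()` is that the encoded payload contains only JSON-native types (dicts, lists, numbers, strings, `None`), so it survives a `json.dumps`/`json.loads` round trip without loss. A minimal standalone sketch, using a hypothetical payload in the same shape that `serialize_tree` produces:

```python
import json

# Hypothetical encoded payload in the shape serialize_tree produces:
# only JSON-native types, so the round trip is lossless.
encoded = {
    'n_leaves': 3,
    'params': {'max_depth': None, 'random_state': 0},
    'nodes': {'threshold': [0.5, -2.0, -2.0]},
    'tree': {'n_outputs_': 1},
}

restored = json.loads(json.dumps(encoded))
assert restored == encoded  # nothing was lost or retyped
```

A raw numpy array in the dict would instead raise `TypeError: Object of type ndarray is not JSON serializable`.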
After that, you can go on to serialize and deserialize the entire `GradientBoostingClassifier`:
```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble.gradient_boosting import PriorProbabilityEstimator

def serialize_gbc(clf):
    encoded = {
        'classes_': clf.classes_.tolist(),
        'max_features_': clf.max_features_,
        'n_classes_': clf.n_classes_,
        'n_features_': clf.n_features_,
        'train_score_': clf.train_score_.tolist(),
        'params': clf.get_params(),
        'estimators_shape': list(clf.estimators_.shape),
        'estimators': [],
        'priors': clf.init_.priors.tolist()
    }
    for tree in clf.estimators_.reshape((-1,)):
        encoded['estimators'].append(serialize_tree(tree))
    return encoded

def deserialize_gbc(encoded):
    x = np.array(encoded['classes_'])
    clf = GradientBoostingClassifier(**encoded['params']).fit(x.reshape(-1, 1), x)
    trees = [deserialize_tree(tree) for tree in encoded['estimators']]
    clf.estimators_ = np.array(trees).reshape(encoded['estimators_shape'])
    clf.init_ = PriorProbabilityEstimator()
    clf.init_.priors = np.array(encoded['priors'])
    clf.classes_ = np.array(encoded['classes_'])
    clf.train_score_ = np.array(encoded['train_score_'])
    clf.max_features_ = encoded['max_features_']
    clf.n_classes_ = encoded['n_classes_']
    clf.n_features_ = encoded['n_features_']
    return clf
```
This works for scikit-learn v0.19, but don't ask me what will happen in future versions to break this code. I am not a prophet, nor a sklearn developer.
If you want to be completely independent of new sklearn versions, the safest thing is to write a function that traverses the serialized tree and makes a prediction instead of re-creating the sklearn tree.
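Such a traversal needs nothing from sklearn: in sklearn's node arrays, a `children_left` value of `-1` marks a leaf, and the leaf's prediction sits in `value`. A minimal sketch, assuming the dict layout produced by `serialize_tree` above (the `toy` tree here is hand-built for illustration):

```python
def predict_serialized(encoded, x):
    """Predict one sample x (a sequence of feature values) by walking
    the serialized node arrays; no sklearn needed at prediction time."""
    nodes = encoded['nodes']
    i = 0
    while nodes['children_left'][i] != -1:  # -1 marks a leaf in sklearn trees
        if x[nodes['feature'][i]] <= nodes['threshold'][i]:
            i = nodes['children_left'][i]
        else:
            i = nodes['children_right'][i]
    return nodes['value'][i][0][0]

# Hand-built example in the same format: one split on feature 0 at 0.5;
# the left leaf predicts 1.0, the right leaf predicts 2.0.
toy = {
    'nodes': {
        'children_left':  [1, -1, -1],
        'children_right': [2, -1, -1],
        'feature':        [0, -2, -2],   # -2 is sklearn's placeholder for leaves
        'threshold':      [0.5, -2.0, -2.0],
        'value':          [[[1.5]], [[1.0]], [[2.0]]],
    }
}

print(predict_serialized(toy, [0.2]))  # -> 1.0
print(predict_serialized(toy, [0.9]))  # -> 2.0
```

For the boosted ensemble you would sum these per-tree outputs (scaled by the learning rate) and add the serialized priors, mirroring what `deserialize_gbc` reconstructs.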