Scikit-learn persistence: pickle vs pmml vs ...?

I built a scikit-learn model, and I want to reuse it in a daily Python cron job (NB: no other platforms are involved - no R, no Java, no C).

I pickled it (in fact, I pickled my own object, one field of which is a GradientBoostingClassifier), and I unpickle it in the cron job. So far, so good (and discussed in Save classifier to disk in scikit-learn and Model persistence in Scikit-Learn?).

However, I updated sklearn and now I get the following warnings:

 .../.local/lib/python2.7/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator DecisionTreeRegressor from version 0.18.1 when using version 0.18.2. This might lead to breaking code or invalid results. Use at your own risk.
   UserWarning)
 .../.local/lib/python2.7/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator PriorProbabilityEstimator from version 0.18.1 when using version 0.18.2. This might lead to breaking code or invalid results. Use at your own risk.
   UserWarning)
 .../.local/lib/python2.7/site-packages/sklearn/base.py:315: UserWarning: Trying to unpickle estimator GradientBoostingClassifier from version 0.18.1 when using version 0.18.2. This might lead to breaking code or invalid results. Use at your own risk.
   UserWarning)

What should I do now?

  • I can downgrade to 0.18.1 and pin that version until I can rebuild the model. For various reasons, I find this unacceptable.

  • I can unpickle the file and re-pickle it under the new version. This worked for 0.18.2, but broke with 0.19. joblib doesn't look any better.

  • I would like to save the model in a version-independent ASCII format (e.g., JSON or XML). This is obviously the best solution, but there seems to be no built-in way to do it (see also Sklearn - model persistence without pkl file).

  • I could save the model as PMML, but support for it is spotty at best: I can use sklearn2pmml to save the model (although not easily), and augustus / lightpmmlpredictor to apply (although not load) the model; however, none of them are available from pip directly, which makes deployment a nightmare. Moreover, the augustus and lightpmmlpredictor projects seem dead. Importing PMML models into Python (Scikit-learn) - apparently not possible.

  • A variant of the above: save PMML with sklearn2pmml and use openscoring for scoring. That requires interfacing with an external process. Yuck.

Suggestions?

1 answer

Loading a model across different versions of scikit-learn is generally impossible. The reason is obvious: you pickle Class1 with one definition and want to unpickle it into Class2 with a different definition.
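A toy demonstration of that point, with a hypothetical Model class standing in for Class1/Class2: pickle stores a reference to the class plus the instance state, so unpickling binds the old state to whatever class definition is importable at load time.

```python
import pickle

class Model:                 # "Class1": the definition that existed at pickling time
    def score(self):
        return self.w * 2

m = Model()
m.w = 21
blob = pickle.dumps(m)       # stores a reference to __main__.Model plus {'w': 21}

class Model:                 # "Class2": redefined, as after a library upgrade
    def score(self):
        return self.w + self.bias   # expects an attribute old pickles never stored

restored = pickle.loads(blob)       # binds the old state to the NEW definition
try:
    restored.score()                # AttributeError: 'bias' was never pickled
except AttributeError:
    print("old state does not fit the new class definition")
```

This is exactly what the UserWarning above is hedging against: unpickling succeeds mechanically, but the restored object may no longer satisfy the invariants of the current class.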

You can:

  • Stick to one version of sklearn.
  • Ignore the warnings and hope that what worked for Class1 also works for Class2.
  • Write your own class that can serialize your GradientBoostingClassifier and restore it from that serialized form, and hope it works better than a pickle.

Here is an example of how you can convert a single DecisionTreeRegressor into a pure list-and-dict format, fully JSON-compatible, and restore it afterwards.

 import numpy as np
 from sklearn.tree import DecisionTreeRegressor
 from sklearn.datasets import make_classification

 ### Code to serialize and deserialize trees

 # Per-node attributes of the underlying tree_ object
 LEAF_ATTRIBUTES = ['children_left', 'children_right', 'threshold', 'value',
                    'feature', 'impurity', 'weighted_n_node_samples']
 # Attributes of the estimator itself
 TREE_ATTRIBUTES = ['n_classes_', 'n_features_', 'n_outputs_']

 def serialize_tree(tree):
     """ Convert a sklearn.tree.DecisionTreeRegressor into a json-compatible format """
     encoded = {
         'nodes': {},
         'tree': {},
         'n_leaves': len(tree.tree_.threshold),
         'params': tree.get_params()
     }
     for attr in LEAF_ATTRIBUTES:
         encoded['nodes'][attr] = getattr(tree.tree_, attr).tolist()
     for attr in TREE_ATTRIBUTES:
         encoded['tree'][attr] = getattr(tree, attr)
     return encoded

 def deserialize_tree(encoded):
     """ Restore a sklearn.tree.DecisionTreeRegressor from a json-compatible format """
     # Fit a dummy tree of the right size, then overwrite its internal state
     x = np.arange(encoded['n_leaves'])
     tree = DecisionTreeRegressor().fit(x.reshape((-1, 1)), x)
     tree.set_params(**encoded['params'])
     for attr in LEAF_ATTRIBUTES:
         for i in range(encoded['n_leaves']):
             getattr(tree.tree_, attr)[i] = encoded['nodes'][attr][i]
     for attr in TREE_ATTRIBUTES:
         setattr(tree, attr, encoded['tree'][attr])
     return tree

 ## test the code
 X, y = make_classification(n_classes=3, n_informative=10)
 tree = DecisionTreeRegressor().fit(X, y)
 encoded = serialize_tree(tree)
 decoded = deserialize_tree(encoded)
 assert (decoded.predict(X) == tree.predict(X)).all()
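The .tolist() calls in serialize_tree are what make the result JSON-compatible: numpy arrays are not serializable by the json module on their own, as a quick self-contained check shows.

```python
import json
import numpy as np

arr = np.array([0.5, 1.5])

try:
    json.dumps(arr)              # numpy arrays raise TypeError in json.dumps
except TypeError:
    print("ndarray is not JSON serializable")

print(json.dumps(arr.tolist()))  # plain Python lists serialize fine: [0.5, 1.5]
```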

With that in place, you can go on to serialize and deserialize the whole GradientBoostingClassifier:

 from sklearn.ensemble import GradientBoostingClassifier
 from sklearn.ensemble.gradient_boosting import PriorProbabilityEstimator

 def serialize_gbc(clf):
     encoded = {
         'classes_': clf.classes_.tolist(),
         'max_features_': clf.max_features_,
         'n_classes_': clf.n_classes_,
         'n_features_': clf.n_features_,
         'train_score_': clf.train_score_.tolist(),
         'params': clf.get_params(),
         'estimators_shape': list(clf.estimators_.shape),
         'estimators': [],
         'priors': clf.init_.priors.tolist()
     }
     for tree in clf.estimators_.reshape((-1,)):
         encoded['estimators'].append(serialize_tree(tree))
     return encoded

 def deserialize_gbc(encoded):
     # Fit a dummy classifier, then replace its fitted state
     x = np.array(encoded['classes_'])
     clf = GradientBoostingClassifier(**encoded['params']).fit(x.reshape(-1, 1), x)
     trees = [deserialize_tree(tree) for tree in encoded['estimators']]
     clf.estimators_ = np.array(trees).reshape(encoded['estimators_shape'])
     clf.init_ = PriorProbabilityEstimator()
     clf.init_.priors = np.array(encoded['priors'])
     clf.classes_ = np.array(encoded['classes_'])
     clf.train_score_ = np.array(encoded['train_score_'])
     clf.max_features_ = encoded['max_features_']
     clf.n_classes_ = encoded['n_classes_']
     clf.n_features_ = encoded['n_features_']
     return clf

 # test on the same problem
 clf = GradientBoostingClassifier()
 clf.fit(X, y)
 encoded = serialize_gbc(clf)
 decoded = deserialize_gbc(encoded)
 assert (decoded.predict(X) == clf.predict(X)).all()
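Since the encoded model is nothing but dicts, lists, and numbers, persisting it really is just a json dump and load. A minimal sketch (the file name and helper names are illustrative, and the stand-in dict only mimics the shape of a real encoding):

```python
import json
import os
import tempfile

def save_encoded(encoded, path):
    """Write a serialized-model dict to disk as version-independent JSON."""
    with open(path, 'w') as f:
        json.dump(encoded, f)

def load_encoded(path):
    """Read the dict back; feed the result to deserialize_gbc / deserialize_tree."""
    with open(path) as f:
        return json.load(f)

# Demonstrate the round trip with a stand-in dict of the same shape
encoded = {'params': {'n_estimators': 100}, 'estimators': [], 'priors': [0.3, 0.7]}
path = os.path.join(tempfile.mkdtemp(), 'model.json')
save_encoded(encoded, path)
assert load_encoded(path) == encoded
```

One caveat: get_params() must contain only JSON-friendly values (numbers, strings, None, booleans) for this to round-trip cleanly, which holds for a default GradientBoostingClassifier.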

This works for scikit-learn v0.19, but don't ask me whether future versions will break this code. I am neither a prophet nor an sklearn developer.

If you want to be completely independent of future sklearn versions, the safest thing is to write a function that traverses the serialized tree and makes the prediction itself, instead of re-creating an sklearn tree.


Source: https://habr.com/ru/post/1269760/
