Exporting a scikit-learn Random Forest for use on the Hadoop platform

I have developed a spam classifier using pandas and scikit-learn, to the point where it is ready for integration into our Hadoop-based system. To do this, I need to export my classifier to a more general format than pickling.

Predictive Model Markup Language (PMML) is my preferred export format. It plays very well with Cascading, which we already use. However, I cannot seem to find any Python libraries that export scikit-learn models to PMML.

Does anyone have experience with this use case? Is there an alternative to PMML that would provide interoperability between scikit-learn and Hadoop? Or is there a solid PMML export library?

1 answer

You can use Py2PMML to export the model to PMML and then evaluate it on Hadoop using JPMML-Cascading. JPMML is open source, but Py2PMML from Zementis appears to be a commercial product. Apart from that combination, there are no other tools for evaluating scikit-learn models exported as PMML in Java / Hadoop. The core scikit-learn team is planning to implement a PMML exporter. If you do not want a commercial solution and cannot wait until such a tool is implemented, you still have some options, but they require some coding:

  • Adapt the sklearn-compiledtrees project so that it generates Java / MapReduce code instead of C.
  • Use the export_graphviz function to get the DOT representation of each decision tree, and write a small Java interpreter for it.
  • Forget Java and Hadoop: use Apache Spark and evaluate each of the decision trees in parallel using Python, scikit-learn, and PySpark.
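
To give an idea of how small such an interpreter can be, here is a minimal sketch in plain Python (the same logic ports directly to Java). It assumes each tree has been dumped to a dictionary of flat arrays named after the attributes of scikit-learn's fitted `tree_` object (`children_left`, `children_right`, `feature`, `threshold`, `value`). The two trees below are hand-written illustrations rather than output from a real model, and `value` is simplified to one list of class weights per node. The function names `predict_tree` and `predict_forest` are hypothetical.

```python
import json

TREE_LEAF = -1  # scikit-learn uses -1 as the child index of a leaf node


def predict_tree(tree, x):
    """Walk one tree from the root and return the class weights at the leaf."""
    node = 0
    while tree["children_left"][node] != TREE_LEAF:
        # scikit-learn sends a sample left when feature value <= threshold
        if x[tree["feature"][node]] <= tree["threshold"][node]:
            node = tree["children_left"][node]
        else:
            node = tree["children_right"][node]
    return tree["value"][node]


def predict_forest(trees, x):
    """Average the per-tree class probabilities and return the best class index."""
    n_classes = len(trees[0]["value"][0])
    totals = [0.0] * n_classes
    for tree in trees:
        weights = predict_tree(tree, x)
        weight_sum = sum(weights)
        for i, w in enumerate(weights):
            totals[i] += w / weight_sum
    return max(range(n_classes), key=lambda i: totals[i])


# Hand-written example trees (assumed layout, not exported from a real model).
# Each has a root (node 0) and two leaves (nodes 1 and 2).
tree_a = {
    "children_left": [1, -1, -1],
    "children_right": [2, -1, -1],
    "feature": [0, -2, -2],
    "threshold": [0.5, -2.0, -2.0],
    "value": [[9, 11], [8, 2], [1, 9]],
}
tree_b = {
    "children_left": [1, -1, -1],
    "children_right": [2, -1, -1],
    "feature": [1, -2, -2],
    "threshold": [3.0, -2.0, -2.0],
    "value": [[6, 14], [6, 4], [0, 10]],
}

# JSON is one language-neutral way to ship the arrays to a Hadoop job.
forest = json.loads(json.dumps([tree_a, tree_b]))

print(predict_forest(forest, [0.9, 5.0]))
print(predict_forest(forest, [0.1, 1.0]))
```

With a real model, the arrays would come from each estimator in the forest's `estimators_` list. Serializing them to JSON (or any other neutral format) is only one possible design; parsing the DOT output of export_graphviz, as suggested above, would carry the same information.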

Hope this helps!


Source: https://habr.com/ru/post/970814/
