Applying a trained sklearn model to a large dataset with PySpark

I trained a random forest model in Python with scikit-learn and would like to apply it to a large dataset with PySpark.

First, I loaded the saved sklearn RF model (using joblib), loaded my data that contains the features into a Spark DataFrame, and then added a prediction column with a user-defined function like this:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def predictClass(features):
        # rf is the sklearn random forest loaded earlier with joblib
        return rf.predict(features)

    udfFunction = udf(predictClass, StringType())
    new_dataframe = dataframe.withColumn('prediction', udfFunction('features'))

It takes a very long time to start. Is there a more efficient way to do the same thing (without using Spark ML)?

1 answer

A sklearn RF model can be quite large when pickled. It is possible that repeatedly pickling and unpickling the model during task dispatch is causing the problem. You can use a broadcast variable instead; a sketch follows the documentation quote below.

From the Spark documentation:

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
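Here is a minimal sketch of that approach, reusing the names from the question. It assumes rf is the already-loaded sklearn model, spark is an active SparkSession, and the 'features' column holds an array of numeric values (the exact column type is an assumption):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Ship the pickled model to each executor once,
    # instead of with every task closure
    broadcast_rf = spark.sparkContext.broadcast(rf)

    def predictClass(features):
        # Use the worker-local copy of the model; wrap the single
        # feature vector so sklearn receives a 2-D array
        return str(broadcast_rf.value.predict([features])[0])

    udfFunction = udf(predictClass, StringType())
    new_dataframe = dataframe.withColumn('prediction', udfFunction('features'))

With the broadcast in place, the serialized model is transferred to each executor once and cached there, rather than being re-pickled and re-sent with every scheduled task, which should remove the startup overhead you are seeing.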

