Applying a trained sklearn model to a large dataset with PySpark

I trained a random forest model in Python with scikit-learn and would like to apply it to a large dataset with PySpark.

First, I loaded the saved sklearn RF model (using joblib), loaded my data that contains the features into a Spark DataFrame, and then added a prediction column with a user-defined function like this:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def predictClass(features):
        # rf is the sklearn random forest loaded earlier with joblib
        return rf.predict(features)

    udfFunction = udf(predictClass, StringType())
    new_dataframe = dataframe.withColumn('prediction', udfFunction('features'))

It takes a very long time to start. Is there a more efficient way to do the same thing (without using Spark ML)?

1 answer

A sklearn RF model can be quite large when pickled. It is possible that repeatedly pickling and unpickling the model during task dispatch is causing the problem. You can use a broadcast variable instead; a sketch follows the documentation quote below.

From the Spark documentation:

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
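Here is a minimal sketch of that approach, reusing the names from the question. It assumes rf is the already-loaded sklearn model, spark is an active SparkSession, and the 'features' column holds an array of numeric values (the exact column type is an assumption):

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Ship the pickled model to each executor once,
    # instead of with every task closure
    broadcast_rf = spark.sparkContext.broadcast(rf)

    def predictClass(features):
        # Use the worker-local copy of the model; wrap the single
        # feature vector so sklearn receives a 2-D array
        return str(broadcast_rf.value.predict([features])[0])

    udfFunction = udf(predictClass, StringType())
    new_dataframe = dataframe.withColumn('prediction', udfFunction('features'))

With the broadcast in place, the serialized model is transferred to each executor once and cached there, rather than being re-pickled and re-sent with every scheduled task, which should remove the startup overhead you are seeing.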

