I trained a random forest model in Python with scikit-learn and would like to apply it to a large dataset with PySpark.
First, I loaded the fitted scikit-learn RF model (using joblib), loaded my data that contains the features into a Spark DataFrame, and then added a prediction column with a user-defined function like this:
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def predictClass(features):
        # Predict a single row; sklearn expects a 2D input, hence the wrapping list
        return str(rf.predict([features])[0])

    udfFunction = udf(predictClass, StringType())
    new_dataframe = dataframe.withColumn('prediction', udfFunction('features'))
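For context, the loading steps I mention above look roughly like this; the file names and the parquet format are placeholders, not my actual setup:

    import joblib
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Load the fitted scikit-learn random forest (path is a placeholder)
    rf = joblib.load('rf_model.joblib')

    # Load the data that contains the features (format/path are placeholders)
    dataframe = spark.read.parquet('data.parquet')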
This takes a very long time to run. Is there a more efficient way to do the same thing (without using Spark ML)?
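One direction I'm wondering about is broadcasting the model once and predicting a whole partition at a time instead of row by row. A minimal sketch of what I mean, assuming the 'features' column is an array of doubles (the helper names here are mine, not from any library):

    # Broadcast the fitted model so each executor deserializes it only once
    broadcast_rf = spark.sparkContext.broadcast(rf)

    def predict_partition(rows):
        rows = list(rows)
        if rows:
            model = broadcast_rf.value
            # Build one feature matrix per partition and predict in a single call
            features = [row['features'] for row in rows]
            for row, pred in zip(rows, model.predict(features)):
                yield tuple(row) + (str(pred),)

    new_dataframe = dataframe.rdd.mapPartitions(predict_partition) \
        .toDF(dataframe.columns + ['prediction'])

Would something along these lines be the right way to speed this up?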