caret's confusionMatrix will not work here, because it expects R data frames while your data is in Spark data frames.
One (not recommended) way to get your metrics is to collect your Spark data frame locally into R with as.data.frame, and then use caret etc.; but this works only if your data fits into the main memory of your driver, in which case, of course, you have no reason to be using Spark in the first place...
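For completeness, a minimal sketch of that local approach, assuming predictions is the SparkDataFrame of model predictions built below (this pulls all rows to the driver, so it is only viable for small data):

```r
library(caret)

# NOT recommended for big data: collect the Spark data frame to the driver
local_preds <- as.data.frame(predictions)

# confusionMatrix expects factors with matching levels
lvls <- levels(iris$Species)
confusionMatrix(factor(local_preds$prediction, levels = lvls),
                factor(local_preds$Species,    levels = lvls))
```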
So, here is a way to compute the accuracy in a distributed fashion (i.e. without collecting the data locally), using the iris data as an example:
sparkR.version()
# "2.1.1"

df <- as.DataFrame(iris)
model <- spark.randomForest(df, Species ~ ., "classification", numTrees = 10)
predictions <- predict(model, df)
summary(predictions)
# SparkDataFrame[summary:string, Sepal_Length:string, Sepal_Width:string, Petal_Length:string, Petal_Width:string, Species:string, prediction:string]

createOrReplaceTempView(predictions, "predictions")
correct <- sql("SELECT prediction, Species FROM predictions WHERE prediction=Species")
count(correct)
# 149

acc = count(correct)/count(predictions)
acc
# 0.9933333
(As for the 149 correct predictions out of 150 samples: if you do showDF(predictions, numRows=150), you will see that there is one virginica sample misclassified as versicolor.)
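The same distributed approach extends to a full confusion matrix: instead of counting only the matching rows, group by the actual and predicted labels and count each pair. This is a sketch using SparkR's groupBy and count, again assuming the predictions SparkDataFrame from above:

```r
# Full confusion matrix, still without collecting the data locally:
# one row per (actual, predicted) pair with its count
conf <- count(groupBy(predictions, "Species", "prediction"))
showDF(conf)
```

Each row of conf gives one cell of the confusion matrix; the off-diagonal rows (where Species != prediction) are the misclassifications.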