caret's confusionMatrix will not work here, because it expects R data frames while your data is in Spark data frames.
One (not recommended) way to get your metrics is to collect your Spark data frame locally into R with as.data.frame, and then use caret etc.; but this works only if your data fits into the main memory of your driver, in which case, of course, you have no reason to be using Spark in the first place...
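For completeness, a minimal sketch of that local approach, assuming predictions is the SparkDataFrame of model predictions built below (this pulls all rows to the driver, so it is only viable for small data):

```r
library(caret)

# NOT recommended for big data: collect the Spark data frame to the driver
local_preds <- as.data.frame(predictions)

# confusionMatrix expects factors with matching levels
lvls <- levels(iris$Species)
confusionMatrix(factor(local_preds$prediction, levels = lvls),
                factor(local_preds$Species,    levels = lvls))
```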
So, here is a way to compute the accuracy in a distributed fashion (i.e. without collecting the data locally), using the iris data as an example:
sparkR.version()
# "2.1.1"

df <- as.DataFrame(iris)
model <- spark.randomForest(df, Species ~ ., "classification", numTrees = 10)
predictions <- predict(model, df)
summary(predictions)
# SparkDataFrame[summary:string, Sepal_Length:string, Sepal_Width:string, Petal_Length:string, Petal_Width:string, Species:string, prediction:string]

createOrReplaceTempView(predictions, "predictions")
correct <- sql("SELECT prediction, Species FROM predictions WHERE prediction=Species")
count(correct)
# 149

acc = count(correct)/count(predictions)
acc
# 0.9933333
(As for the 149 correct predictions out of 150 samples: if you do showDF(predictions, numRows=150), you will see that there is one virginica sample misclassified as versicolor.)
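The same distributed approach extends to a full confusion matrix: instead of counting only the matching rows, group by the actual and predicted labels and count each pair. This is a sketch using SparkR's groupBy and count, again assuming the predictions SparkDataFrame from above:

```r
# Full confusion matrix, still without collecting the data locally:
# one row per (actual, predicted) pair with its count
conf <- count(groupBy(predictions, "Species", "prediction"))
showDF(conf)
```

Each row of conf gives one cell of the confusion matrix; the off-diagonal rows (where Species != prediction) are the misclassifications.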