Predict class probabilities in Spark RandomForestClassifier

Question

I built random forest models using ml.classification.RandomForestClassifier. I am trying to extract the predicted probabilities from the models, but I only see the predicted classes, not the probabilities. According to this link, the issue has been resolved, and it leads to this GitHub pull request and this. However, the fix appears to have landed in version 1.5. I use AWS EMR, which provides Spark 1.4.1, and I still don't know how to get the predicted probabilities. If anyone knows how to do this, please share your thoughts or solutions. Thanks!

scala apache-spark apache-spark-mllib

SH Y. Aug 27 '15 at 20:39

3 answers

eliasah · Answer 1 · 2015-08-28T19:07:45+0000

I already answered a similar question before.

Unfortunately, with MLlib you cannot get the per-instance probabilities for classification models up to version 1.4.1.

There are JIRA issues (SPARK-4362 and SPARK-6885) on this exact topic, which are still in progress as I write this answer. However, the work seems to have stalled since November 2014.

There is currently no way to obtain the posterior probability of a prediction with the Naive Bayes model during prediction. This should be made available along with the label.

And here is a note from @sean-owen on the mailing list on a similar topic regarding the Naive Bayes classification algorithm:

This has recently been discussed on this mailing list. You can't get the probabilities right now, but you can hack a bit to get the NaiveBayesModel internal data structures and compute them from there.

Link: source .
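The "hack" described in the quote can be sketched as follows. This is a minimal, Spark-free illustration and not code from the mailing list: it assumes a multinomial model whose log priors and log conditional probabilities are available as plain arrays (in MLlib's NaiveBayesModel these correspond to the `pi` and `theta` fields). The object and method names are hypothetical.

```scala
object NaiveBayesPosterior {
  /** Posterior class probabilities for a multinomial Naive Bayes model.
    *
    * @param logPriors      log P(class), one entry per class
    * @param logLikelihoods log P(feature | class), one row per class
    * @param features       feature values (e.g. counts) for a single instance
    */
  def posterior(
      logPriors: Array[Double],
      logLikelihoods: Array[Array[Double]],
      features: Array[Double]): Array[Double] = {
    // Unnormalized log posterior per class: log P(c) + sum_j x_j * log P(j | c)
    val logScores = logPriors.zip(logLikelihoods).map { case (prior, row) =>
      prior + row.zip(features).map { case (w, x) => w * x }.sum
    }
    // Subtract the maximum before exponentiating to avoid numerical underflow
    val maxScore = logScores.max
    val expScores = logScores.map(s => math.exp(s - maxScore))
    val total = expScores.sum
    // Normalize so the probabilities sum to 1
    expScores.map(_ / total)
  }
}
```

With a trained model, one would presumably call it as `NaiveBayesPosterior.posterior(model.pi, model.theta, vector.toArray)`; the order of the returned probabilities then follows `model.labels`.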

This issue was resolved in Spark 1.5.0. See the JIRA issue for more details.
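For readers already on Spark 1.5.0 or later, a minimal sketch of reading the probabilities looks like this. It assumes an already fitted RandomForestClassificationModel and the default output column names ("prediction" and "probability"); the object and method names are hypothetical.

```scala
import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.sql.DataFrame

object ForestProbabilities {
  // Assumes Spark 1.5.0+ and that the default output column names were not changed.
  def predictWithProbabilities(
      model: RandomForestClassificationModel,
      data: DataFrame): DataFrame = {
    // transform appends the prediction columns to the input DataFrame;
    // "probability" holds a vector with one entry per class.
    model.transform(data).select("prediction", "probability")
  }
}
```

If the probability column was renamed with `setProbabilityCol`, select that name instead.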

As far as AWS is concerned, there is not much you can do right now. One option, if you can set up emr-bootstrap-actions for Spark and configure it for your needs, is to install Spark on AWS using a bootstrap step.

However, this may seem a bit complicated.

Here is what you might need to do:

Update spark/config.file to point to Spark 1.5. Something like:

 +3 1.5.0 python s3://support.elasticmapreduce/spark/install-spark-script.py s3://path.to.your.bucket.spark.installation/spark/1.5.0/spark-1.5.0.tgz

The archive referenced above must be a proper Spark build located in an S3 bucket that you own.
To build Spark yourself, I advise you to read the section on building-spark-for-emr and also the official documentation. That should be it! (I hope I haven't forgotten anything.)

EDIT: Amazon EMR release 4.1.0 offers an updated version of Apache Spark (1.5.0). You can check here for more details.

Holden · Answer 2 · 2015-08-27T21:12:56+0000

Unfortunately, this is not possible in version 1.4.1. If you cannot upgrade, you can extend the random forest class and copy some of the code I added in this pull request, but be sure to switch back to the standard classes once you can upgrade.
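To give a rough idea of what such a workaround computes, here is a minimal, Spark-free sketch of one simple estimate: the fraction of trees voting for each class. It is not the code from the pull request. The trees are hypothetical stand-ins, modeled as plain functions from a feature array to a class index, and all names are illustrative.

```scala
object ForestVotes {
  /** Fraction of trees voting for each class.
    *
    * @param trees      stand-ins for per-tree predictors; each returns a class index as a Double
    * @param numClasses number of classes
    * @param features   feature values for a single instance
    */
  def voteFractions(
      trees: Seq[Array[Double] => Double],
      numClasses: Int,
      features: Array[Double]): Array[Double] = {
    require(trees.nonEmpty, "need at least one tree")
    val votes = Array.fill(numClasses)(0.0)
    // Each tree casts one vote for the class it predicts
    trees.foreach { tree =>
      votes(tree(features).toInt) += 1.0
    }
    // Normalize the vote counts so they sum to 1
    votes.map(_ / trees.size)
  }
}
```

In the older `mllib` API, RandomForestModel exposes its individual trees through `trees`, and each DecisionTreeModel has a public `predict`, so the same counting could presumably be applied there by wrapping each tree's `predict` in a function.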

Jonathan · Answer 3 · 2015-09-30T19:58:59+0000

Spark 1.5.0 is now supported on EMR with the release of emr-4.1.0! You no longer need to use emr-bootstrap-actions, which, by the way, only work on AMI 3.x, not emr-4.x.
