I already answered a similar question before.
Unfortunately, with MLlib you cannot get per-instance probabilities from classification models in versions up to 1.4.1.
There are JIRA issues (SPARK-4362 and SPARK-6885) on this exact topic, which are In Progress as I write this answer. However, the work seems to have been on hold since November 2014.
There is currently no way to get the posterior probability of a prediction from the Naive Bayes model at prediction time. This should be made available along with the label.
And here is a note from @sean-owen on the mailing list, on a similar topic concerning the Naive Bayes classification algorithm:
This has recently been discussed on this mailing list. You can't get the probabilities right now, but you can hack a bit to get the NaiveBayesModel internal data structures and compute them from there.
Link: source.
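To illustrate the kind of hack described in that note, here is a minimal sketch in plain Python. It assumes a multinomial Naive Bayes model whose internal arrays hold log class priors (called pi here) and log conditional feature probabilities (called theta here), which is how I understand NaiveBayesModel to store them. The function name posterior_probabilities and the toy values are hypothetical and only for illustration; they are not part of Spark.

```python
import math


def posterior_probabilities(pi, theta, features):
    """Return normalized class posteriors for a multinomial Naive Bayes model.

    pi       -- list of log class priors, one per class (assumed log-space)
    theta    -- list of rows, theta[c][j] = log P(feature j | class c)
    features -- list of feature counts, same length as each theta row
    """
    # Unnormalized log-posterior per class: log prior + sum_j x_j * log-likelihood_j
    log_joint = [
        prior + sum(w * x for w, x in zip(row, features))
        for prior, row in zip(pi, theta)
    ]
    # Log-sum-exp trick: subtract the max before exponentiating to avoid underflow
    m = max(log_joint)
    exp_shifted = [math.exp(v - m) for v in log_joint]
    total = sum(exp_shifted)
    return [e / total for e in exp_shifted]


# Toy values standing in for the model's pi and theta (made up, not from a real model)
pi = [math.log(0.5), math.log(0.5)]
theta = [
    [math.log(0.8), math.log(0.2)],
    [math.log(0.2), math.log(0.8)],
]
probs = posterior_probabilities(pi, theta, [1.0, 0.0])
print(probs)
```

With a trained model you would pass its own pi and theta arrays and the feature vector of the instance instead of the toy values; the returned list is ordered the same way as the model's labels.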
This issue was resolved in Spark 1.5.0. See the JIRA issue for more details.
As far as AWS is concerned, there is not much you can do about this right now. One solution may be to use the emr-bootstrap-actions for Spark and configure them for your needs; you can then install Spark on AWS using the bootstrap step.
However, this may seem a bit complicated.
Here is what you may need to do:
Update spark/config.file to install Spark 1.5. Something like:
+3 1.5.0 python s3://support.elasticmapreduce/spark/install-spark-script.py s3://path.to.your.bucket.spark.installation/spark/1.5.0/spark-1.5.0.tgz
The file listed above should be a proper Spark build located in the specified S3 bucket, which you need to own for the time being.
To build your own Spark, I advise you to read the building-spark-for-emr section and also the official documentation. That should be about it! (I hope I haven't forgotten anything.)
EDIT: Amazon EMR release 4.1.0 offers an updated version of Apache Spark (1.5.0). You can check here for more details.