There are two ways to do logistic regression in Spark: spark.ml and spark.mllib.

With DataFrames you can use spark.ml:
import org.apache.spark
import sqlContext.implicits._

// Build a LabeledPoint with a two-feature dense vector
def p(label: Double, a: Double, b: Double) =
  new spark.mllib.regression.LabeledPoint(
    label, new spark.mllib.linalg.DenseVector(Array(a, b)))

val data = sc.parallelize(Seq(p(1.0, 0.0, 0.5), p(0.0, 0.5, 1.0)))
val df = data.toDF

// Fit the model and score the training data
val model = new spark.ml.classification.LogisticRegression().fit(df)
model.transform(df).show
You get the raw predictions and probabilities:
+-----+---------+--------------------+--------------------+----------+
|label| features| rawPrediction| probability|prediction|
+-----+---------+--------------------+--------------------+----------+
| 1.0|[0.0,0.5]|[-19.037302860930...|[5.39764620520461...| 1.0|
| 0.0|[0.5,1.0]|[18.9861466274786...|[0.99999999431904...| 0.0|
+-----+---------+--------------------+--------------------+----------+
With RDDs you can use spark.mllib:
val model = new spark.mllib.classification.LogisticRegressionWithLBFGS().run(data)
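For example (a quick usage sketch, reusing the data RDD from above):

// predict takes a feature vector and returns only the class label as a Double
val pred = model.predict(new spark.mllib.linalg.DenseVector(Array(0.0, 0.5)))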
This model does not expose raw predictions or probabilities. You can take a look at predictPoint: it multiplies the feature vector by the weights and selects the class with the highest score. The weights are publicly available, so you can copy that algorithm and save the raw predictions instead of just returning the winning class.
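Here is a minimal sketch of that idea (the helper names are mine, not part of the API): recompute the margin from the model's public weights and intercept, which is the same dot product predictPoint thresholds, and apply the sigmoid to recover the positive-class probability.

// Hypothetical helpers mirroring predictPoint's binary case:
// margin = dot(weights, features) + intercept
def rawMargin(model: spark.mllib.classification.LogisticRegressionModel,
              features: spark.mllib.linalg.Vector): Double =
  model.weights.toArray.zip(features.toArray)
    .map { case (w, x) => w * x }
    .sum + model.intercept

// Sigmoid of the margin gives the positive-class probability
def probabilityOf(model: spark.mllib.classification.LogisticRegressionModel,
                  features: spark.mllib.linalg.Vector): Double =
  1.0 / (1.0 + math.exp(-rawMargin(model, features)))

Note that LogisticRegressionModel also has a clearThreshold() method; after calling it, predict returns the raw score instead of the thresholded class label.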