I am trying to use Spark ML DecisionTreeClassifier in a Pipeline without StringIndexer, because my label is already indexed as (0.0; 1.0). DecisionTreeClassifier requires double values for its label, so this code should work:
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.classification.DecisionTreeClassifier

def trainDecisionTreeModel(training: RDD[LabeledPoint], sqlc: SQLContext): Unit = {
  import sqlc.implicits._
  // format of this dataframe: [label: double, features: vector]
  val trainingDF = training.toDF()
  val featureIndexer = new VectorIndexer()
    .setInputCol("features")
    .setOutputCol("indexedFeatures")
    .setMaxCategories(4)
    .fit(trainingDF)
  val dt = new DecisionTreeClassifier()
    .setLabelCol("label")
    .setFeaturesCol("indexedFeatures")
  val pipeline = new Pipeline()
    .setStages(Array(featureIndexer, dt))
  pipeline.fit(trainingDF)
}
But instead I get:
java.lang.IllegalArgumentException: DecisionTreeClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.
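As far as I understand, the error means the "label" column is missing the ML attribute metadata that tells the classifier how many classes there are (which StringIndexer normally adds). Would attaching that metadata by hand be a reasonable way around it? A rough sketch of what I have in mind (the NominalAttribute usage here is an assumption on my part):

import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.functions.col

// Sketch: declare the label column as nominal with two classes,
// so DecisionTreeClassifier can read the number of classes from the metadata.
val labelMeta = NominalAttribute.defaultAttr
  .withName("label")
  .withNumValues(2)
  .toMetadata()

val trainingWithMeta = trainingDF.withColumn("label", col("label").as("label", labelMeta))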
Of course, I can just add a StringIndexer and let it work on my double "label" field, but I want to use the rawPrediction output column of DecisionTreeClassifier to get the probability of 0.0 and 1.0 for each row, for example:
import org.apache.spark.mllib.linalg.Vector

val predictions = model.transform(singletonDF)
// rawPrediction is a Vector column; pull it out of the first (only) row
val rawPrediction = predictions.select("rawPrediction").head.getAs[Vector](0)
val zeroProbability = rawPrediction(0)
val oneProbability = rawPrediction(1)
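(As I understand it, for a decision tree the "probability" column is just rawPrediction normalized to sum to 1, so reading it directly might be equivalent; this is an assumption on my part:)

// Assumption: DecisionTreeClassifier also outputs a "probability" column,
// which should be rawPrediction normalized to sum to 1.
val probability = predictions.select("probability").head.getAs[Vector](0)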
If I put StringIndexer in the Pipeline, I will not know which positions in the rawPrediction vector correspond to my input labels "0.0" and "1.0", because StringIndexer assigns its indices by label frequency, which can change from dataset to dataset.
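To illustrate the concern, this is the kind of StringIndexer I would have to add (my own sketch); its label-to-index mapping depends on the frequencies in the training data:

import org.apache.spark.ml.feature.StringIndexer

// Sketch: StringIndexer orders labels by frequency, so the index given to
// "0.0" vs "1.0" depends on which value occurs more often in the data.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(trainingDF)
// labelIndexer.labels, e.g. Array("1.0", "0.0") if 1.0 is the more frequent label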
Please help me prepare the data for DecisionTreeClassifier without using StringIndexer, or suggest another way to get the probability of my original labels (0.0; 1.0) for each row.