I am trying to use Spark ML DecisionTreeClassifier in a Pipeline without StringIndexer, because my label is already indexed as (0.0; 1.0). DecisionTreeClassifier requires double values for its label, so this code should work:
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.classification.DecisionTreeClassifier

def trainDecisionTreeModel(training: RDD[LabeledPoint], sqlc: SQLContext): Unit = {
  import sqlc.implicits._
  // format of this dataframe: [label: double, features: vector]
  val trainingDF = training.toDF()
  val featureIndexer = new VectorIndexer()
    .setInputCol("features")
    .setOutputCol("indexedFeatures")
    .setMaxCategories(4)
    .fit(trainingDF)
  val dt = new DecisionTreeClassifier()
    .setLabelCol("label")
    .setFeaturesCol("indexedFeatures")
  val pipeline = new Pipeline()
    .setStages(Array(featureIndexer, dt))
  pipeline.fit(trainingDF)
}
But instead I get:
java.lang.IllegalArgumentException: DecisionTreeClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.
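As far as I understand, the error means the "label" column is missing the ML attribute metadata that tells the classifier how many classes there are (which StringIndexer normally adds). Would attaching that metadata by hand be a reasonable way around it? A rough sketch of what I have in mind (the NominalAttribute usage here is an assumption on my part):

import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.functions.col

// Sketch: declare the label column as nominal with two classes,
// so DecisionTreeClassifier can read the number of classes from the metadata.
val labelMeta = NominalAttribute.defaultAttr
  .withName("label")
  .withNumValues(2)
  .toMetadata()

val trainingWithMeta = trainingDF.withColumn("label", col("label").as("label", labelMeta))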
Of course, I can just add a StringIndexer and let it work on my double "label" field, but I want to use the rawPrediction output column of DecisionTreeClassifier to get the probability of 0.0 and 1.0 for each row, for example:
import org.apache.spark.mllib.linalg.Vector

val predictions = model.transform(singletonDF)
// rawPrediction is a Vector column; pull it out of the first (only) row
val rawPrediction = predictions.select("rawPrediction").head.getAs[Vector](0)
val zeroProbability = rawPrediction(0)
val oneProbability = rawPrediction(1)
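(As I understand it, for a decision tree the "probability" column is just rawPrediction normalized to sum to 1, so reading it directly might be equivalent; this is an assumption on my part:)

// Assumption: DecisionTreeClassifier also outputs a "probability" column,
// which should be rawPrediction normalized to sum to 1.
val probability = predictions.select("probability").head.getAs[Vector](0)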
If I put StringIndexer in the Pipeline, I will not know which positions in the rawPrediction vector correspond to my input labels "0.0" and "1.0", because StringIndexer assigns its indices by label frequency, which can change from dataset to dataset.
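To illustrate the concern, this is the kind of StringIndexer I would have to add (my own sketch); its label-to-index mapping depends on the frequencies in the training data:

import org.apache.spark.ml.feature.StringIndexer

// Sketch: StringIndexer orders labels by frequency, so the index given to
// "0.0" vs "1.0" depends on which value occurs more often in the data.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(trainingDF)
// labelIndexer.labels, e.g. Array("1.0", "0.0") if 1.0 is the more frequent label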
Please help me prepare the data for DecisionTreeClassifier without using StringIndexer, or suggest another way to get the probability of my original labels (0.0; 1.0) for each row.