RandomForestClassifier was given input with invalid label column error in Apache Spark

I am trying to compute accuracy with 5-fold cross-validation using the random forest classifier model in Scala, but I get the following error:

java.lang.IllegalArgumentException: RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.

The error is thrown at the line val cvModel = cv.fit(trainingData)

The code I used to cross-validate a dataset using a random forest looks like this:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val data = sc.textFile("exprogram/dataset.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(41).toDouble,
    Vectors.dense(parts(0).split(',').map(_.toDouble)))
}

val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)

val trainingData = training.toDF()
val testData = test.toDF()

val nFolds: Int = 5
val NumTrees: Int = 5

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(NumTrees)

val pipeline = new Pipeline()
  .setStages(Array(rf))

val paramGrid = new ParamGridBuilder()
  .build()

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("precision")

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)

val cvModel = cv.fit(trainingData)

val results = cvModel.transform(testData)
  .select("label", "prediction")
  .collect

val numCorrectPredictions = results.map(row =>
  if (row.getDouble(0) == row.getDouble(1)) 1 else 0).foldLeft(0)(_ + _)
val accuracy = 1.0D * numCorrectPredictions / results.size
println("Test set accuracy: %.3f".format(accuracy))

Can someone explain what the error in the above code is?

1 answer

RandomForestClassifier, like many other ML algorithms, requires specific metadata to be set on the label column, and requires label values to be integral values from [0, 1, 2, ..., #classes), represented as doubles. This is usually handled by upstream Transformers such as StringIndexer. Since you convert the labels manually, no metadata fields are set, and the classifier cannot confirm that these requirements are satisfied.

val df = Seq(
  (0.0, Vectors.dense(1, 0, 0, 0)),
  (1.0, Vectors.dense(0, 1, 0, 0)),
  (2.0, Vectors.dense(0, 0, 1, 0)),
  (2.0, Vectors.dense(0, 0, 0, 1))
).toDF("label", "features")

val rf = new RandomForestClassifier()
  .setFeaturesCol("features")
  .setNumTrees(5)

rf.setLabelCol("label").fit(df)
// java.lang.IllegalArgumentException: RandomForestClassifier was given input ...
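The underlying problem is visible in the schema: the manually built label column carries no ml_attr metadata. A quick check (not part of the original answer):

// Empty metadata: the classifier has no way to infer the number of classes.
df.schema("label").metadata
// {}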

You can re-encode the label column using StringIndexer:

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_idx")
  .fit(df)

rf.setLabelCol("label_idx").fit(indexer.transform(df))
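Since the question wires everything through a Pipeline with CrossValidator, the cleanest way to apply this fix there is to make the indexer the first pipeline stage. A minimal sketch reusing the names from the question (NumTrees); the variable names here are illustrative and not part of the original answer:

// Add the indexer as a pipeline stage so CrossValidator fits it
// together with the classifier on each training fold.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_idx")

val rfIndexed = new RandomForestClassifier()
  .setLabelCol("label_idx")
  .setFeaturesCol("features")
  .setNumTrees(NumTrees)

val pipelineIndexed = new Pipeline()
  .setStages(Array(labelIndexer, rfIndexed))

// Remember to point the evaluator at the indexed column as well:
// evaluator.setLabelCol("label_idx")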

Alternatively, you can set the required metadata manually:

import org.apache.spark.ml.attribute.NominalAttribute

val meta = NominalAttribute
  .defaultAttr
  .withName("label")
  .withValues("0.0", "1.0", "2.0")
  .toMetadata

rf.setLabelCol("label_meta").fit(
  df.withColumn("label_meta", $"label".as("", meta))
)
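To confirm that the metadata was actually attached, you can inspect the schema of the derived column (a quick sanity check, not part of the original answer; withMeta is an illustrative name):

// The ml_attr entry should now list the nominal values set above.
val withMeta = df.withColumn("label_meta", $"label".as("", meta))
withMeta.schema("label_meta").metadata
// roughly: {"ml_attr":{"vals":["0.0","1.0","2.0"],"type":"nominal","name":"label"}}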

Note

Labels created with StringIndexer depend on frequency, not value:

indexer.labels
// Array[String] = Array(2.0, 0.0, 1.0)
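A consequence is that predictions made against label_idx are in index space, not in the original label values. One way to map them back (a sketch assuming the fitted indexer above) is IndexToString:

import org.apache.spark.ml.feature.IndexToString

// Translate indexed predictions back into the original label strings,
// reusing the label array learned by the fitted StringIndexer.
val converter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predicted_label")
  .setLabels(indexer.labels)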

PySpark

In PySpark, metadata fields can be set directly on the schema:

from pyspark.sql.types import StructField, DoubleType

StructField(
    "label", DoubleType(), False,
    {"ml_attr": {
        "name": "label",
        "type": "nominal",
        "vals": ["0.0", "1.0", "2.0"]
    }}
)