`RandomForestClassifier`, like many other ML algorithms in Spark, requires specific metadata on the label column: label values must be integers from [0, 1, 2, ..., #classes) represented as doubles. This metadata is normally set by upstream Transformers such as `StringIndexer`. Because you convert the labels manually, no metadata fields are attached, and the classifier cannot verify that these requirements are met.
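To make the requirement concrete, here is a minimal pure-Python sketch (no Spark involved; `encode_labels` is a hypothetical helper, not a Spark API) of the recoding the classifier expects: each distinct label mapped to a consecutive double in [0, #classes):

```python
def encode_labels(raw_labels):
    # Map each distinct label to a consecutive index 0, 1, ..., k-1,
    # represented as a double -- the encoding RandomForestClassifier expects.
    classes = sorted(set(raw_labels))
    index = {c: float(i) for i, c in enumerate(classes)}
    return [index[label] for label in raw_labels]

encoded = encode_labels(["cat", "dog", "cat", "bird"])
# sorted classes: ["bird", "cat", "dog"] -> [1.0, 2.0, 1.0, 0.0]
```

Doing only this recoding, as in the question, produces valid values but still no metadata, which is why the fit below fails.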
```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.Vectors

val df = Seq(
  (0.0, Vectors.dense(1, 0, 0, 0)),
  (1.0, Vectors.dense(0, 1, 0, 0)),
  (2.0, Vectors.dense(0, 0, 1, 0)),
  (2.0, Vectors.dense(0, 0, 0, 1))
).toDF("label", "features")

val rf = new RandomForestClassifier()
  .setFeaturesCol("features")
  .setNumTrees(5)

rf.setLabelCol("label").fit(df)
// java.lang.IllegalArgumentException: RandomForestClassifier was given input ...
```
You can recode the label column using `StringIndexer`:
```scala
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_idx")
  .fit(df)

rf.setLabelCol("label_idx").fit(indexer.transform(df))
```
or set the required metadata manually:
```scala
import org.apache.spark.ml.attribute.NominalAttribute

val meta = NominalAttribute
  .defaultAttr
  .withName("label")
  .withValues("0.0", "1.0", "2.0")
  .toMetadata

rf.setLabelCol("label_meta").fit(
  df.withColumn("label_meta", $"label".as("", meta))
)
```
Note
Labels created with `StringIndexer` are ordered by frequency, not by value:

```scala
indexer.labels
// Array[String] = Array(2.0, 0.0, 1.0)
```
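As a rough illustration of that ordering, here is a pure-Python sketch (the `indexer_order` helper is hypothetical, not Spark's implementation): the most frequent label gets index 0, so with two rows labelled 2.0 in the example data, "2.0" comes first.

```python
from collections import Counter

def indexer_order(labels):
    # Approximate StringIndexer's default ("frequencyDesc") ordering:
    # most frequent label first. Ties are broken alphabetically here,
    # which may not match Spark's exact tie-breaking.
    counts = Counter(labels)
    return sorted(counts, key=lambda label: (-counts[label], label))

order = indexer_order(["0.0", "1.0", "2.0", "2.0"])
# "2.0" occurs twice, so it receives index 0: ["2.0", "0.0", "1.0"]
```

This is why index 0 in the transformed column does not necessarily correspond to the original label 0.0.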
PySpark
In Python, metadata fields can be set directly in the schema:
```python
from pyspark.sql.types import StructField, DoubleType

StructField(
    "label", DoubleType(), False,
    {"ml_attr": {
        "name": "label",
        "type": "nominal",
        "vals": ["0.0", "1.0", "2.0"]
    }}
)
```