Context: I have a data frame where all categorical values have been indexed using StringIndexer.
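import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.types.{StringType, StructField}

// Names of all String-typed columns
val categoricalColumns = df.schema.collect {
  case StructField(name, StringType, _, _) => name
}

// One StringIndexer per categorical column
val categoryIndexers = categoricalColumns.map { col =>
  new StringIndexer().setInputCol(col).setOutputCol(s"${col}Indexed")
}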
Then I used VectorAssembler to vectorize all feature columns (including the indexed categorical ones).
val assembler = new VectorAssembler()
  .setInputCols(dfIndexed.columns.diff(List("label") ++ categoricalColumns))
  .setOutputCol("features")
After applying the classifier and several additional steps, I end up with a data frame containing label, features and prediction. I would like to split my features vector back into separate columns so that I can convert the indexed values back to their original String form.
val categoryConverters = categoricalColumns.zip(categoryIndexers).map {
  case (col, indexer) =>
    new IndexToString()
      .setInputCol(s"${col}Indexed")
      .setOutputCol(col)
      .setLabels(indexer.fit(df).labels)
}
Question: Is there an easy way to do this, or is there a better approach that somehow attaches the prediction column to the test data frame?
What I tried:
val featureSlicers = categoricalColumns.map { col =>
  new VectorSlicer()
    .setInputCol("features")
    .setOutputCol(s"${col}Indexed")
    .setNames(Array(s"${col}Indexed"))
}
Applying this gives me the columns I want, but they are of Vector type (as expected from VectorSlicer) rather than Double.
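For illustration, a one-element vector column like the ones VectorSlicer produces can be unpacked into a plain Double with a small UDF. This is only a minimal sketch: the toy data frame `sliced`, the helper name `firstElement` and the local SparkSession are assumptions standing in for the real slicer output, not part of my actual job.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("vector-to-double")
  .getOrCreate()

// Hypothetical stand-in for the VectorSlicer output: one-element vectors.
val sliced = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0), Vectors.dense(3.0))
)).toDF("label", "statusIndexed", "artistIndexed")

// Hypothetical helper: returns the first (and only) element of the vector.
val firstElement = udf((v: Vector) => v(0))

// Overwrite each sliced column with its Double value.
val unpacked = Seq("statusIndexed", "artistIndexed").foldLeft(sliced) {
  (acc, name) => acc.withColumn(name, firstElement(col(name)))
}

unpacked.printSchema()
```

With the columns stored as Double, they have the type that IndexToString expects as input. On Spark 3.0 or later, `org.apache.spark.ml.functions.vector_to_array` followed by indexing into the array might be an alternative to the UDF.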
Edit: The desired result is the original data frame (i.e. categorical features as String, not as index) with an additional column holding the predicted label (which in my case is 0 or 1).
For example, let's say that the result of my classifier looked something like this:
+-----+---------+----------+
|label| features|prediction|
+-----+---------+----------+
|  1.0|[0.0,3.0]|       1.0|
+-----+---------+----------+
Using VectorSlicer for each feature, I get:
+-----+---------+----------+-------------+-------------+
|label| features|prediction|statusIndexed|artistIndexed|
+-----+---------+----------+-------------+-------------+
|  1.0|[0.0,3.0]|       1.0|        [0.0]|        [3.0]|
+-----+---------+----------+-------------+-------------+
This is great, but I need:
+-----+---------+----------+-------------+-------------+
|label| features|prediction|statusIndexed|artistIndexed|
+-----+---------+----------+-------------+-------------+
|  1.0|[0.0,3.0]|       1.0|          0.0|          3.0|
+-----+---------+----------+-------------+-------------+
That would let me use IndexToString to convert it to:
+-----+---------+----------+-------------+-------------+
|label| features|prediction|       status|       artist|
+-----+---------+----------+-------------+-------------+
|  1.0|[0.0,3.0]|       1.0|         good|   Pink Floyd|
+-----+---------+----------+-------------+-------------+
or even:
+-----+----------+-------------+-------------+
|label|prediction|       status|       artist|
+-----+----------+-------------+-------------+
|  1.0|       1.0|         good|   Pink Floyd|
+-----+----------+-------------+-------------+
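One possible route to this last table that avoids reversing the features vector at all: ML transformers append columns rather than replace them, so if the original String columns are never dropped, they should still be present next to prediction after `transform`. The sketch below assumes that setup. The toy data, the LogisticRegression stand-in for my classifier and the local SparkSession are all hypothetical.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("keep-original-columns")
  .getOrCreate()

// Hypothetical toy data; column names follow the example tables.
val df = spark.createDataFrame(Seq(
  (1.0, "good", "Pink Floyd"),
  (0.0, "bad", "Unknown Artist"),
  (1.0, "good", "Pink Floyd"),
  (0.0, "bad", "Unknown Artist")
)).toDF("label", "status", "artist")

val categoricalColumns = Seq("status", "artist")

// One StringIndexer per categorical column, as in the question.
val indexers: Seq[PipelineStage] = categoricalColumns.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(s"${c}Indexed")
}

val assembler = new VectorAssembler()
  .setInputCols(categoricalColumns.map(c => s"${c}Indexed").toArray)
  .setOutputCol("features")

// Stand-in classifier (assumption); uses the default "label"/"features" columns.
val classifier = new LogisticRegression()

val stages = (indexers ++ Seq[PipelineStage](assembler, classifier)).toArray
val model = new Pipeline().setStages(stages).fit(df)

// The original String columns survive the pipeline, so just select them.
val wanted = Seq("label", "prediction") ++ categoricalColumns
val result = model.transform(df).select(wanted.map(col): _*)

result.show()
```

This only helps when the data frame passed to `transform` still carries the original columns. If they were dropped in one of the intermediate steps, the indexed values would still have to be recovered from the vector.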