If I understand correctly, you do not want to convert one categorical column into multiple dummy columns. You want Spark to understand that the column is categorical, not numerical.
I think it depends on the algorithm you want to use. For example, RandomForest and GBT both take categoricalFeaturesInfo as a parameter; check it here:
https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$
For example:
categoricalFeaturesInfo = Map[Int, Int]((1,2),(2,5))
actually says that the second column of your features (indices start at 0, so 1 is the second column) is categorical with 2 levels, and that the third column is categorical with 5 levels. You can specify these parameters when you train your RandomForest or GBT.
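To make that concrete, here is a rough sketch of how the map is passed to RandomForest.trainClassifier; the trainingData RDD and all hyperparameter values below are just placeholders:

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// trainingData: RDD[LabeledPoint] with feature 1 taking values in {0, 1}
// and feature 2 taking values in {0, 1, 2, 3, 4}
def train(trainingData: RDD[LabeledPoint]) = {
  val categoricalFeaturesInfo = Map[Int, Int]((1, 2), (2, 5))
  RandomForest.trainClassifier(
    trainingData,
    2,                       // numClasses
    categoricalFeaturesInfo,
    50,                      // numTrees
    "auto",                  // featureSubsetStrategy
    "gini",                  // impurity
    5,                       // maxDepth
    32)                      // maxBins
}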
You need to make sure your levels are mapped to 0, 1, 2, ..., so if you have something like ("good", "average", "bad"), map it to (0, 1, 2); see the sketch below.
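A minimal sketch of that mapping, with made-up field names (rawData is a hypothetical RDD of (quality, price) pairs):

import org.apache.spark.mllib.linalg.Vectors

// Hypothetical rawData: RDD[(String, Double)]
val levels = Map("good" -> 0.0, "average" -> 1.0, "bad" -> 2.0)
val features = rawData.map { case (quality, price) =>
  Vectors.dense(levels(quality), price)   // e.g. ("bad", 3.5) -> [2.0, 3.5]
}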
Now, in your case you want to use LogisticRegressionWithLBFGS. Here my suggestion is to actually convert the categorical columns to dummy columns: for example, one column with three levels ("good", "medium", "bad") becomes three 1/0 columns, depending on which level the row falls into. I don't have an example to work with, so here is sample Scala code that should work:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.callUDF
import org.apache.spark.sql.types.DoubleType

// Returns 1.0 when the column value equals the given level, 0.0 otherwise
val index = (value: Double) => (a: Double) => if (value == a) 1.0 else 0.0

// For every listed categorical column, add one 0/1 dummy column per level
val dummygen = (data: DataFrame, cols: Array[String]) => {
  var temp = data
  for (i <- 0 until cols.length) {
    val N = data.select(cols(i)).distinct.count.toInt   // number of levels
    for (j <- 0 until N)
      temp = temp.withColumn(cols(i) + "_" + j.toString,
        callUDF(index(j.toDouble), DoubleType, data(cols(i))))
  }
  temp
}
which you can call like this:
val results = dummygen(data, Array("CategoricalColumn1","CategoricalColumn2"))
Here I do it for a list of categorical columns (in case there is more than one in your feature set). The first for loop goes through each categorical column, and the second for loop goes through each level of that column, creating a number of new columns equal to the number of levels.
Important!!! This assumes you have first mapped the levels to 0, 1, 2, ...
You can then train your LogisticRegressionWithLBFGS on this new set of features. This approach also helps with SVM.
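As a final rough sketch (assuming results is the DataFrame returned above, "label" is your hypothetical target column, and featureCols lists the generated dummy columns plus any numeric ones):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val featureCols = Array("CategoricalColumn1_0", "CategoricalColumn1_1", "CategoricalColumn1_2")
val trainingData = results.map { row =>
  LabeledPoint(
    row.getAs[Double]("label"),
    Vectors.dense(featureCols.map(c => row.getAs[Double](c))))
}
val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(trainingData)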