How to convert a categorical variable in Spark into a set of columns encoded as {0,1}?

I am trying to run a logistic regression (LogisticRegressionWithLBFGS) with Spark MLlib (in Scala) on a dataset that contains categorical variables. It turns out Spark cannot work with such variables directly.

R has a simple way to deal with this problem: I convert the variable to a factor (categorical), and R creates a set of columns encoded as indicator variables {0,1}.

How can I accomplish this with Spark?

+6
4 answers

Using VectorIndexer, you can tell the indexer the number of distinct values (cardinality) a field may have in order to be treated as categorical, via the setMaxCategories() method.

 import org.apache.spark.ml.feature.VectorIndexer

 val indexer = new VectorIndexer()
   .setInputCol("features")
   .setOutputCol("indexed")
   .setMaxCategories(10)
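
A minimal sketch of applying it, where data is a placeholder DataFrame (not from the question) that already has an assembled Vector column named "features":

 // `data` is a hypothetical DataFrame with an assembled Vector column "features"
 val indexerModel = indexer.fit(data)             // decides which features are categorical
 val indexedData  = indexerModel.transform(data)  // adds the "indexed" vector column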

From the Scaladocs:

Class for indexing categorical feature columns in a dataset of Vector.

It has 2 usage modes:

Automatically identify categorical features (default behavior)

This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based upon the maxCategories parameter.

Set maxCategories to the maximum number of categories any categorical feature should have.

For example: Feature 0 has unique values {-1.0, 0.0}, and feature 1 has values {1.0, 3.0, 5.0}. If maxCategories = 2, then feature 0 will be declared categorical and use indices {0, 1}, and feature 1 will be declared continuous.

I find this a convenient (albeit crude) way to extract categorical features, but be careful if you have a low-arity field that you nevertheless want to be treated as continuous (for example, the age of college students versus country of origin or US state).

+3

If I understand correctly, you do not want to convert one categorical column into multiple dummy columns; you want Spark to understand that the column is categorical, not numerical.

I think it depends on the algorithm you want to use. For example, Random Forest and GBT both take categoricalFeaturesInfo as a parameter; see:

https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$

For example:

 categoricalFeaturesInfo = Map[Int, Int]((1, 2), (2, 5))

This says that the second column of your features (the index starts at 0, so 1 is the second column) is categorical with 2 levels, and the third is a categorical feature with 5 levels. You can specify these parameters when you train your RandomForest or GBT.

You need to make sure your levels are mapped to 0, 1, 2, ..., so if you have something like ("good", "average", "bad"), map it to (0, 1, 2).
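
For illustration, here is a minimal sketch of passing these parameters to RandomForest.trainClassifier; the toy data and parameter values are mine, and it assumes a spark-shell session where sc is in scope:

 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.tree.RandomForest

 // toy data (mine): feature 0 is continuous, feature 1 has levels {0, 1}, feature 2 has levels {0..4}
 val trainingData = sc.parallelize(Seq(
   LabeledPoint(1.0, Vectors.dense(2.5, 0.0, 3.0)),
   LabeledPoint(0.0, Vectors.dense(1.0, 1.0, 4.0)),
   LabeledPoint(1.0, Vectors.dense(3.2, 0.0, 0.0))))

 // feature 1 is categorical with 2 levels, feature 2 with 5 levels
 val categoricalFeaturesInfo = Map[Int, Int]((1, 2), (2, 5))

 val model = RandomForest.trainClassifier(
   trainingData,
   numClasses = 2,
   categoricalFeaturesInfo = categoricalFeaturesInfo,
   numTrees = 10,
   featureSubsetStrategy = "auto",
   impurity = "gini",
   maxDepth = 5,
   maxBins = 32)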

Now, in your case you want to use LogisticRegressionWithLBFGS, so my suggestion is to actually convert the categorical columns to dummy columns. For example, one column with three levels ("good", "average", "bad") becomes 3 columns of 1/0 values, depending on which level the row falls into. I don't have your data to work with, so here is sample Scala code that should work:

 import org.apache.spark.sql.DataFrame
 import org.apache.spark.sql.functions.callUDF
 import org.apache.spark.sql.types.DoubleType

 // returns a function that is 1.0 when the value equals the given level, else 0.0
 val index = (value: Double) => (a: Double) => if (value == a) 1.0 else 0.0

 // for each categorical column, add one 0/1 column per level
 val dummygen = (data: DataFrame, col: Array[String]) => {
   var temp = data
   for (i <- 0 until col.length) {
     val N = data.select(col(i)).distinct.count.toInt
     for (j <- 0 until N)
       temp = temp.withColumn(
         col(i) + "_" + j.toString,
         callUDF(index(j.toDouble), DoubleType, data(col(i))))
   }
   temp
 }

Which you can call like this:

 val results = dummygen(data, Array("CategoricalColumn1","CategoricalColumn2")) 

Here I do it for a list of categorical columns (in case there is more than one in your feature list). The first loop goes through each categorical column; the second loop goes through each level in that column and creates as many 0/1 columns as there are levels.

Important: this assumes that you have already mapped the levels to 0, 1, 2, ...

You can then run your LogisticRegressionWithLBFGS on this new set of features. This approach also helps with SVMs.
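
A minimal sketch of that last step, assuming the DataFrame produced above is called results, that it has a numeric "label" column, and that the dummy column names listed below are the ones dummygen generated (all of these names are placeholders):

 import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.regression.LabeledPoint

 // placeholder column names; use whatever dummygen actually produced
 val featureCols = Array("CategoricalColumn1_0", "CategoricalColumn1_1", "CategoricalColumn1_2")

 // build an RDD[LabeledPoint] from the label column and the dummy columns
 val training = results.map { row =>
   LabeledPoint(
     row.getAs[Double]("label"),
     Vectors.dense(featureCols.map(c => row.getAs[Double](c))))
 }

 val lrModel = new LogisticRegressionWithLBFGS()
   .setNumClasses(2)
   .run(training)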

+2

VectorIndexer is coming in Spark 1.4, and it can help you with this feature conversion: http://people.apache.org/~pwendell/spark-1.4.0-rc1-docs/api/scala/index.html#org.apache.spark.ml.feature.VectorIndexer

However, it looks like it will only be available in spark.ml, not in mllib; see:

https://issues.apache.org/jira/browse/SPARK-4081

+1

If categories can fit into driver memory, here is my suggestion:

 import org.apache.spark.ml.feature.StringIndexer
 import org.apache.spark.sql.functions._
 import org.apache.spark.sql._
 // assumes the sqlContext implicits are in scope (e.g. in spark-shell) for toDF and $-syntax

 val df = Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"), (6, "c"), (7, "d"), (8, "b"))
   .toDF("id", "category")

 // index the string column to numeric category indices
 val indexer = new StringIndexer()
   .setInputCol("category")
   .setOutputCol("categoryIndex")
   .fit(df)
 val indexed = indexer.transform(df)

 // collect the category -> index mapping to the driver
 val categoriesIndecies = indexed.select("category", "categoryIndex").distinct
 val categoriesMap: scala.collection.Map[String, Double] =
   categoriesIndecies.map(x => (x(0).toString, x(1).toString.toDouble)).collectAsMap()

 // UDF that emits 1 when the row's category matches the expected index, else 0
 def getCategoryIndex(catMap: scala.collection.Map[String, Double], expectedValue: Double) =
   udf((columnValue: String) => if (catMap(columnValue) == expectedValue) 1 else 0)

 // add one 0/1 column per category
 val newDf: DataFrame = categoriesMap.keySet.toSeq.foldLeft[DataFrame](indexed)(
   (acc, c) => acc.withColumn(c, getCategoryIndex(categoriesMap, categoriesMap(c))($"category"))
 )

 newDf.show

 +---+--------+-------------+---+---+---+---+
 | id|category|categoryIndex|  b|  d|  a|  c|
 +---+--------+-------------+---+---+---+---+
 |  0|       a|          0.0|  0|  0|  1|  0|
 |  1|       b|          2.0|  1|  0|  0|  0|
 |  2|       c|          1.0|  0|  0|  0|  1|
 |  3|       a|          0.0|  0|  0|  1|  0|
 |  4|       a|          0.0|  0|  0|  1|  0|
 |  5|       c|          1.0|  0|  0|  0|  1|
 |  6|       c|          1.0|  0|  0|  0|  1|
 |  7|       d|          3.0|  0|  1|  0|  0|
 |  8|       b|          2.0|  1|  0|  0|  0|
 +---+--------+-------------+---+---+---+---+
0
