What is the way to represent factor variables in scikit-learn when using Random Forests?

I am solving a classification problem with random forests, and I decided to use the scikit-learn Python library for it. However, I am new to both the Random Forest algorithm and this tool. My data contains many factor (categorical) variables. I searched Google and found out that simply assigning numerical codes to factor variables, as one might do in linear regression, is not the right approach, since the algorithm would treat them as continuous variables and give wrong results. But I could not find anything about how to deal with factor variables in scikit-learn. Please tell me which options to use, or point me to a document where I can find this.

2 answers

If you are using a pandas DataFrame, you can easily use the get_dummies function to accomplish this. Here is an example:

import pandas as pd

my_data = [['a', 'b'], ['b', 'a'], ['c', 'b'], ['d', 'a'], ['a', 'c']]
df = pd.DataFrame(my_data, columns=['var1', 'var2'])

dummy_ranks = pd.get_dummies(df['var1'], prefix='var1_')
print(dummy_ranks)

Output:

   var1__a  var1__b  var1__c  var1__d
0        1        0        0        0
1        0        1        0        0
2        0        0        1        0
3        0        0        0        1
4        1        0        0        0

[5 rows x 4 columns]
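Since the question is about random forests, here is a small hedged follow-up showing how those dummy columns could be fed into a RandomForestClassifier. The target vector y below is invented purely so the sketch runs end to end:

from sklearn.ensemble import RandomForestClassifier

# Encode all categorical columns of the example DataFrame from above;
# the target y is made up just for illustration.
X = pd.get_dummies(df[['var1', 'var2']])
y = [0, 1, 0, 1, 0]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict(X.iloc[:2]))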

You should use sklearn's OneHotEncoder. It creates a new binary variable for each distinct value of your categorical integer feature.

So, for example, if you have a variable var with values [10, 25, 30], it will create three new variables (i.e. a matrix with 3 columns), essentially var_10, var_25 and var_30, with values [1, 0, 0], [0, 1, 0] and [0, 0, 1] respectively.
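A minimal sketch of that, assuming a reasonably recent scikit-learn version and using the example values from above:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical integer feature with the example values 10, 25 and 30
var = np.array([[10], [25], [30], [10]])

encoder = OneHotEncoder()               # one binary column per distinct value
encoded = encoder.fit_transform(var)    # returns a sparse matrix by default

print(encoder.categories_)   # [array([10, 25, 30])]
print(encoded.toarray())
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]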


Source: https://habr.com/ru/post/1480092/

