What is the way to represent factor variables in scikit-learn when using Random Forests?

I am solving a classification problem with random forests, and I decided to use the scikit-learn Python library for it. However, I am new to both the Random Forest algorithm and this tool. My data contains many factor (categorical) variables. I searched Google and found out that simply assigning numerical codes to factor variables, as one might do in linear regression, is not the right approach, since the algorithm would treat them as continuous variables and give wrong results. But I could not find anything about how to deal with factor variables in scikit-learn. Please tell me which options to use, or point me to a document where I can find this.

2 answers

If you are using a pandas DataFrame, you can easily use the get_dummies function to accomplish this. Here is an example:

import pandas as pd

my_data = [['a', 'b'], ['b', 'a'], ['c', 'b'], ['d', 'a'], ['a', 'c']]
df = pd.DataFrame(my_data, columns=['var1', 'var2'])

dummy_ranks = pd.get_dummies(df['var1'], prefix='var1_')
print(dummy_ranks)

Output:

   var1__a  var1__b  var1__c  var1__d
0        1        0        0        0
1        0        1        0        0
2        0        0        1        0
3        0        0        0        1
4        1        0        0        0

[5 rows x 4 columns]
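Since the question is about random forests, here is a small hedged follow-up showing how those dummy columns could be fed into a RandomForestClassifier. The target vector y below is invented purely so the sketch runs end to end:

from sklearn.ensemble import RandomForestClassifier

# Encode all categorical columns of the example DataFrame from above;
# the target y is made up just for illustration.
X = pd.get_dummies(df[['var1', 'var2']])
y = [0, 1, 0, 1, 0]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict(X.iloc[:2]))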

You should use sklearn's OneHotEncoder. It creates a new binary variable for each distinct value of your categorical integer feature.

So, for example, if you have a variable var with values [10, 25, 30], it will create three new variables (i.e. a matrix with 3 columns), essentially var_10, var_25 and var_30, with values [1, 0, 0], [0, 1, 0] and [0, 0, 1] respectively.
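A minimal sketch of that, assuming a reasonably recent scikit-learn version and using the example values from above:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical integer feature with the example values 10, 25 and 30
var = np.array([[10], [25], [30], [10]])

encoder = OneHotEncoder()               # one binary column per distinct value
encoded = encoder.fit_transform(var)    # returns a sparse matrix by default

print(encoder.categories_)   # [array([10, 25, 30])]
print(encoded.toarray())
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]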


Source: https://habr.com/ru/post/1480092/

