Categorical and numeric features, categorical target - scikit-learn - Python

I have a dataset containing both categorical and numeric columns, and my target column is also categorical. I am using the scikit-learn library with Python 3.4. I know that scikit-learn needs all categorical values to be converted to numeric values before any machine learning approach can be applied.

How can I convert my categorical columns to numeric values? I have tried many things, but I get various errors, such as 'str' object has no attribute 'items' and 'numpy.ndarray' object has no attribute 'items'.

Here is an example of my data:

    UserID  LocationID  AmountPaid  ServiceID  Target
    29876   IS345       23.9876     FRDG       JFD
    29877   IS712       135.98      WERS       KOI

My data set is saved in a CSV file; here is a little code that I wrote to give you an idea of what I want to do:

    # reading my csv file
    import pandas as pd

    data_dir = 'C:/Users/davtalab/Desktop/data/'
    train_file = data_dir + 'train.csv'
    train = pd.read_csv(train_file)

    # numeric columns:
    x_numeric_cols = train['AmountPaid']

    # categorical columns (a comma-separated list, not string concatenation):
    categorical_cols = ['UserID', 'LocationID', 'ServiceID']
    x_cat_cols = train[categorical_cols].as_matrix()   # .values in newer pandas

    y_target = train['Target'].as_matrix()

I need x_cat_cols converted to numeric values and joined with x_numeric_cols, so that I have my full input matrix (x).

Then I also need to convert my target column to numeric values and use it as my label vector (y).

Then I want to train a random forest using these two sets:

    from sklearn.ensemble import RandomForestClassifier as RF

    rf = RF(n_estimators=n_trees, max_features=max_features,
            verbose=verbose, n_jobs=n_jobs)
    rf.fit(x_train, y_train)
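For reference, once the categorical columns are encoded (both answers below show ways to do this), combining them with the numeric column and fitting the forest could look roughly like this. This is a minimal sketch; x_cat_encoded and y_encoded are hypothetical names for the already-encoded features and target.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Assumption: x_cat_encoded is a dense numeric array from one of the
    # encoders in the answers below; y_encoded is the encoded target column.
    x_numeric = train[['AmountPaid']].values         # 2-D, shape (n_samples, 1)
    x_train = np.hstack([x_numeric, x_cat_encoded])  # full input matrix (x)

    rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
    rf.fit(x_train, y_encoded)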

Thank you for your help!

2 answers

This error comes from the way the data is being enumerated. If we print what that enumeration actually produces (using a different sample), you can see the problem:

    >>> import pandas as pd
    >>> train = pd.DataFrame({'a' : ['a', 'b', 'a'], 'd' : ['e', 'e', 'f'],
    ...                       'b' : [0, 1, 1], 'c' : ['b', 'c', 'b']})
    >>> samples = [dict(enumerate(sample)) for sample in train]
    >>> samples
    [{0: 'a'}, {0: 'b'}, {0: 'c'}, {0: 'd'}]

That is a list of dicts, but each one was built from a column name rather than a row. We should build one dict per row instead:

    >>> train_as_dicts = [dict(r.iteritems()) for _, r in train.iterrows()]
    >>> train_as_dicts
    [{'a': 'a', 'c': 'b', 'b': 0, 'd': 'e'},
     {'a': 'b', 'c': 'c', 'b': 1, 'd': 'e'},
     {'a': 'a', 'c': 'b', 'b': 1, 'd': 'f'}]

Now we need to vectorize the dicts:

    >>> from sklearn.feature_extraction import DictVectorizer
    >>> vectorizer = DictVectorizer()
    >>> vectorized_sparse = vectorizer.fit_transform(train_as_dicts)
    >>> vectorized_sparse
    <3x7 sparse matrix of type '<type 'numpy.float64'>'
        with 12 stored elements in Compressed Sparse Row format>
    >>> vectorized_array = vectorized_sparse.toarray()
    >>> vectorized_array
    array([[ 1.,  0.,  0.,  1.,  0.,  1.,  0.],
           [ 0.,  1.,  1.,  0.,  1.,  1.,  0.],
           [ 1.,  0.,  1.,  1.,  0.,  0.,  1.]])

To get the meaning of each column, ask the vectorizer:

    >>> vectorizer.get_feature_names()
    ['a=a', 'a=b', 'b', 'c=b', 'c=c', 'd=e', 'd=f']
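Applied to the data from the question, the same idea might look like the sketch below. It assumes the train DataFrame loaded in the question; UserID is cast to a string so the vectorizer treats it as a category rather than passing it through as a number.

    from sklearn.feature_extraction import DictVectorizer

    feature_cols = ['UserID', 'LocationID', 'ServiceID', 'AmountPaid']
    rows = []
    for _, r in train[feature_cols].iterrows():
        d = dict(r.items())              # one dict per row
        d['UserID'] = str(d['UserID'])   # the ID is a label, not a quantity
        rows.append(d)

    # String values are one-hot encoded; numeric values (AmountPaid) pass through.
    vectorizer = DictVectorizer()
    x_cat_encoded = vectorizer.fit_transform(rows).toarray()
    print(vectorizer.get_feature_names())  # get_feature_names_out() in sklearn >= 1.0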

For the target you can use sklearn's LabelEncoder. This will give you a converter from string labels to numeric ones (and the reverse mapping back as well). An example is in the link.
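For example, a quick sketch using the Target values from the question:

    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()
    y = le.fit_transform(['JFD', 'KOI', 'JFD'])  # sample labels from the question
    print(y)                        # [0 1 0]
    print(le.inverse_transform(y))  # ['JFD' 'KOI' 'JFD'] -- the backward mapping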

As for the features, learning algorithms generally expect (or work best with) numeric data, and simply mapping categories to integers would impose an ordering that does not exist. So the best option is to use OneHotEncoder to convert the categorical features. This creates a new binary feature for each category, indicating on/off for that category. Again, a usage example is in the link.
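A sketch of the feature side, assuming scikit-learn >= 0.20, where OneHotEncoder accepts string columns directly (older versions required integer-encoded input first):

    from sklearn.preprocessing import OneHotEncoder

    categorical_cols = ['UserID', 'LocationID', 'ServiceID']
    enc = OneHotEncoder(handle_unknown='ignore')
    # astype(str) makes the numeric UserID behave as a category
    x_cat_encoded = enc.fit_transform(train[categorical_cols].astype(str)).toarray()
    # one binary on/off column per category; stack with AmountPaid for the full x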


Source: https://habr.com/ru/post/987316/

