Sklearn mask for OneHotEncoder does not work

Given data such as:

from sklearn.preprocessing import OneHotEncoder
import numpy as np
dt = 'object, i4, i4'
d = np.array([('aaa', 1, 1), ('bbb', 2, 2)], dtype=dt)  

I want to exclude a text column using OHE functionality.

Why does the following not work?

ohe = OneHotEncoder(categorical_features=np.array([False,True,True], dtype=bool))       
ohe.fit(d)
ValueError: could not convert string to float: 'bbb'

The documentation says:

categorical_features: "all" or array of indices or mask :
  Specify what features are treated as categorical.
   ‘all’ (default): All features are treated as categorical.
   array of indices: Array of categorical feature indices.
   mask: Array of length n_features and with dtype=bool.

I use a mask, but it is still trying to convert to float.

Even using

ohe = OneHotEncoder(categorical_features=np.array([False,True,True], dtype=bool), 
                    dtype=dt)        
ohe.fit(d)

I get the same error.

And also in the case of an "array of indices":

ohe = OneHotEncoder(categorical_features=np.array([1, 2]), dtype=dt)        
ohe.fit(d)
+4
3 answers

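You should understand that all estimators in Scikit-Learn are designed for numerical input only. From that point of view, it makes no sense to leave the text column as it is. You have to convert the text column into something numerical, or drop it.

If you obtained your data set from a Pandas DataFrame, take a look at this small wrapper: https://github.com/paulgb/sklearn-pandas. It helps you transform all the required columns at once (or leave some of them in numerical form).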

import pandas as pd
import numpy as np
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({'text':['aaa', 'bbb'], 'number_1':[1, 1], 'number_2':[2, 2]})

#    number_1  number_2 text
# 0         1         2  aaa
# 1         1         2  bbb

# SomeEncoder is a placeholder: replace it with any transformer that
# produces a numerical representation of the text column
mapper = DataFrameMapper([
    ('text', SomeEncoder),
    (['number_1', 'number_2'], OneHotEncoder())
])
mapper.fit_transform(data)
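
For readers on newer scikit-learn releases: the categorical_features parameter was deprecated in 0.20 and later removed, and sklearn.compose.ColumnTransformer covers the same need. The sketch below is not the sklearn-pandas approach shown above. It assumes scikit-learn 0.20 or later and reuses the data from the question as a plain object array. The step name 'onehot' is arbitrary.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Data from the question, stored as a 2-D object array so that the
# text column and the integer columns can live side by side.
X = np.array([['aaa', 1, 1], ['bbb', 2, 2]], dtype=object)

ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), [1, 2])],  # one-hot encode columns 1 and 2
    remainder='drop',                       # discard the text column (column 0)
    sparse_threshold=0,                     # always return a dense array
)
encoded = ct.fit_transform(X)
print(encoded)
```

Because the column selection happens before the encoder sees the data, the text column is never converted to float.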
+2

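I think there is some confusion here. You still need to pass numerical values, but within the encoder you can specify which values are categorical and which are not.

The input to this transformer should be a matrix of integers denoting the values taken by the categorical (discrete) features.

In the example below, I change aaa to 5 and bbb to 6. That way they can be told apart from the numerical values 1 and 2: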

d = np.array([[5, 1, 1], [6, 2, 2]])
ohe = OneHotEncoder(categorical_features=np.array([True,False,False], dtype=bool))
ohe.fit(d)

You can then check the feature attributes:

ohe.active_features_
Out[22]: array([5, 6], dtype=int64)
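
The manual substitution above (aaa to 5, bbb to 6) can also be derived from the data. This is a minimal sketch that assumes plain NumPy and does not reproduce what OneHotEncoder does internally. The variable names are illustrative only.

```python
import numpy as np

# Text column from the question.
text = np.array(['aaa', 'bbb'])

# np.unique returns the sorted distinct values and, with
# return_inverse=True, the integer code of each original entry.
categories, codes = np.unique(text, return_inverse=True)

# Picking rows of an identity matrix by code yields a one-hot matrix.
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)  # distinct labels
print(codes)       # integer code per row
print(one_hot)     # one row per sample, one column per label
```

The integer codes can then be placed in the input matrix in place of the strings.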
+2

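I ran into the same behavior and found it frustrating. As others have pointed out, Scikit-Learn requires all data to be numerical before it even considers selecting the columns given in the categorical_features parameter.

Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py, and the very first line of that method is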

X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES).

This check fails if any of the data in the provided data frame X cannot be successfully converted to float.
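
The order of operations can be illustrated without scikit-learn. This is a rough sketch in which NumPy's astype stands in for the float conversion done by check_array. It is an approximation of the behavior, not the library's code.

```python
import numpy as np

X = np.array([['aaa', 1, 1], ['bbb', 2, 2]], dtype=object)
mask = np.array([False, True, True])

# Converting the whole array first fails on the text column,
# no matter what the mask says.
try:
    X.astype(np.float64)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Applying the mask first and converting afterwards works.
selected = X[:, mask].astype(np.float64)
print(selected)
```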

I agree that the documentation of sklearn.preprocessing.OneHotEncoder is rather misleading in this regard.

+1

Source: https://habr.com/ru/post/1618598/

