How to one-hot encode variable-length features?

Question

Given a list of variable-length features:

features = [ ['f1', 'f2', 'f3'], ['f2', 'f4', 'f5', 'f6'], ['f1', 'f2'] ]

where each sample has a variable number of features, and the feature dtype is str and is already one hot.

To use the feature selection utilities of sklearn, I have to convert the features to a 2D array that looks like this:

        f1  f2  f3  f4  f5  f6
    s1   1   1   1   0   0   0
    s2   0   1   0   1   1   1
    s3   1   1   0   0   0   0

How can I achieve this through sklearn or numpy?

+5

python numpy pandas scikit-learn

Zelong Feb 22 '17 at 12:13

2 answers

Here's one approach with NumPy methods and output as pandas dataframe -

    import numpy as np
    import pandas as pd

    lens = list(map(len, features))
    N = len(lens)
    # unq: sorted unique labels; col: column index of every flattened label
    unq, col = np.unique(np.concatenate(features), return_inverse=True)
    # row: sample index of every flattened label
    row = np.repeat(np.arange(N), lens)
    out = np.zeros((N, len(unq)), dtype=int)
    out[row, col] = 1
    indx = ['s' + str(i + 1) for i in range(N)]
    df_out = pd.DataFrame(out, columns=unq, index=indx)

Example input, output -

    In [80]: features
    Out[80]: [['f1', 'f2', 'f3'], ['f2', 'f4', 'f5', 'f6'], ['f1', 'f2']]

    In [81]: df_out
    Out[81]:
        f1  f2  f3  f4  f5  f6
    s1   1   1   1   0   0   0
    s2   0   1   0   1   1   1
    s3   1   1   0   0   0   0
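To make the indexing trick easier to follow, here is a minimal sketch that wraps the same steps in a function and also returns the intermediate row and column index arrays. The function name `one_hot_variable_length` is made up for illustration and is not part of NumPy; the sketch assumes every sample contains at least one label.

```python
import numpy as np


def one_hot_variable_length(features):
    """Hypothetical helper: one-hot encode a list of label lists.

    Returns the 0/1 matrix, the sorted unique labels (the columns),
    and the row/column index arrays used to fill the matrix.
    """
    lens = [len(sample) for sample in features]
    # Flatten all labels into one 1D array of strings.
    flat = np.concatenate(features)
    # labels: sorted unique labels; col: position of each flat label in labels.
    labels, col = np.unique(flat, return_inverse=True)
    # row: index of the sample each flat label came from.
    row = np.repeat(np.arange(len(features)), lens)
    out = np.zeros((len(features), len(labels)), dtype=int)
    # Each (row, col) pair marks one label present in one sample.
    out[row, col] = 1
    return out, labels, row, col


features = [['f1', 'f2', 'f3'], ['f2', 'f4', 'f5', 'f6'], ['f1', 'f2']]
out, labels, row, col = one_hot_variable_length(features)
print(row)     # sample index per label
print(col)     # column index per label
print(labels)  # column names
print(out)
```

For the example input, `row` and `col` pair up so that each `(row[i], col[i])` is one cell to set to 1, which is why a single fancy-indexing assignment fills the whole matrix.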

+2

Divakar Feb 22 '17 at 12:26

Vivek kumar · Accepted Answer · 2017-02-22T13:21:41+0000

You can use the MultiLabelBinarizer from scikit-learn, which is designed for exactly this.

Code for your example:

    features = [
        ['f1', 'f2', 'f3'],
        ['f2', 'f4', 'f5', 'f6'],
        ['f1', 'f2']
    ]

    from sklearn.preprocessing import MultiLabelBinarizer

    mlb = MultiLabelBinarizer()
    new_features = mlb.fit_transform(features)

Output:

    array([[1, 1, 1, 0, 0, 0],
           [0, 1, 0, 1, 1, 1],
           [1, 1, 0, 0, 0, 0]])
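The array on its own does not say which column is which. As a small sketch, the fitted binarizer's `classes_` attribute holds the column labels, so the labelled table from the question can be rebuilt with pandas. The `s1`, `s2`, ... index names and the `unseen` sample are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(features)

# classes_ lists the labels in the same order as the output columns.
index = [f"s{i + 1}" for i in range(len(features))]
df = pd.DataFrame(encoded, columns=mlb.classes_, index=index)
print(df)

# The fitted binarizer can encode further samples with the same columns
# (here only labels that were seen during fit are used).
unseen = mlb.transform([['f2', 'f6']])
print(unseen)
```

Reusing the fitted `mlb` for later samples keeps the column layout identical between training and prediction data.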

It can also be used in the pipeline along with other feature_selection utilities.
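As a rough sketch of combining the binarized output with a feature selector, the example below feeds it to `VarianceThreshold`, which by default drops columns that have the same value in every sample. The choice of selector is only an assumption for illustration. As far as I know, `MultiLabelBinarizer.fit_transform` takes a single argument, so here it is applied before the selector rather than as a `Pipeline` step.

```python
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MultiLabelBinarizer

features = [
    ['f1', 'f2', 'f3'],
    ['f2', 'f4', 'f5', 'f6'],
    ['f1', 'f2'],
]

# Step 1: one-hot encode the variable-length label lists.
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(features)

# Step 2: drop constant columns (default threshold of 0.0).
selector = VarianceThreshold()
selected = selector.fit_transform(encoded)

# Map the boolean support mask back to the original label names.
kept = mlb.classes_[selector.get_support()]
print(kept)
print(selected.shape)
```

In this data `f2` appears in every sample, so its column is constant and is the one removed.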
