How to get a one-hot encoded vector, as in the table below

I am trying to get a table in the following form. For some reason, I was not able to get my consolidated code to work.

import pandas as pd

df = pd.DataFrame([('a','f1'), ('a','f2'), ('a','f3'), ('b','f4'), ('c','f2'), ('c','f4')], columns=['user', 'val'])


df 
---
user    val
a      f1
a      f2
a      f3
b      f4
c      f2
c      f4 


>> output 

user    f1  f2  f3  f4
a       1   1   1   0
b       0   0   0   1
c       0   1   0   1
3 answers

Option 1
get_dummies with groupby + sum

df.set_index('user').val.str.get_dummies().groupby(level=0).sum()

      f1  f2  f3  f4
user                
a      1   1   1   0
b      0   0   0   1
c      0   1   0   1

Option 2
groupby + value_counts + unstack

df.groupby('user').val.value_counts().unstack(fill_value=0)

val   f1  f2  f3  f4
user                
a      1   1   1   0
b      0   0   0   1
c      0   1   0   1

Option 3
pivot_table with size as the aggfunc

df.pivot_table(index='user', columns='val', aggfunc='size', fill_value=0)

val   f1  f2  f3  f4
user                
a      1   1   1   0
b      0   0   0   1
c      0   1   0   1
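
One caveat: all three options above count occurrences, so a user who has the same value more than once would get an entry greater than 1. If strict 0/1 indicators are wanted, one possible approach is to clip the counts at 1. The sketch below uses Option 2 and adds a hypothetical duplicate row ('c', 'f4') that is not in the question's data, purely to show the effect.

```python
import pandas as pd

# Data from the question, plus one hypothetical duplicate row ('c', 'f4')
df = pd.DataFrame(
    [('a', 'f1'), ('a', 'f2'), ('a', 'f3'), ('b', 'f4'),
     ('c', 'f2'), ('c', 'f4'), ('c', 'f4')],
    columns=['user', 'val'],
)

# Option 2: per-user counts of each value
counts = df.groupby('user').val.value_counts().unstack(fill_value=0)

# Cap every count at 1 to get 0/1 indicators
one_hot = counts.clip(upper=1)

print(counts)
print(one_hot)
```

Without duplicates in the data, the counts and the clipped table are identical, so this step is only needed when repeats can occur.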

It seems pd.crosstab(df['user'], df['val']) works too.
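
For completeness, here is a minimal runnable sketch of the crosstab approach on the question's data. The last step, which moves user back into a regular column to match the layout asked for in the question, is my own addition; the names table and flat are just illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [('a', 'f1'), ('a', 'f2'), ('a', 'f3'), ('b', 'f4'), ('c', 'f2'), ('c', 'f4')],
    columns=['user', 'val'],
)

# Rows are users, columns are values, cells are occurrence counts
table = pd.crosstab(df['user'], df['val'])

# Optional: turn the 'user' index into a column and drop the 'val' axis label
flat = table.reset_index().rename_axis(columns=None)

print(flat)
```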


Another solution, using scikit-learn's CountVectorizer:

In [82]: from sklearn.feature_extraction.text import CountVectorizer

In [83]: cv = CountVectorizer()

In [84]: d2 = df.groupby('user')['val'].agg(' '.join).reset_index(name='val')

In [85]: d2
Out[85]:
  user       val
0    a  f1 f2 f3
1    b        f4
2    c     f2 f4

In [86]: r = pd.DataFrame.sparse.from_spmatrix(cv.fit_transform(d2['val']),
    ...:                                       index=d2.index,
    ...:                                       columns=cv.get_feature_names_out())
    ...:

In [88]: d2[['user']].join(r)
Out[88]:
  user  f1  f2  f3  f4
0    a   1   1   1   0
1    b   0   0   0   1
2    c   0   1   0   1

Source: https://habr.com/ru/post/1693380/

