Creating dummy variables for unique list items in a column in Python

Question

I currently have the following DataFrame:

    import pandas as pd

    df = pd.DataFrame({"ID": ['1', '2', '3', '4', '5'],
                       "col2": [['a', 'b', 'c'],
                                ['c', 'd', 'e', 'f'],
                                ['f', 'b', 'f'],
                                ['a', 'c', 'b'],
                                ['b', 'a', 'b']]})
    print(df)

      ID          col2
    0  1     [a, b, c]
    1  2  [c, d, e, f]
    2  3     [f, b, f]
    3  4     [a, c, b]
    4  5     [b, a, d]

I want to create a new DataFrame with dummy (0/1 indicator) columns for the items in col2, for example:

      ID  a  b  c  d  e  f
    0  1  1  1  1  0  0  0
    1  2  0  0  1  1  1  1
    2  3  0  1  0  0  0  1
    3  4  1  1  1  0  0  0
    4  5  1  1  0  1  0  0

The following code generates a separate column for each letter depending on its position in the list:

    df2 = df.col2.str.get_dummies(sep=",")
    pd.concat([df['ID'], df2], axis=1)

    ID  a  b  b]  c  c]  d  d]  e  f]  [a  [b  [c  [f
     1  0  1   0  0   1  0   0  0   0   1   0   0   0
     2  0  0   0  0   0  1   0  1   1   0   0   1   0
     3  0  1   0  0   0  0   0  0   1   0   0   0   1
     4  0  0   1  1   0  0   0  0   0   1   0   0   0
     5  1  0   0  0   0  0   1  0   0   0   1   0   0

The code above creates a different column for each letter according to the position it occupies in the list. Does anyone have an idea how to get around this? The pd.get_dummies option does not work either.

+5

python python-2.7 pandas

Carlos cardona Dec 02 '16 at 18:30

3 answers

Using a dict comprehension can be faster:

    In [40]: pd.DataFrame({k: 1 for k in x} for x in df.col2.values).fillna(0).astype(int)
    Out[40]:
       a  b  c  d  e  f
    0  1  1  1  0  0  0
    1  0  0  1  1  1  1
    2  0  1  0  0  0  1
    3  1  1  1  0  0  0
    4  1  1  0  0  0  0

    In [48]: pd.concat([
                 df['ID'],
                 pd.DataFrame({k: 1 for k in x} for x in df.col2).fillna(0).astype(int)],
                 axis=1)
    Out[48]:
      ID  a  b  c  d  e  f
    0  1  1  1  1  0  0  0
    1  2  0  0  1  1  1  1
    2  3  0  1  0  0  0  1
    3  4  1  1  1  0  0  0
    4  5  1  1  0  0  0  0
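To make the dict comprehension easier to follow, here is a rough pure-Python sketch of what it builds for each row before pandas aligns the dicts by key. The names `rows`, `records`, `columns` and `table` are illustrative only, and the explicit alignment step merely imitates what the DataFrame constructor followed by `fillna(0)` does.

```python
# Sample lists, taken from the first three rows of the question's data.
rows = [['a', 'b', 'c'], ['c', 'd', 'e', 'f'], ['f', 'b', 'f']]

# One dict per row: each distinct item maps to 1 (duplicates collapse).
records = [{k: 1 for k in row} for row in rows]

# Union of all keys, sorted, plays the role of the column labels.
columns = sorted({k for rec in records for k in rec})

# Missing keys become 0, which is what fillna(0) does in the pandas version.
table = [[rec.get(col, 0) for col in columns] for rec in records]

print(columns)
for line in table:
    print(line)
```

Because a dict cannot hold the same key twice, a row such as `['f', 'b', 'f']` yields a single 1 for `f`, so this approach records presence, not counts.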

Timings

    In [2942]: df.shape
    Out[2942]: (50000, 2)

    In [2945]: %timeit pd.DataFrame({k: 1 for k in x} for x in df.col2).fillna(0).astype(int)
    10 loops, best of 3: 137 ms per loop

    In [2946]: %timeit df['col2'].str.join('@').str.get_dummies('@')
    1 loop, best of 3: 395 ms per loop

+3

Zero Oct 16 '17 at 15:12

With the df you provided, this works:

    def f1(x):
        # 1 if the item exists in the list
        # (sorted list, because newer pandas rejects a set as an index)
        return pd.Series(1, index=sorted(set(x)))

    def f2(x):
        # count occurrences of each item
        return pd.Series(x).value_counts()

    print(df.set_index('ID').col2.apply(f1).fillna(0).astype(int).reset_index())
    print('')
    print(df.set_index('ID').col2.apply(f2).fillna(0).astype(int).reset_index())

      ID  a  b  c  d  e  f
    0  1  1  1  1  0  0  0
    1  2  0  0  1  1  1  1
    2  3  0  1  0  0  0  1
    3  4  1  1  1  0  0  0
    4  5  1  1  0  0  0  0

      ID  a  b  c  d  e  f
    0  1  1  1  1  0  0  0
    1  2  0  0  1  1  1  1
    2  3  0  1  0  0  0  2
    3  4  1  1  1  0  0  0
    4  5  1  2  0  0  0  0
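The difference between the two functions (presence versus occurrence count) can be seen on a single row without pandas. This is only an illustrative sketch: `row`, `presence` and `counts` are made-up names, and `collections.Counter` stands in for the counting step.

```python
from collections import Counter

row = ['f', 'b', 'f']  # third row of the question's data

# Presence: every distinct item is marked with 1, duplicates are ignored.
presence = {item: 1 for item in set(row)}

# Counts: every distinct item maps to how many times it occurs.
counts = dict(Counter(row))

print(presence)
print(counts)
```

This mirrors why the second table above shows 2 under `f` for ID 3 while the first shows 1.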

+1

piRSquared Dec 02 '16 at 18:45

ayhan · Accepted Answer · 2016-12-02T18:38:42+0000

str.get_dummies works well on strings, so you can join each list into a single delimited string and then call str.get_dummies on that string. For instance:

    df['col2'].str.join('@').str.get_dummies('@')
    Out:
       a  b  c  d  e  f
    0  1  1  1  0  0  0
    1  0  0  1  1  1  1
    2  0  1  0  0  0  1
    3  1  1  1  0  0  0
    4  1  1  0  0  0  0

Here @ is an arbitrary character that does not appear in the list.
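If you are not sure that the separator is absent from the data, one option is to check before joining. The sketch below is only an illustration on a small made-up frame; `pick_separator` and its candidate list are hypothetical helpers, not part of pandas.

```python
import pandas as pd

# Small made-up frame with the same shape as the one in the question.
df = pd.DataFrame({
    "ID": ["1", "2", "3"],
    "col2": [["a", "b"], ["b", "c", "b"], ["c"]],
})

def pick_separator(lists, candidates=("@", "|", ";")):
    """Hypothetical helper: first candidate that occurs in no list item."""
    items = {str(item) for lst in lists for item in lst}
    for sep in candidates:
        if not any(sep in item for item in items):
            return sep
    raise ValueError("no safe separator among the candidates")

sep = pick_separator(df["col2"])
joined = df["col2"].str.join(sep)       # each list becomes one delimited string
dummies = joined.str.get_dummies(sep)   # one 0/1 column per distinct item
result = pd.concat([df["ID"], dummies], axis=1)
print(result)
```

Note that a repeated item (the second row contains `b` twice) still produces a single 1, since str.get_dummies yields indicators, not counts.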

Then you can concat as usual:

    pd.concat([df['ID'], df['col2'].str.join('@').str.get_dummies('@')], axis=1)
    Out:
      ID  a  b  c  d  e  f
    0  1  1  1  1  0  0  0
    1  2  0  0  1  1  1  1
    2  3  0  1  0  0  0  1
    3  4  1  1  1  0  0  0
    4  5  1  1  0  0  0  0
