Creating dummies for unique list items in a column in Python

I currently have the following DataFrame:

    import pandas as pd

    df = pd.DataFrame({"ID": ['1', '2', '3', '4', '5'],
                       "col2": [['a', 'b', 'c'],
                                ['c', 'd', 'e', 'f'],
                                ['f', 'b', 'f'],
                                ['a', 'c', 'b'],
                                ['b', 'a', 'b']]})
    print(df)

      ID          col2
    0  1     [a, b, c]
    1  2  [c, d, e, f]
    2  3     [f, b, f]
    3  4     [a, c, b]
    4  5     [b, a, d]

I want to create a new DataFrame with dummy (indicator) columns for the items in col2, for example:

      ID  a  b  c  d  e  f
    0  1  1  1  1  0  0  0
    1  2  0  0  1  1  1  1
    2  3  0  1  0  0  0  1
    3  4  1  1  1  0  0  0
    4  5  1  1  0  1  0  0

I tried the following code, but it does not produce the columns I want:

    df2 = df.col2.str.get_dummies(sep=",")
    pd.concat([df['ID'], df2], axis=1)

    ID  a  b  b]  c  c]  d  d]  e  f]  [a  [b  [c  [f
     1  0  1   0  0   1  0   0  0   0   1   0   0   0
     2  0  0   0  0   0  1   0  1   1   0   0   1   0
     3  0  1   0  0   0  0   0  0   1   0   0   0   1
     4  0  0   1  1   0  0   0  0   0   1   0   0   0
     5  1  0   0  0   0  0   1  0   0   0   1   0   0

This code generates a different column for each letter depending on its position in the list. Any idea why this happens? The pd.get_dummies option does not work either.

3 answers

str.get_dummies works well on strings, so you can join each list into a single delimited string and then call str.get_dummies on that string. For instance,

    df['col2'].str.join('@').str.get_dummies('@')
    Out:
       a  b  c  d  e  f
    0  1  1  1  0  0  0
    1  0  0  1  1  1  1
    2  0  1  0  0  0  1
    3  1  1  1  0  0  0
    4  1  1  0  0  0  0

Here @ is an arbitrary separator character; it must not appear in any item of the lists.

Then you can concatenate with the ID column as usual:

    pd.concat([df['ID'], df['col2'].str.join('@').str.get_dummies('@')], axis=1)
    Out:
      ID  a  b  c  d  e  f
    0  1  1  1  1  0  0  0
    1  2  0  0  1  1  1  1
    2  3  0  1  0  0  0  1
    3  4  1  1  1  0  0  0
    4  5  1  1  0  0  0  0
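The join-then-split idea above can be wrapped in a small helper. This is only a minimal sketch and not part of the original answer: the function name list_dummies and the separator check are assumptions added for illustration, and it assumes every list item is a string.

```python
import pandas as pd


def list_dummies(df, list_col, sep="@"):
    """Hypothetical helper: one 0/1 indicator column per distinct list item."""
    # Guard against a separator that occurs inside an item, which would
    # split that item into two bogus columns.
    for items in df[list_col]:
        if any(sep in item for item in items):
            raise ValueError(f"separator {sep!r} occurs in the data")
    # Join each list into one delimited string, then split it into dummies.
    dummies = df[list_col].str.join(sep).str.get_dummies(sep=sep)
    # Keep every other column and append the indicator columns.
    return pd.concat([df.drop(columns=list_col), dummies], axis=1)


# Small made-up frame, shaped like the one in the question.
df = pd.DataFrame({
    "ID": ["1", "2", "3"],
    "col2": [["a", "b", "c"], ["c", "d"], ["b", "b"]],
})
result = list_dummies(df, "col2")
print(result)
```

A repeated item such as the two b entries in the last row still yields a single 1, because get_dummies only records presence.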

Using dict comprehensions can be faster:

    In [40]: pd.DataFrame({k: 1 for k in x} for x in df.col2.values).fillna(0).astype(int)
    Out[40]:
       a  b  c  d  e  f
    0  1  1  1  0  0  0
    1  0  0  1  1  1  1
    2  0  1  0  0  0  1
    3  1  1  1  0  0  0
    4  1  1  0  0  0  0

    In [48]: pd.concat([
                 df['ID'],
                 pd.DataFrame({k: 1 for k in x} for x in df.col2).fillna(0).astype(int)],
                 axis=1)
    Out[48]:
      ID  a  b  c  d  e  f
    0  1  1  1  1  0  0  0
    1  2  0  0  1  1  1  1
    2  3  0  1  0  0  0  1
    3  4  1  1  1  0  0  0
    4  5  1  1  0  0  0  0
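To make the intermediate steps of the dict-comprehension approach visible, here is a step-by-step sketch on a small made-up frame. The variable names records and dummies, the explicit index argument, and the final column sort are assumptions added for illustration, not part of the original answer.

```python
import pandas as pd

# Small made-up frame, shaped like the one in the question.
df = pd.DataFrame({
    "ID": ["1", "2", "3"],
    "col2": [["a", "b", "c"], ["c", "d"], ["b", "b"]],
})

# One dict per row; duplicate items collapse because dict keys are unique.
records = [{item: 1 for item in items} for items in df["col2"]]

# pandas aligns the dict keys into columns and leaves NaN where a key is absent,
# so the gaps are filled with 0 before casting back to integers.
dummies = pd.DataFrame(records, index=df.index).fillna(0).astype(int)

# Sort the columns so the layout does not depend on the order of appearance.
dummies = dummies.sort_index(axis=1)

result = pd.concat([df["ID"], dummies], axis=1)
print(result)
```

Passing index=df.index keeps the dummies aligned with the original rows when the frame does not use the default 0..n-1 index.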

Timings

    In [2942]: df.shape
    Out[2942]: (50000, 2)

    In [2945]: %timeit pd.DataFrame({k: 1 for k in x} for x in df.col2).fillna(0).astype(int)
    10 loops, best of 3: 137 ms per loop

    In [2946]: %timeit df['col2'].str.join('@').str.get_dummies('@')
    1 loop, best of 3: 395 ms per loop

With the df you provided, this works. f1 marks whether an item is present, f2 counts how often it occurs:

    def f1(x):
        # 1 if the item exists in the list;
        # a sorted list gives a deterministic index order
        return pd.Series(1, index=sorted(set(x)))

    def f2(x):
        # count occurrences of each item
        return pd.Series(x).value_counts()

    print(df.set_index('ID').col2.apply(f1).fillna(0).astype(int).reset_index())
    print('')
    print(df.set_index('ID').col2.apply(f2).fillna(0).astype(int).reset_index())

      ID  a  b  c  d  e  f
    0  1  1  1  1  0  0  0
    1  2  0  0  1  1  1  1
    2  3  0  1  0  0  0  1
    3  4  1  1  1  0  0  0
    4  5  1  1  0  0  0  0

      ID  a  b  c  d  e  f
    0  1  1  1  1  0  0  0
    1  2  0  0  1  1  1  1
    2  3  0  1  0  0  0  2
    3  4  1  1  1  0  0  0
    4  5  1  2  0  0  0  0
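The difference between the two outputs above (presence flags versus occurrence counts) can also be sketched without apply. This variant swaps in collections.Counter from the standard library, which is not what the answer uses; the names counts and flags are illustrative assumptions.

```python
from collections import Counter

import pandas as pd

# Two rows taken from the question's data, both with a repeated item.
df = pd.DataFrame({
    "ID": ["3", "5"],
    "col2": [["f", "b", "f"], ["b", "a", "b"]],
})

# Counter maps each item to how many times it occurs in the list.
counts = (
    pd.DataFrame([dict(Counter(items)) for items in df["col2"]], index=df.index)
    .fillna(0)
    .astype(int)
    .sort_index(axis=1)
)

# Clipping the counts at 1 turns them into presence/absence flags.
flags = counts.clip(upper=1)

print(pd.concat([df["ID"], counts], axis=1))
print(pd.concat([df["ID"], flags], axis=1))
```

Building the counts first and deriving the flags from them means both tables come from a single pass over the lists.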

Source: https://habr.com/ru/post/1260729/

