Python Pandas - Working with Duplicates

I want to deal with duplicates in a pandas DataFrame:

import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 1], 'B': [2, 2, 1, 2, 1], 'C': [2, 2, 1, 1, 1], 'D': ['a', 'c', 'a', 'c', 'c']})
df

I want to keep only one row per unique combination of A, B, and C, and create binary columns D_a and D_c that show whether each value of D occurs for that combination. The result should look like this, and I would like to avoid slow row-by-row loops.

result = pd.DataFrame({'A': [1, 1, 2], 'B': [2, 1, 2], 'C': [2, 1, 1], 'D_a': [1, 1, 0], 'D_c': [1, 1, 1]})

Thank you so much

3 answers

Using get_dummies + sum:

df = (df.set_index(['A', 'B', 'C'])
        .D.str.get_dummies()
        .groupby(level=[0, 1, 2]).sum()  # sum(level=...) was removed in pandas 2.0
        .add_prefix('D_')
        .reset_index())

df

   A  B  C  D_a  D_c
0  1  1  1    1    1
1  1  2  2    1    1
2  2  2  1    0    1
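
One caveat with summing: if the same (A, B, C, D) row can occur more than once, the sum becomes a count above 1 rather than a 0/1 flag. A minimal sketch of a variant that takes the maximum per group instead is below. The extra repeated row and the names df_demo, dummies and flags are assumptions added for illustration, not part of the question.

```python
import pandas as pd

# The question's data plus one repeated row (1, 2, 2, 'a'), added for illustration.
df_demo = pd.DataFrame({'A': [1, 1, 1, 2, 1, 1],
                        'B': [2, 2, 1, 2, 1, 2],
                        'C': [2, 2, 1, 1, 1, 2],
                        'D': ['a', 'c', 'a', 'c', 'c', 'a']})

# One indicator column per value of D, as integers rather than booleans.
dummies = pd.get_dummies(df_demo, columns=['D'], prefix='D', dtype=int)

# max() keeps each flag at 0 or 1 even when a row is repeated.
flags = dummies.groupby(['A', 'B', 'C'], as_index=False).max()
print(flags)
```

With sum() in place of max(), D_a for the group (1, 2, 2) would be 2 here.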

You can use:

df1 = (df.groupby(['A', 'B', 'C'])['D']
         .value_counts()
         .unstack(fill_value=0)
         .add_prefix('D_')
         .clip(upper=1)  # clip_upper was removed in pandas 1.0
         .reset_index()
         .rename_axis(None, axis=1))

print(df1)
   A  B  C  D_a  D_c
0  1  1  1    1    1
1  1  2  2    1    1
2  2  2  1    0    1
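
The same counting-then-clipping idea can also be sketched with pd.crosstab, which builds the count table in one call. This is an alternative under the assumption that the original df from the question (with column D) is used; the name df2 is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 1],
                   'B': [2, 2, 1, 2, 1],
                   'C': [2, 2, 1, 1, 1],
                   'D': ['a', 'c', 'a', 'c', 'c']})

# Count occurrences of each D value per (A, B, C), then cap the counts at 1.
df2 = (pd.crosstab([df['A'], df['B'], df['C']], df['D'])
         .clip(upper=1)
         .add_prefix('D_')
         .reset_index()
         .rename_axis(None, axis=1))
print(df2)
```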

You can do something like this:

df.loc[df['D']=='a', 'D_a'] = 1
df.loc[df['D']=='c', 'D_c'] = 1

This puts 1 in the new column D_a wherever D is "a", and 1 in D_c wherever D is "c".

    A   B   C   D   D_a  D_c
0   1   2   2   a   1.0  NaN
1   1   2   2   c   NaN  1.0
2   1   1   1   a   1.0  NaN
3   2   2   1   c   NaN  1.0
4   1   1   1   c   NaN  1.0

The remaining cells are NaN, so you need to replace them with 0:

df = df.fillna(0)

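Then select the columns you want and collapse the rows that share A, B, and C. Calling drop_duplicates alone is not enough here, because rows such as (1, 2, 2, 1, 0) and (1, 2, 2, 0, 1) still differ in D_a and D_c, so take the maximum per group instead:

df = df[["A", "B", "C", "D_a", "D_c"]].groupby(["A", "B", "C"], as_index=False).max()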

Hope this is the solution you were looking for.
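
The NaN and fillna steps above can be skipped by building the flags from boolean comparisons. A minimal sketch, assuming the original df from the question; the name result3 is only for illustration, and the short loop runs over the two values of D, not over rows.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 1],
                   'B': [2, 2, 1, 2, 1],
                   'C': [2, 2, 1, 1, 1],
                   'D': ['a', 'c', 'a', 'c', 'c']})

# A boolean comparison cast to int gives 0/1 directly, with no NaN to fill.
for value in ['a', 'c']:
    df['D_' + value] = (df['D'] == value).astype(int)

# Collapse rows that share A, B, C; max() keeps a 1 if any row in the group had it.
result3 = df.groupby(['A', 'B', 'C'], as_index=False)[['D_a', 'D_c']].max()
print(result3)
```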


Source: https://habr.com/ru/post/1690780/