Pandas find duplicates in cross values

I have a DataFrame and want to remove duplicate rows that contain the same values, but in different columns:

    df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
    df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
    df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
    df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})
    df
    Out[8]:
       a  b  c  d
    1  x  y  e  f
    2  e  f  x  y
    3  w  v  s  t

Rows [1] and [2] both contain the values {x, y, e, f}, but crosswise: if you swap columns c, d with a, b in row [2], you get a duplicate of row [1]. I want to drop such rows and keep only one of them, to get this final output:

    df_new
    Out[20]:
       a  b  c  d
    1  x  y  e  f
    3  w  v  s  t

How can I do this efficiently?

3 answers

I think you need boolean indexing with a mask built from numpy.sort and duplicated; use ~ to invert the mask:

    df = df[~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()]
    print (df)
       a  b  c  d
    1  x  y  e  f
    3  w  v  s  t

More details:

    print (np.sort(df, axis=1))
    [['e' 'f' 'x' 'y']
     ['e' 'f' 'x' 'y']
     ['s' 't' 'v' 'w']]

    print (pd.DataFrame(np.sort(df, axis=1), index=df.index))
       0  1  2  3
    1  e  f  x  y
    2  e  f  x  y
    3  s  t  v  w

    print (pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
    1    False
    2     True
    3    False
    dtype: bool

    print (~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
    1     True
    2    False
    3     True
    dtype: bool
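For reuse, the sort-then-duplicated idea can be wrapped in a small helper. This is only a sketch: the name drop_cross_duplicates is hypothetical (not part of pandas), the keep argument is simply passed through to DataFrame.duplicated, and it assumes all values within a row can be compared with each other (for example, all strings).

```python
import numpy as np
import pandas as pd


def drop_cross_duplicates(frame, keep="first"):
    """Drop rows whose values match another row in any column order.

    Hypothetical helper: `keep` is forwarded to DataFrame.duplicated
    ("first", "last", or False to drop every member of a duplicate group).
    """
    # Sorting each row gives an order-independent signature of its values.
    sorted_rows = pd.DataFrame(np.sort(frame.to_numpy(), axis=1), index=frame.index)
    return frame[~sorted_rows.duplicated(keep=keep)]


# Same data as in the question, built in one step.
df = pd.DataFrame(
    {"a": ["x", "e", "w"], "b": ["y", "f", "v"],
     "c": ["e", "x", "s"], "d": ["f", "y", "t"]},
    index=["1", "2", "3"],
)

result = drop_cross_duplicates(df)                  # keeps rows 1 and 3
result_last = drop_cross_duplicates(df, "last")     # keeps rows 2 and 3
result_none = drop_cross_duplicates(df, False)      # keeps only row 3
print(result)
```

Passing keep="last" retains the later row of each duplicate group instead, and keep=False removes every row that has a crosswise twin.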

Here's another solution with a for loop:

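    data = df.to_numpy()  # df.as_matrix() was removed in pandas 1.0
    new = []
    for row in data:
        # keep the row only if no already-kept row has the same values in any order
        if not any(sorted(row) == sorted(nrow) for nrow in new):
            new.append(row)
    # note: the original index labels are not preserved here
    new_df = pd.DataFrame(new, columns=df.columns)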

Sort the values within each row (np.sort), then find the duplicates (.duplicated()). Then use that mask to drop (df.drop) the corresponding index labels:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
    df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
    df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
    df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})

    # True for each row that repeats an earlier row once its values are sorted
    df_duplicated = pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()
    # drop() returns a new DataFrame, so assign the result
    df = df.drop(df.index[df_duplicated])
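As a quick check, dropping by label and filtering with the inverted mask should give the same result, so the choice between them is mostly a matter of taste. A minimal sketch, using the question's data; the variable names by_drop and by_mask are just illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["x", "y", "e", "f"], ["e", "f", "x", "y"], ["w", "v", "s", "t"]],
    columns=["a", "b", "c", "d"],
    index=["1", "2", "3"],
)

# True for rows that repeat an earlier row once values are sorted within the row
mask = pd.DataFrame(np.sort(df.to_numpy(), axis=1), index=df.index).duplicated()

by_drop = df.drop(df.index[mask])  # remove the flagged labels
by_mask = df[~mask]                # keep the rows that are not flagged

print(by_drop.equals(by_mask))
```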

Source: https://habr.com/ru/post/1274591/

