Pandas find duplicates in cross values

Question

I have a DataFrame and want to remove duplicate rows that contain the same values, but in different columns:

    df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
    df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
    df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
    df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})
    df
    Out[8]:
       a  b  c  d
    1  x  y  e  f
    2  e  f  x  y
    3  w  v  s  t

Rows [1] and [2] both contain the values {x, y, e, f}, but crosswise: if you swap columns c, d with a, b in row [2], you get a duplicate of row [1]. I want to treat such rows as duplicates and keep only one of them, to get the final output:

    df_new
    Out[20]:
       a  b  c  d
    1  x  y  e  f
    3  w  v  s  t

How can I do this efficiently?

+5

python pandas duplicates

maxtenzin Jan 9 '18 at 10:22

3 answers

Here's another solution with a for loop:

    data = df.to_numpy()  # as_matrix() was removed in pandas 1.0
    new = []
    for row in data:
        # Keep the row only if none of its values occurs in an already kept row
        if not any(c in nrow for nrow in new for c in row):
            new.append(row)
    new_df = pd.DataFrame(new, columns=df.columns)
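Note that the loop above treats a row as a duplicate as soon as it shares any single value with a kept row. If rows should only count as duplicates when they hold exactly the same values, a variant could compare whole rows instead. The following is a minimal sketch under that assumption, not part of the original answer; the names `seen` and `keep_labels` are illustrative, and it assumes the index labels are unique:

```python
import pandas as pd

df = pd.DataFrame(
    [["x", "y", "e", "f"], ["e", "f", "x", "y"], ["w", "v", "s", "t"]],
    columns=["a", "b", "c", "d"],
    index=["1", "2", "3"],
)

seen = set()       # sorted value tuples of the rows kept so far
keep_labels = []   # index labels of the rows to keep

for label, row in df.iterrows():
    # Same values in any column order produce the same key
    key = tuple(sorted(row))
    if key not in seen:
        seen.add(key)
        keep_labels.append(label)

# Select the kept rows by label, which preserves the original index
new_df = df.loc[keep_labels]
print(new_df)
```

Unlike the loop above, this keeps the original index labels, and a row that merely overlaps with another row in one or two values is not dropped.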

+1

zipa Jan 9 '18 at 10:39

Sort the values within each row (np.sort) and then flag the duplicates (.duplicated()). Then use that mask to drop (df.drop) the corresponding index labels:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
    df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
    df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
    df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})

    # True for every row whose sorted values repeat an earlier row
    df_duplicated = pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()
    # Returns a new DataFrame without the flagged index labels
    df.drop(df.index[df_duplicated])

+1

Shrey Jan 9 '18 at 10:41

jezrael · Accepted Answer · 2018-01-09T10:25:02+0000

I think you need boolean indexing with a mask built from numpy.sort together with duplicated; use ~ to invert the mask:

    df = df[~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()]
    print (df)
       a  b  c  d
    1  x  y  e  f
    3  w  v  s  t

More details:

    print (np.sort(df, axis=1))
    [['e' 'f' 'x' 'y']
     ['e' 'f' 'x' 'y']
     ['s' 't' 'v' 'w']]

    print (pd.DataFrame(np.sort(df, axis=1), index=df.index))
       0  1  2  3
    1  e  f  x  y
    2  e  f  x  y
    3  s  t  v  w

    print (pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
    1    False
    2     True
    3    False
    dtype: bool

    print (~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
    1     True
    2    False
    3     True
    dtype: bool
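For reuse, the steps above could be wrapped in a small function. This is only a sketch: `drop_cross_duplicates` is a hypothetical helper name rather than a pandas API, and it assumes that all values in a row can be compared with each other (for example, all strings), since np.sort needs to order them.

```python
import numpy as np
import pandas as pd


def drop_cross_duplicates(df, keep="first"):
    """Drop rows that hold the same values as another row, in any column order.

    Hypothetical helper for illustration; `keep` is passed on to duplicated().
    """
    # Sort the values inside each row so that column order no longer matters
    sorted_rows = pd.DataFrame(np.sort(df.to_numpy(), axis=1), index=df.index)
    # duplicated() marks the repeated rows; ~ inverts the mask to keep the rest
    return df[~sorted_rows.duplicated(keep=keep)]


df = pd.DataFrame(
    [["x", "y", "e", "f"], ["e", "f", "x", "y"], ["w", "v", "s", "t"]],
    columns=["a", "b", "c", "d"],
    index=["1", "2", "3"],
)

result = drop_cross_duplicates(df)
print(result)
```

Passing keep="last" would keep row 2 instead of row 1, because duplicated() then marks the earlier occurrence as the duplicate.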
