Pandas: remove backward duplicates from dataframe

I have a data frame with two columns, A and B. In this context, the order of A and B is unimportant; for example, I would consider (0, 50) and (50, 0) to be duplicates. In pandas, what is an efficient way to remove these duplicates from a data frame?

    import pandas as pd

    # Initial data frame.
    data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                         'B': [50, 22, 35, 5, 10, 11, 21, 0]})
    data
        A   B
    0   0  50
    1  10  22
    2  11  35
    3  21   5
    4  22  10
    5  35  11
    6   5  21
    7  50   0

    # Desired output with "duplicates" removed.
    data2 = pd.DataFrame({'A': [0, 5, 10, 11],
                          'B': [50, 21, 22, 35]})
    data2
        A   B
    0   0  50
    1   5  21
    2  10  22
    3  11  35

Ideally, the output will be sorted by column A values.

2 answers

You can sort each row of data before deleting duplicates:

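    # result_type='broadcast' keeps the original columns A and B; without it,
    # recent pandas versions return a Series of lists, which cannot be deduplicated.
    data.apply(lambda r: sorted(r), axis=1, result_type='broadcast').drop_duplicates()
    #     A   B
    # 0   0  50
    # 1  10  22
    # 2  11  35
    # 3   5  21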

If you want the result sorted by column A:

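    # result_type='broadcast' keeps the original columns A and B; without it,
    # recent pandas versions return a Series of lists, which cannot be deduplicated.
    data.apply(lambda r: sorted(r), axis=1, result_type='broadcast').drop_duplicates().sort_values('A')
    #     A   B
    # 0   0  50
    # 3   5  21
    # 1  10  22
    # 2  11  35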

Here is a slightly uglier but much faster solution (it assumes numpy has been imported as np):

    In [44]: pd.DataFrame(np.sort(data.values, axis=1), columns=data.columns).drop_duplicates()
    Out[44]:
        A   B
    0   0  50
    1  10  22
    2  11  35
    3   5  21
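Putting the pieces together, a minimal self-contained sketch of this approach might look like the following. The helper name `drop_unordered_duplicates` and its `sort_by` parameter are hypothetical, not part of pandas, and the sketch assumes all columns hold mutually comparable values so that `np.sort` can order them within each row.

```python
import numpy as np
import pandas as pd


def drop_unordered_duplicates(df, sort_by=None):
    """Hypothetical helper: treat each row as an unordered set of values."""
    # Sort the values inside every row so (50, 0) and (0, 50) become identical.
    normalized = pd.DataFrame(
        np.sort(df.to_numpy(), axis=1),
        columns=df.columns,
        index=df.index,
    )
    result = normalized.drop_duplicates()
    if sort_by is not None:
        # Optional: order the rows and renumber the index from 0.
        result = result.sort_values(sort_by).reset_index(drop=True)
    return result


data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})
data2 = drop_unordered_duplicates(data, sort_by='A')
print(data2)
```

With `sort_by='A'` the index is renumbered from 0, which should match the desired `data2` frame from the question.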

Timing for an 8,000-row DataFrame:

    In [50]: big = pd.concat([data] * 10**3, ignore_index=True)

    In [51]: big.shape
    Out[51]: (8000, 2)

    In [52]: %timeit big.apply(lambda r: sorted(r), axis = 1).drop_duplicates()
    1 loop, best of 3: 3.04 s per loop

    In [53]: %timeit pd.DataFrame(np.sort(big.values, axis=1), columns=big.columns).drop_duplicates()
    100 loops, best of 3: 3.96 ms per loop

    In [59]: %timeit big.apply(np.sort, axis = 1).drop_duplicates()
    1 loop, best of 3: 2.69 s per loop
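Outside IPython, a comparison along these lines can be roughly reproduced with the standard `timeit` module. This is only a sketch: the function names `rowwise` and `vectorized` are made up for illustration, absolute timings depend on the machine and the pandas version, and the row-wise variant passes `result_type='broadcast'` on the assumption that a recent pandas is installed (where `apply` would otherwise return a Series of lists).

```python
import timeit

import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})
big = pd.concat([data] * 10**3, ignore_index=True)  # 8000 rows


def rowwise(df):
    # Python-level sort of each row; apply calls the lambda once per row.
    return df.apply(lambda r: sorted(r), axis=1,
                    result_type='broadcast').drop_duplicates()


def vectorized(df):
    # A single NumPy call sorts the values within all rows at once.
    return pd.DataFrame(np.sort(df.values, axis=1),
                        columns=df.columns).drop_duplicates()


t_row = timeit.timeit(lambda: rowwise(big), number=1)
t_vec = timeit.timeit(lambda: vectorized(big), number=10) / 10
print(f"row-wise apply: {t_row:.4f} s per call")
print(f"np.sort:        {t_vec:.4f} s per call")
```

Both functions should return the same four rows; only the time taken to get there differs.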

Source: https://habr.com/ru/post/1012122/

