How to conditionally remove duplicates from a pandas DataFrame

Consider the following DataFrame:

 import numpy as np
 import pandas as pd

 df = pd.DataFrame({'A': [1, 2, 3, 3, 4, 4, 5, 6, 7],
                    'B': ['a', 'b', 'c', 'c', 'd', 'd', 'e', 'f', 'g'],
                    'Col_1': [np.nan, 'A', 'A', np.nan, 'B', np.nan, 'B', np.nan, np.nan],
                    'Col_2': [2, 2, 3, 3, 3, 3, 4, 4, 5]})
 df
 Out[92]:
    A  B Col_1  Col_2
 0  1  a   NaN      2
 1  2  b     A      2
 2  3  c     A      3
 3  3  c   NaN      3
 4  4  d     B      3
 5  4  d   NaN      3
 6  5  e     B      4
 7  6  f   NaN      4
 8  7  g   NaN      5

I want to delete all rows that are duplicates with respect to columns 'A' and 'B'. From each duplicate pair I want to delete the record that contains a NaN (I know that for every duplicate there will be one NaN record and one non-NaN record). The final result should look like this:

    A  B Col_1  Col_2
 0  1  a   NaN      2
 1  2  b     A      2
 2  3  c     A      3
 4  4  d     B      3
 6  5  e     B      4
 7  6  f   NaN      4
 8  7  g   NaN      5

Any effective one-liner is welcome.

3 answers

Here's an alternative:

 df[~((df[['A', 'B']].duplicated(keep=False)) & (df.isnull().any(axis=1)))]
 #    A  B Col_1  Col_2
 # 0  1  a   NaN      2
 # 1  2  b     A      2
 # 2  3  c     A      3
 # 4  4  d     B      3
 # 6  5  e     B      4
 # 7  6  f   NaN      4
 # 8  7  g   NaN      5

This uses the bitwise "not" operator ~ to negate rows that meet the joint condition of being a duplicate row (the keep=False argument causes the method to evaluate to True for all non-unique rows) and containing at least one null value. So the expression df[['A', 'B']].duplicated(keep=False) returns this series:

 # 0    False
 # 1    False
 # 2     True
 # 3     True
 # 4     True
 # 5     True
 # 6    False
 # 7    False
 # 8    False

... and the expression df.isnull().any(axis=1) returns this series:

 # 0     True
 # 1    False
 # 2    False
 # 3     True
 # 4    False
 # 5     True
 # 6    False
 # 7     True
 # 8     True
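
Combining the two with & isolates exactly the rows that are both duplicated and contain a null (rows 3 and 5). A quick sketch of that intermediate mask:

 # Intermediate mask: duplicated on ['A', 'B'] AND containing a null somewhere.
 (df[['A', 'B']].duplicated(keep=False)) & (df.isnull().any(axis=1))
 # 0    False
 # 1    False
 # 2    False
 # 3     True
 # 4    False
 # 5     True
 # 6    False
 # 7    False
 # 8    False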

Each condition is wrapped in parentheses (required by pandas when combining multiple boolean expressions in an indexing operation), and the combined expression is wrapped once more so that the whole thing can be negated (i.e. ~( ... )). The example below also chains one further condition, df['Col_2'] != 5, onto the negated mask with &:

 ~((df[['A','B']].duplicated(keep=False)) & (df.isnull().any(axis=1))) & (df['Col_2'] != 5)
 # 0     True
 # 1     True
 # 2     True
 # 3    False
 # 4     True
 # 5    False
 # 6     True
 # 7     True
 # 8    False

You can build more complex conditions by chaining the logical operators & and | (the "or" operator). As in SQL, group your conditions as needed with additional parentheses; for example, a filter based on the logic "both condition X and condition Y are true, or condition Z is true" is written df[((X) & (Y)) | (Z)].
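
As a concrete sketch of that pattern (the conditions here are arbitrary, chosen only to illustrate the grouping):

 # Keep rows where (A > 2 and Col_2 == 3) or Col_1 is null.
 mask = ((df['A'] > 2) & (df['Col_2'] == 3)) | (df['Col_1'].isnull())
 df[mask]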


If the intent is to drop only the NaN duplicates, a slightly more involved solution is required.

Sort on A, B and Col_1, so that within each group the NaN row moves to the bottom. Then call df.drop_duplicates with keep='first':

 out = df.sort_values(['A', 'B', 'Col_1']).drop_duplicates(['A', 'B'], keep='first')
 print(out)
    A  B Col_1  Col_2
 0  1  a   NaN      2
 1  2  b     A      2
 2  3  c     A      3
 4  4  d     B      3
 6  5  e     B      4
 7  6  f   NaN      4
 8  7  g   NaN      5
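
This relies on sort_values placing NaN last, which is its default behavior; a minimal sketch spelling that out with the na_position parameter:

 # na_position='last' is the default; making it explicit documents why the
 # NaN row in each duplicate pair is the one dropped by keep='first'.
 out = (df.sort_values(['A', 'B', 'Col_1'], na_position='last')
          .drop_duplicates(['A', 'B'], keep='first'))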

Or you can simply use first(): it returns the first non-null value in each group, so the order of the original input does not matter.

 df.groupby(['A','B']).first()
 Out[180]:
      Col_1  Col_2
 A B
 1 a   NaN      2
 2 b     A      2
 3 c     A      3
 4 d     B      3
 5 e     B      4
 6 f   NaN      4
 7 g   NaN      5
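
Note that groupby moves A and B into the index. If the original column layout is preferred, a small sketch using as_index=False (an equivalent reset_index() call also works):

 # as_index=False keeps A and B as regular columns in the result.
 df.groupby(['A', 'B'], as_index=False).first()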

Source: https://habr.com/ru/post/1271275/

