How to delete unique rows in a pandas DataFrame?

Question

I am stuck on a seemingly easy problem: discarding the unique rows of a pandas DataFrame. Basically, the opposite of drop_duplicates().

Let's say this is my data:

         A  B  C
    0  foo  0  A
    1  foo  1  A
    2  foo  1  B
    3  bar  1  A

I would like to delete the rows whose combination of A and B is unique; that is, I would like to keep only rows 1 and 2.

I tried the following:

    import pandas as pd

    # Load Dataframe
    df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"], "B": [0, 1, 1, 1], "C": ["A", "A", "B", "A"]})

    uniques = df[['A', 'B']].drop_duplicates()
    duplicates = df[~df.index.isin(uniques.index)]

But I only get row 2, since rows 0, 1 and 3 are in uniques!
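The reason only one row comes back is that drop_duplicates defaults to keep='first', so the first row of every duplicated group stays in uniques alongside the truly unique rows. A minimal sketch comparing the three keep settings on the data above (the variable names are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"],
                   "B": [0, 1, 1, 1],
                   "C": ["A", "A", "B", "A"]})

keys = df[["A", "B"]]

# keep='first' (the default): the first row of every (A, B) group survives
first_kept = keys.drop_duplicates(keep="first").index.tolist()
# keep='last': the last row of every (A, B) group survives
last_kept = keys.drop_duplicates(keep="last").index.tolist()
# keep=False: every row that has a duplicate is dropped
none_kept = keys.drop_duplicates(keep=False).index.tolist()

print(first_kept)  # [0, 1, 3]
print(last_kept)   # [0, 2, 3]
print(none_kept)   # [0, 3]
```

Only keep=False leaves exactly the rows whose (A, B) combination occurs once.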

python pandas dataframe

toto_tico Jul 03 '17 at 15:04

2 answers

I came up with a solution using groupby:

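    # Count the rows in each (A, B) group
    grouped = df.groupby(['A', 'B']).size().reset_index(name='count')
    # Attach each group's count to the original rows. The index of grouped
    # numbers the groups, not the original rows, so it cannot be used to
    # filter df directly. merge() returns a fresh index; restore the original.
    counted = df.merge(grouped, on=['A', 'B'], how='left').set_index(df.index)
    duplicates = counted.loc[counted['count'] > 1, df.columns]

duplicates now holds the expected rows:

         A  B  C
    1  foo  1  A
    2  foo  1  B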

Also, my initial attempt in the question can be fixed by simply adding keep=False to the drop_duplicates call:

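    import pandas as pd

    # Load Dataframe
    df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"], "B": [0, 1, 1, 1], "C": ["A", "A", "B", "A"]})

    # keep=False drops every row that has a duplicate, leaving only unique rows
    uniques = df[['A', 'B']].drop_duplicates(keep=False)
    duplicates = df[~df.index.isin(uniques.index)]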

Please prefer @jezrael's answer; I think it is safer (?), since I rely on the pandas index here.
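The caveat about relying on the index can be made concrete. The sketch below reuses the question's data but gives it a hypothetical non-unique index (an assumption made purely for illustration), then compares the index-based filter with a boolean mask built from duplicated:

```python
import pandas as pd

# Same data as the question, but with a hypothetical non-unique index
df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"],
                   "B": [0, 1, 1, 1],
                   "C": ["A", "A", "B", "A"]},
                  index=[0, 0, 1, 1])

# Index-based approach: the unique rows carry labels 0 and 1, and those
# labels are shared with the duplicated rows, so everything is filtered out
uniques = df[["A", "B"]].drop_duplicates(keep=False)
by_index = df[~df.index.isin(uniques.index)]

# Mask-based approach: works row by row, independent of the index labels
by_mask = df[df.duplicated(subset=["A", "B"], keep=False)]

print(len(by_index))  # 0
print(len(by_mask))   # 2
```

With a unique index both approaches agree; the mask-based one simply does not depend on that condition.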

toto_tico Jul 03 '17 at 15:04

jezrael · Accepted Answer · 2017-07-03T15:08:01+0000

Solutions for selecting all duplicate rows:

You can use duplicated with a subset and the parameter keep=False to select all duplicates:

    df = df[df.duplicated(subset=['A', 'B'], keep=False)]
    print(df)

         A  B  C
    1  foo  1  A
    2  foo  1  B

Solution with transform:

    df = df[df.groupby(['A', 'B'])['A'].transform('size') > 1]
    print(df)

         A  B  C
    1  foo  1  A
    2  foo  1  B
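To see why the two approaches select the same rows, it can help to look at the intermediate values before they are used for filtering. A small sketch on the question's original data (dup_mask and group_sizes are illustrative names, not part of the answer):

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"],
                   "B": [0, 1, 1, 1],
                   "C": ["A", "A", "B", "A"]})

# True for every row whose (A, B) combination occurs more than once
dup_mask = df.duplicated(subset=["A", "B"], keep=False)

# Size of the (A, B) group each row belongs to, aligned with df's index
group_sizes = df.groupby(["A", "B"])["A"].transform("size")

print(dup_mask.tolist())     # [False, True, True, False]
print(group_sizes.tolist())  # [1, 2, 2, 1]

# Comparing the sizes with 1 reproduces the duplicated() mask
same = (group_sizes > 1).tolist() == dup_mask.tolist()
print(same)  # True
```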

Slightly modified solutions for selecting all unique rows:

    # invert boolean mask by ~
    df = df[~df.duplicated(subset=['A', 'B'], keep=False)]
    print(df)

         A  B  C
    0  foo  0  A
    3  bar  1  A

    df = df[df.groupby(['A', 'B'])['A'].transform('size') == 1]
    print(df)

         A  B  C
    0  foo  0  A
    3  bar  1  A
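If both halves are needed at once, the mask can be computed a single time and used in both directions. The helper below is a hypothetical convenience wrapper (split_by_key_uniqueness is not a pandas function), sketched under the assumption that the original DataFrame should be left untouched:

```python
import pandas as pd


def split_by_key_uniqueness(frame, subset):
    """Return (duplicates, uniques) for the given key columns.

    Hypothetical helper: rows whose key combination occurs more than once
    go into the first frame, all remaining rows into the second.
    """
    mask = frame.duplicated(subset=subset, keep=False)
    return frame[mask], frame[~mask]


df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"],
                   "B": [0, 1, 1, 1],
                   "C": ["A", "A", "B", "A"]})

duplicates, uniques = split_by_key_uniqueness(df, ["A", "B"])

print(duplicates.index.tolist())  # [1, 2]
print(uniques.index.tolist())     # [0, 3]
```

Assigning the results to new names instead of back to df keeps the original data available for further filtering.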
