How to remove duplicates from a DataFrame, taking into account the value of another column

When I drop John as a duplicate specifying "name" as the column name:

    import pandas as pd

    data = {'name': ['Bill', 'Steve', 'John', 'John', 'John'],
            'age': [21, 28, 22, 30, 29]}
    df = pd.DataFrame(data)
    df = df.drop_duplicates('name')

pandas removes all duplicate rows, keeping only the first occurrence:

       age   name
    0   21   Bill
    1   28  Steve
    2   22   John

Instead, I would like to keep the row where John has the highest age (in this example, age 30). How can I achieve this?

1 answer

Try the following:

    In [75]: df
    Out[75]:
       age   name
    0   21   Bill
    1   28  Steve
    2   22   John
    3   30   John
    4   29   John

    In [76]: df.sort_values('age').drop_duplicates('name', keep='last')
    Out[76]:
       age   name
    0   21   Bill
    1   28  Steve
    3   30   John
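Note that the result of the sort-then-drop approach comes back ordered by age, not in the original row order; with this sample data the two happen to coincide. A minimal sketch, assuming you want the original order restored, adds `sort_index()` at the end (the variable name `oldest` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Bill', 'Steve', 'John', 'John', 'John'],
    'age': [21, 28, 22, 30, 29],
})

# Sort by age so the highest age for each name comes last,
# keep that last row per name, then restore the original row order.
oldest = (
    df.sort_values('age')
      .drop_duplicates('name', keep='last')
      .sort_index()
)
print(oldest)
```

If the order of the result does not matter, the `sort_index()` call can simply be left out.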

Or, depending on your goal, keep the last occurrence of each name regardless of age:

    In [77]: df.drop_duplicates('name', keep='last')
    Out[77]:
       age   name
    0   21   Bill
    1   28  Steve
    4   29   John
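As another option for the original goal (keeping the row with the highest age per name), the rows can be selected without sorting the whole frame. This is a sketch of an alternative technique, not part of the answer above: it uses `groupby(...).idxmax()` to find the index label of the maximum age in each group, then `loc` to pick those rows. The names `idx` and `result` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Bill', 'Steve', 'John', 'John', 'John'],
    'age': [21, 28, 22, 30, 29],
})

# For each name, find the index label of the row with the highest age.
idx = df.groupby('name')['age'].idxmax()

# Select those rows and put them back in their original order.
result = df.loc[idx].sort_index()
print(result)
```

When several rows of the same name share the maximum age, `idxmax` returns the first of them, so exactly one row per name is kept.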

Source: https://habr.com/ru/post/1258301/

