Pandas groupby category, rating, get the maximum value from each category?

Question

Pandas groupby category, rating, get the maximum value from each category?

The first question about SO is very new to pandas and still a little shaky in terminology: I'm trying to figure out the correct syntax / sequence of operations on a data frame so that I can group by column B, find the maximum (or min) corresponding value for each group in the column C and get the corresponding value for column A.

Suppose this is my data file:

name type votes bob dog 10 pete cat 8 fluffy dog 5 max cat 9

Using df.groupby('type').votes.agg('max') returns:

 dog 10 cat 9

So far so good. However, I would like to figure out how to do this:

 dog 10 bob cat 9 max

I got to df.groupby(['type', 'votes']).name.agg('max') , although it returns

 dog 5 fluffy 10 bob cat 8 pete 9 max

... which is great for this pretending data frame, but doesn’t help much when dealing with much larger.

Thank you very much!

+6

python pandas dataframe

user3817675 Jun 17 '15 at 0:39

source share

1 answer

unutbu · Accepted Answer · 2015-06-17T00:47:07+0000

If df has an index without duplicate values, you can use idxmax to return the maximum row index for each group. Then use df.loc to select the entire line:

 In [322]: df.loc[df.groupby('type').votes.agg('idxmax')] Out[322]: name type votes 3 max cat 9 0 bob dog 10

If df.index has duplicate values, i.e. is not a unique index, then first make the index unique:

 df = df.reset_index()

then use idxmax :

 result = df.loc[df.groupby('type').votes.agg('idxmax')]

If you really need to, you can return df to its original state:

 df = df.set_index(['index'], drop=True)

but overall life is much better with a unique index.

Here is an example showing what happens when df does not have a unique index. Assume index AABB :

 import pandas as pd df = pd.DataFrame({'name': ['bob', 'pete', 'fluffy', 'max'], 'type': ['dog', 'cat', 'dog', 'cat'], 'votes': [10, 8, 5, 9]}, index=list('AABB')) print(df) # name type votes # A bob dog 10 # A pete cat 8 # B fluffy dog 5 # B max cat 9

idxmax returns index values A and B :

 print(df.groupby('type').votes.agg('idxmax')) type cat B dog A Name: votes, dtype: object

But A and B do not explicitly indicate the required lines. df.loc[...] returns all rows whose index value is A or B :

 print(df.loc[df.groupby('type').votes.agg('idxmax')]) # name type votes # B fluffy dog 5 # B max cat 9 # A bob dog 10 # A pete cat 8

In contrast, if the reset index is:

 df = df.reset_index() # index name type votes # 0 A bob dog 10 # 1 A pete cat 8 # 2 B fluffy dog 5 # 3 B max cat 9

then df.loc can be used to select the desired lines:

 print(df.groupby('type').votes.agg('idxmax')) # type # cat 3 # dog 0 # Name: votes, dtype: int64 print(df.loc[df.groupby('type').votes.agg('idxmax')]) # index name type votes # 3 B max cat 9 # 0 A bob dog 10

Pandas groupby category, rating, get the maximum value from each category?

More articles: