Pandas groupby category, rating, get the maximum value from each category?

The first question about SO is very new to pandas and still a little shaky in terminology: I'm trying to figure out the correct syntax / sequence of operations on a data frame so that I can group by column B, find the maximum (or min) corresponding value for each group in the column C and get the corresponding value for column A.

Suppose this is my data file:

name type votes bob dog 10 pete cat 8 fluffy dog 5 max cat 9 

Using df.groupby('type').votes.agg('max') returns:

 dog 10 cat 9 

So far so good. However, I would like to figure out how to do this:

 dog 10 bob cat 9 max 

I got to df.groupby(['type', 'votes']).name.agg('max') , although it returns

 dog 5 fluffy 10 bob cat 8 pete 9 max 

... which is great for this pretending data frame, but doesn’t help much when dealing with much larger.

Thank you very much!

+6
source share
1 answer

If df has an index without duplicate values, you can use idxmax to return the maximum row index for each group. Then use df.loc to select the entire line:

 In [322]: df.loc[df.groupby('type').votes.agg('idxmax')] Out[322]: name type votes 3 max cat 9 0 bob dog 10 

If df.index has duplicate values, i.e. is not a unique index, then first make the index unique:

 df = df.reset_index() 

then use idxmax :

 result = df.loc[df.groupby('type').votes.agg('idxmax')] 

If you really need to, you can return df to its original state:

 df = df.set_index(['index'], drop=True) 

but overall life is much better with a unique index.


Here is an example showing what happens when df does not have a unique index. Assume index AABB :

 import pandas as pd df = pd.DataFrame({'name': ['bob', 'pete', 'fluffy', 'max'], 'type': ['dog', 'cat', 'dog', 'cat'], 'votes': [10, 8, 5, 9]}, index=list('AABB')) print(df) # name type votes # A bob dog 10 # A pete cat 8 # B fluffy dog 5 # B max cat 9 

idxmax returns index values A and B :

 print(df.groupby('type').votes.agg('idxmax')) type cat B dog A Name: votes, dtype: object 

But A and B do not explicitly indicate the required lines. df.loc[...] returns all rows whose index value is A or B :

 print(df.loc[df.groupby('type').votes.agg('idxmax')]) # name type votes # B fluffy dog 5 # B max cat 9 # A bob dog 10 # A pete cat 8 

In contrast, if the reset index is:

 df = df.reset_index() # index name type votes # 0 A bob dog 10 # 1 A pete cat 8 # 2 B fluffy dog 5 # 3 B max cat 9 

then df.loc can be used to select the desired lines:

 print(df.groupby('type').votes.agg('idxmax')) # type # cat 3 # dog 0 # Name: votes, dtype: int64 print(df.loc[df.groupby('type').votes.agg('idxmax')]) # index name type votes # 3 B max cat 9 # 0 A bob dog 10 
+6
source

Source: https://habr.com/ru/post/989256/


All Articles