Pandas - aggregate, sort and largest inside groupby

I have the following frame:

```
                     some_id
2016-12-26 11:03:10      001
2016-12-26 11:03:13      001
2016-12-26 12:03:13      001
2016-12-26 12:03:13      008
2016-12-27 11:03:10      009
2016-12-27 11:03:13      009
2016-12-27 12:03:13      003
2016-12-27 12:03:13      011
```

I need to do something like transform('size'), followed by sorting, and then take the N largest values within each group, to get something like this (N = 2):

```
            some_id  size
2016-12-26  001         3
            008         1
2016-12-27  009         2
            003         1
```

Is there an elegant way to do this in pandas 0.19.x?
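For concreteness, this is the direction I mean; a sketch that rebuilds the frame above and does the transform('size') part, but not yet the sorting and top-N selection:

```python
import pandas as pd

# Rebuild the frame shown above.
idx = pd.to_datetime(['2016-12-26 11:03:10', '2016-12-26 11:03:13',
                      '2016-12-26 12:03:13', '2016-12-26 12:03:13',
                      '2016-12-27 11:03:10', '2016-12-27 11:03:13',
                      '2016-12-27 12:03:13', '2016-12-27 12:03:13'])
df = pd.DataFrame({'some_id': ['001', '001', '001', '008',
                               '009', '009', '003', '011']}, index=idx)

# Count rows per (day, some_id) -- the transform('size') step.
df['size'] = df.groupby([df.index.date, 'some_id'])['some_id'].transform('size')
```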

+5
4 answers

Use value_counts to count the distinct values after grouping on the date part of your DatetimeIndex. By default, the counts are sorted in descending order.

You then just need to take the top two rows of each group's result to get the largest counts.

```python
fnc = lambda x: x.value_counts().head(2)
grp = df.groupby(df.index.date)['some_id'].apply(fnc).reset_index(1, name='size')
grp.rename(columns={'level_1': 'some_id'})
```

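If N should be a parameter rather than a hard-coded 2, the same recipe generalizes; N here is just an illustrative variable, not something from the original answer:

```python
N = 2  # number of top ids to keep per day

fnc = lambda x: x.value_counts().head(N)
grp = (df.groupby(df.index.date)['some_id']
         .apply(fnc)
         .reset_index(1, name='size')
         .rename(columns={'level_1': 'some_id'}))
```

Note that 003 and 011 both occur once on 2016-12-27, so which of the two survives the head(2) cutoff depends on how value_counts happens to order tied counts.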

+4

Setup

```python
from io import StringIO
import pandas as pd

txt = """                     some_id
2016-12-26 11:03:10      001
2016-12-26 11:03:13      001
2016-12-26 12:03:13      001
2016-12-26 12:03:13      008
2016-12-27 11:03:10      009
2016-12-27 11:03:13      009
2016-12-27 12:03:13      003
2016-12-27 12:03:13      011"""

df = pd.read_csv(StringIO(txt), sep='\s{2,}', engine='python')
df.index = pd.to_datetime(df.index)
df.some_id = df.some_id.astype(str).str.zfill(3)

df
```

```
                    some_id
2016-12-26 11:03:10     001
2016-12-26 11:03:13     001
2016-12-26 12:03:13     001
2016-12-26 12:03:13     008
2016-12-27 11:03:10     009
2016-12-27 11:03:13     009
2016-12-27 12:03:13     003
2016-12-27 12:03:13     011
```

Using nlargest

```python
df.groupby(pd.TimeGrouper('D')).some_id.value_counts() \
  .groupby(level=0, group_keys=False).nlargest(2)
```

```
            some_id
2016-12-26  001        3
            008        1
2016-12-27  009        2
            003        1
Name: some_id, dtype: int64
```
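A side note for readers on newer pandas: pd.TimeGrouper was deprecated in 0.21 in favour of pd.Grouper, so the forward-compatible spelling of the same grouping would be:

```python
# Same daily grouping, using pd.Grouper instead of the deprecated TimeGrouper.
df.groupby(pd.Grouper(freq='D')).some_id.value_counts() \
  .groupby(level=0, group_keys=False).nlargest(2)
```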
+2

You should be able to do this in one line.

```python
df.resample('D')['some_id'].apply(lambda s: s.value_counts().iloc[:2])
```
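The result is a Series of counts with a (day, some_id) MultiIndex. If you want the counts labelled size as in the question's expected output, renaming the Series is enough; top2 is just an illustrative name:

```python
# Top two ids per day; name the counts 'size' to match the desired output.
top2 = df.resample('D')['some_id'].apply(lambda s: s.value_counts().iloc[:2])
top2 = top2.rename('size')
```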
+2

If you already have a size column, you can use the following.

```python
df.groupby('some_id')['size'].value_counts().groupby(level=0).nlargest(2)
```

Otherwise, you can use this approach.

```python
import pandas as pd

df = pd.DataFrame({'some_id':  [1, 1, 1, 8, 9, 9, 3, 11],
                   'some_idx': [26, 26, 26, 26, 27, 27, 27, 27]})

sizes = df.groupby(['some_id', 'some_idx']).size()
sizes.groupby(level='some_idx').nlargest(2)
# some_idx  some_id  some_idx
# 26        1        26          3
#           8        26          1
# 27        9        27          2
#           3        27          1
```
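The second groupby prepends its key again, which is why some_idx shows up twice in the index above. If that bothers you, the duplicated level can be dropped; a small sketch, with top as an illustrative name:

```python
# Drop the duplicated some_idx level added by the second groupby.
top = sizes.groupby(level='some_idx').nlargest(2).reset_index(level=0, drop=True)
```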
0
