Pandas - aggregate, sort and largest inside groupby

I have the following frame:

```
                     some_id
2016-12-26 11:03:10      001
2016-12-26 11:03:13      001
2016-12-26 12:03:13      001
2016-12-26 12:03:13      008
2016-12-27 11:03:10      009
2016-12-27 11:03:13      009
2016-12-27 12:03:13      003
2016-12-27 12:03:13      011
```

I need to do something like transform('size'), followed by sorting, and then take the N largest values within each group, to get something like this (N = 2):

```
            some_id  size
2016-12-26  001         3
            008         1
2016-12-27  009         2
            003         1
```

Is there an elegant way to do this in pandas 0.19.x?
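For concreteness, this is the direction I mean; a sketch that rebuilds the frame above and does the transform('size') part, but not yet the sorting and top-N selection:

```python
import pandas as pd

# Rebuild the frame shown above.
idx = pd.to_datetime(['2016-12-26 11:03:10', '2016-12-26 11:03:13',
                      '2016-12-26 12:03:13', '2016-12-26 12:03:13',
                      '2016-12-27 11:03:10', '2016-12-27 11:03:13',
                      '2016-12-27 12:03:13', '2016-12-27 12:03:13'])
df = pd.DataFrame({'some_id': ['001', '001', '001', '008',
                               '009', '009', '003', '011']}, index=idx)

# Count rows per (day, some_id) -- the transform('size') step.
df['size'] = df.groupby([df.index.date, 'some_id'])['some_id'].transform('size')
```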

+5
4 answers

Use value_counts to count the distinct values after grouping on the date part of your DatetimeIndex. By default, the counts are sorted in descending order.

You then just need to take the top two rows of each group's result to get the largest counts.

```python
fnc = lambda x: x.value_counts().head(2)
grp = df.groupby(df.index.date)['some_id'].apply(fnc).reset_index(1, name='size')
grp.rename(columns={'level_1': 'some_id'})
```

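If N should be a parameter rather than a hard-coded 2, the same recipe generalizes; N here is just an illustrative variable, not something from the original answer:

```python
N = 2  # number of top ids to keep per day

fnc = lambda x: x.value_counts().head(N)
grp = (df.groupby(df.index.date)['some_id']
         .apply(fnc)
         .reset_index(1, name='size')
         .rename(columns={'level_1': 'some_id'}))
```

Note that 003 and 011 both occur once on 2016-12-27, so which of the two survives the head(2) cutoff depends on how value_counts happens to order tied counts.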

+4

Setup

```python
from io import StringIO
import pandas as pd

txt = """                     some_id
2016-12-26 11:03:10      001
2016-12-26 11:03:13      001
2016-12-26 12:03:13      001
2016-12-26 12:03:13      008
2016-12-27 11:03:10      009
2016-12-27 11:03:13      009
2016-12-27 12:03:13      003
2016-12-27 12:03:13      011"""

df = pd.read_csv(StringIO(txt), sep='\s{2,}', engine='python')
df.index = pd.to_datetime(df.index)
df.some_id = df.some_id.astype(str).str.zfill(3)

df
```

```
                    some_id
2016-12-26 11:03:10     001
2016-12-26 11:03:13     001
2016-12-26 12:03:13     001
2016-12-26 12:03:13     008
2016-12-27 11:03:10     009
2016-12-27 11:03:13     009
2016-12-27 12:03:13     003
2016-12-27 12:03:13     011
```

Using nlargest

```python
df.groupby(pd.TimeGrouper('D')).some_id.value_counts() \
  .groupby(level=0, group_keys=False).nlargest(2)
```

```
            some_id
2016-12-26  001        3
            008        1
2016-12-27  009        2
            003        1
Name: some_id, dtype: int64
```
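A side note for readers on newer pandas: pd.TimeGrouper was deprecated in 0.21 in favour of pd.Grouper, so the forward-compatible spelling of the same grouping would be:

```python
# Same daily grouping, using pd.Grouper instead of the deprecated TimeGrouper.
df.groupby(pd.Grouper(freq='D')).some_id.value_counts() \
  .groupby(level=0, group_keys=False).nlargest(2)
```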
+2

You should be able to do this in one line.

```python
df.resample('D')['some_id'].apply(lambda s: s.value_counts().iloc[:2])
```
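The result is a Series of counts with a (day, some_id) MultiIndex. If you want the counts labelled size as in the question's expected output, renaming the Series is enough; top2 is just an illustrative name:

```python
# Top two ids per day; name the counts 'size' to match the desired output.
top2 = df.resample('D')['some_id'].apply(lambda s: s.value_counts().iloc[:2])
top2 = top2.rename('size')
```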
+2

If you already have a size column, you can use the following.

```python
df.groupby('some_id')['size'].value_counts().groupby(level=0).nlargest(2)
```

Otherwise, you can use this approach.

```python
import pandas as pd

df = pd.DataFrame({'some_id':  [1, 1, 1, 8, 9, 9, 3, 11],
                   'some_idx': [26, 26, 26, 26, 27, 27, 27, 27]})

sizes = df.groupby(['some_id', 'some_idx']).size()
sizes.groupby(level='some_idx').nlargest(2)
# some_idx  some_id  some_idx
# 26        1        26          3
#           8        26          1
# 27        9        27          2
#           3        27          1
```
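The second groupby prepends its key again, which is why some_idx shows up twice in the index above. If that bothers you, the duplicated level can be dropped; a small sketch, with top as an illustrative name:

```python
# Drop the duplicated some_idx level added by the second groupby.
top = sizes.groupby(level='some_idx').nlargest(2).reset_index(level=0, drop=True)
```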
0
