Group by year / month / day in pandas

Question

Suppose you have the following DataFrame:

    import numpy as np
    import pandas as pd

    rng = pd.date_range('1/1/2011', periods=72, freq='H')
    np.random.seed(10)
    n = 10
    df = pd.DataFrame(
        {
            "datetime": np.random.choice(rng, n),
            "cat": np.random.choice(['a', 'b', 'b'], n),
            "val": np.random.randint(0, 5, size=n),
        }
    )

If I now group by cat and datetime:

 gb = df.groupby(['cat','datetime']).sum()

I get the totals for each cat for every hour:

                             val
    cat datetime
    a   2011-01-01 00:00:00    1
        2011-01-01 09:00:00    3
        2011-01-02 16:00:00    1
        2011-01-03 16:00:00    1
    b   2011-01-01 08:00:00    4
        2011-01-01 15:00:00    3
        2011-01-01 16:00:00    3
        2011-01-02 04:00:00    4
        2011-01-02 05:00:00    1
        2011-01-02 12:00:00    4

However, I would like to have something like:

                    val
    cat datetime
    a   2011-01-01    4
        2011-01-02    1
        2011-01-03    1
    b   2011-01-01   10
        2011-01-02    9

I could get the desired result by adding another column called date:

    df['date'] = df['datetime'].dt.date

and then do the same groupby: df.groupby(['cat','date']).sum(). But is there a more Pythonic way to do this? I would also like to be able to group by month or by year. What is the right way to do that?

python pandas data-analysis business-intelligence

Drror Mar 09 '16 at 15:32

2 answers

jezrael · Answer 1 · 2016-03-09T15:53:18+0000

You can try set_index on the datetime column and then groupby on cat and the date of each index value:

    import pandas as pd
    import numpy as np

    rng = pd.date_range('1/1/2011', periods=72, freq='H')
    np.random.seed(10)
    n = 10
    df = pd.DataFrame(
        {
            "datetime": np.random.choice(rng, n),
            "cat": np.random.choice(['a', 'b', 'b'], n),
            "val": np.random.randint(0, 5, size=n),
        },
        columns=["cat", "datetime", "val"],  # fix the column order shown below
    )
    print(df)

      cat            datetime  val
    0   a 2011-01-01 09:00:00    3
    1   b 2011-01-01 15:00:00    3
    2   a 2011-01-03 16:00:00    1
    3   b 2011-01-02 04:00:00    4
    4   b 2011-01-02 05:00:00    1
    5   b 2011-01-01 08:00:00    4
    6   a 2011-01-01 00:00:00    1
    7   a 2011-01-02 16:00:00    1
    8   b 2011-01-02 12:00:00    4
    9   b 2011-01-01 16:00:00    3

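    # Move the timestamps into the index so the function key receives each Timestamp
    df = df.set_index('datetime')
    # Group by category and by the calendar date of each index value
    gb = df.groupby(['cat', lambda x: x.date()]).sum()
    print(gb)

                    val
    cat
    a   2011-01-01    4
        2011-01-02    1
        2011-01-03    1
    b   2011-01-01   10
        2011-01-02    9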
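The grouping key can also be derived on the fly with the .dt accessor, which avoids both the helper column and the index change. The sketch below is not part of the original answer. It rebuilds the ten sample rows by hand from the printout above instead of relying on the random seed, and the names stamp, daily, monthly and yearly are purely illustrative.

```python
import pandas as pd

# The ten sample rows from the printout above, typed in by hand (assumption:
# this stands in for the randomly generated frame)
df = pd.DataFrame({
    "cat": ["a", "b", "a", "b", "b", "b", "a", "a", "b", "b"],
    "datetime": pd.to_datetime([
        "2011-01-01 09:00", "2011-01-01 15:00", "2011-01-03 16:00",
        "2011-01-02 04:00", "2011-01-02 05:00", "2011-01-01 08:00",
        "2011-01-01 00:00", "2011-01-02 16:00", "2011-01-02 12:00",
        "2011-01-01 16:00",
    ]),
    "val": [3, 3, 1, 4, 1, 4, 1, 1, 4, 3],
})

stamp = df["datetime"]

# Day: normalize() resets the time to midnight but keeps the datetime64 dtype
daily = df.groupby(["cat", stamp.dt.normalize()])["val"].sum()

# Month: to_period("M") collapses each timestamp to its calendar month
monthly = df.groupby(["cat", stamp.dt.to_period("M")])["val"].sum()

# Year: .dt.year yields plain integers
yearly = df.groupby(["cat", stamp.dt.year])["val"].sum()

print(daily)
```

Because the key is computed from the column rather than stored, switching between day, month and year only means changing the expression passed to groupby.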

Randy c · Answer 2 · 2016-03-09T16:16:02+0000

Starting from your intermediate structure, you can use .unstack to spread the categories into columns, apply .resample, and then .stack again to get back to the original shape:

    In [126]: gb = df.groupby(['cat', 'datetime']).sum()

    In [127]: gb.unstack(0)
    Out[127]:
                         val
    cat                    a    b
    datetime
    2011-01-01 00:00:00  1.0  NaN
    2011-01-01 08:00:00  NaN  4.0
    2011-01-01 09:00:00  3.0  NaN
    2011-01-01 15:00:00  NaN  3.0
    2011-01-01 16:00:00  NaN  3.0
    2011-01-02 04:00:00  NaN  4.0
    2011-01-02 05:00:00  NaN  1.0
    2011-01-02 12:00:00  NaN  4.0
    2011-01-02 16:00:00  1.0  NaN
    2011-01-03 16:00:00  1.0  NaN

    In [128]: gb.unstack(0).resample("D").sum().stack()
    Out[128]:
                     val
    datetime   cat
    2011-01-01 a     4.0
               b    10.0
    2011-01-02 a     1.0
               b     9.0
    2011-01-03 a     1.0

EDIT: For other resample frequencies (month, year, etc.), the pandas resample documentation has a good list of the available options.
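To illustrate how the frequency string changes the bins, here is a minimal sketch that uses pd.Grouper in place of the unstack / resample / stack round trip. This is a swapped-in technique, not the method of this answer. It assumes the installed pandas accepts the aliases "D", "MS" (month start) and "YS" (year start). The helper total_by is hypothetical, and the sample rows are copied by hand from the printout in the first answer.

```python
import pandas as pd

# Sample rows copied by hand from the first answer's printout (assumption:
# this stands in for the randomly generated frame)
df = pd.DataFrame({
    "cat": ["a", "b", "a", "b", "b", "b", "a", "a", "b", "b"],
    "datetime": pd.to_datetime([
        "2011-01-01 09:00", "2011-01-01 15:00", "2011-01-03 16:00",
        "2011-01-02 04:00", "2011-01-02 05:00", "2011-01-01 08:00",
        "2011-01-01 00:00", "2011-01-02 16:00", "2011-01-02 12:00",
        "2011-01-01 16:00",
    ]),
    "val": [3, 3, 1, 4, 1, 4, 1, 1, 4, 3],
})


def total_by(frame, freq):
    """Hypothetical helper: sum `val` per category and per time bin of `freq`."""
    key = pd.Grouper(key="datetime", freq=freq)
    return frame.groupby(["cat", key])["val"].sum()


daily = total_by(df, "D")     # one bin per calendar day
monthly = total_by(df, "MS")  # bins labelled by the first day of the month
yearly = total_by(df, "YS")   # bins labelled by the first day of the year

print(monthly)
```

Each bin is labelled with the timestamp at the start of its day, month or year, so the result keeps a datetime level in its index whichever frequency is chosen.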
