This may be easier to explain with a sample dataset.
Create sample data
Suppose we have one column of Timestamps, date, and another column we would like to aggregate, a.
    import pandas as pd

    df = pd.DataFrame({'date': pd.DatetimeIndex(['2012-1-1', '2012-6-1', '2015-1-1', '2015-2-1', '2015-3-1']),
                       'a': [9, 5, 1, 2, 3]},
                      columns=['date', 'a'])
    df

            date  a
    0 2012-01-01  9
    1 2012-06-01  5
    2 2015-01-01  1
    3 2015-02-01  2
    4 2015-03-01  3
There are several ways to group by year.
- Use the dt accessor with the year property
- Put date in the index and use an anonymous function to access the year
- Use the resample method
- Convert to a pandas Period
.dt accessor with year property
If you have a column (not an index) of pandas Timestamps, you can access many additional properties and methods through the dt accessor. For example:

    df['date'].dt.year

    0    2012
    1    2012
    2    2015
    3    2015
    4    2015
    Name: date, dtype: int64
We can use this to form our groups and calculate some aggregations in a specific column:
    df.groupby(df['date'].dt.year)['a'].agg(['sum', 'mean', 'max'])

          sum  mean  max
    date
    2012   14     7    9
    2015    6     2    3
Put date in index and use anonymous function to access year
If you set the date column as the index, it becomes a DatetimeIndex with the same properties and methods that the dt accessor gives ordinary columns.

    df1 = df.set_index('date')
    df1.index.year

    Int64Index([2012, 2012, 2015, 2015, 2015], dtype='int64', name='date')
Interestingly, the groupby method also accepts a function. That function is applied implicitly to the DataFrame's index, and the values it returns become the group keys. Thus, we can get the same result as above with the following:

    df1.groupby(lambda x: x.year)['a'].agg(['sum', 'mean', 'max'])

          sum  mean  max
    2012   14     7    9
    2015    6     2    3
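As a minimal sketch of the same idea, the lambda can be replaced by a named function, and the result compared with grouping on df1.index.year directly. The name year_of is made up for this illustration and is not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({'date': pd.DatetimeIndex(['2012-1-1', '2012-6-1', '2015-1-1',
                                             '2015-2-1', '2015-3-1']),
                   'a': [9, 5, 1, 2, 3]},
                  columns=['date', 'a'])
df1 = df.set_index('date')


def year_of(label):
    # Hypothetical helper: groupby applies it to the index, so `label`
    # carries a .year attribute.
    return label.year


# Group by the function's return value ...
by_func = df1.groupby(year_of)['a'].sum()

# ... or pass the years computed from the index explicitly.
by_index = df1.groupby(df1.index.year)['a'].sum()

print(by_func)
print(by_index)
```

Both calls should produce the same two groups, 2012 and 2015. Passing df1.index.year is arguably more explicit, while the function form is handy when the key needs extra logic.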
Use the resample method
If the date column is not in the index, you must name it with the on parameter. You also need to pass the offset alias as a string, here 'AS' for year start. Note that resample returns a row for every year in the range, including years that have no data.

    df.resample('AS', on='date')['a'].agg(['sum', 'mean', 'max'])

                 sum  mean  max
    date
    2012-01-01  14.0   7.0  9.0
    2013-01-01   NaN   NaN  NaN
    2014-01-01   NaN   NaN  NaN
    2015-01-01   6.0   2.0  3.0
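If the empty years are unwanted, one option is to drop them after resampling. The sketch below assumes the 'YS' alias, which to my knowledge is the long-standing synonym for 'AS' and the preferred spelling in recent pandas releases, where 'AS' is deprecated. It filters on the mean column because the sum of an empty group may be reported as 0 rather than NaN depending on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({'date': pd.DatetimeIndex(['2012-1-1', '2012-6-1', '2015-1-1',
                                             '2015-2-1', '2015-3-1']),
                   'a': [9, 5, 1, 2, 3]},
                  columns=['date', 'a'])

# 'YS' = year start; assumed here as the replacement for 'AS'.
yearly = df.resample('YS', on='date')['a'].agg(['sum', 'mean', 'max'])

# The mean of an empty year is NaN, so use it to drop years without rows.
non_empty = yearly.dropna(subset=['mean'])

print(non_empty)
```

This leaves only the 2012 and 2015 rows, matching what the groupby-based approaches return.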
Convert to pandas Period
You can also convert the date column to a pandas Period object. We must pass the offset alias as a string to determine the length of the Period.
    df['date'].dt.to_period('Y')

    0    2012
    1    2012
    2    2015
    3    2015
    4    2015
    Name: date, dtype: object

Then we can use this as the grouping key:

    df.groupby(df['date'].dt.to_period('Y'))['a'].agg(['sum', 'mean', 'max'])

          sum  mean  max
    2012   14     7    9
    2015    6     2    3
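To see how the four approaches differ, it may help to run them side by side and look at the index each one returns. This is only a sketch on the sample data above. It uses 'YS' instead of 'AS' for resample, on the assumption that the two aliases are equivalent.

```python
import pandas as pd

df = pd.DataFrame({'date': pd.DatetimeIndex(['2012-1-1', '2012-6-1', '2015-1-1',
                                             '2015-2-1', '2015-3-1']),
                   'a': [9, 5, 1, 2, 3]},
                  columns=['date', 'a'])

# 1. dt accessor: integer years as group keys
by_dt = df.groupby(df['date'].dt.year)['a'].sum()

# 2. function applied to the index: also integer years
by_func = df.set_index('date').groupby(lambda ts: ts.year)['a'].sum()

# 3. resample: a DatetimeIndex with one row per year, empty years included
by_resample = df.resample('YS', on='date')['a'].sum()

# 4. Period: a PeriodIndex with one entry per year that has data
by_period = df.groupby(df['date'].dt.to_period('Y'))['a'].sum()

for name, result in [('dt', by_dt), ('func', by_func),
                     ('resample', by_resample), ('period', by_period)]:
    print(name, type(result.index).__name__, len(result))
```

The groupby variants return only the years present in the data, while resample also returns the years in between. Which one fits depends on whether gaps in the timeline should be visible in the result.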