How to group pandas DataFrame rows by year in a non-unique date column

I have a pandas DataFrame with a column named "date" that contains non-unique datetime values. I can group the rows in this frame using:

 data.groupby(data['date']) 

However, this splits the data by the individual datetime values. I would like to group the data by the year stored in the "date" column. This page shows how to group by year in cases where the timestamp is used as an index, which is not true in my case.

How can I achieve this grouping?

+44
python pandas
Jul 09
5 answers

ecatmur's solution will work fine. However, this will perform better on large datasets:

 data.groupby(data['date'].map(lambda x: x.year)) 
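To make it clearer what the line above does, here is a minimal sketch on made-up data (the frame and its values are assumptions for illustration, not the asker's data). `map` builds a Series of years aligned with the rows, and `groupby` uses that Series as the grouping key.

```python
import pandas as pd

# Hypothetical sample data with repeated years in the "date" column.
data = pd.DataFrame({
    "date": pd.to_datetime(["2012-01-01", "2012-06-01", "2015-01-01",
                            "2015-02-01", "2015-03-01"]),
    "a": [9, 5, 1, 2, 3],
})

# One year per row, in the same order as the rows of `data`.
year_key = data["date"].map(lambda x: x.year)

# Group on the derived key and aggregate column "a".
totals = data.groupby(year_key)["a"].sum()
rows_per_year = data.groupby(year_key).size()
print(totals)
print(rows_per_year)
```

The key does not need to be stored as a column; any Series or array of the same length as the frame can be passed to `groupby`.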
+64
Jul 09

I am using pandas 0.16.2. This performs better on my large dataset:

 data.groupby(data.date.dt.year) 

With the dt accessor, grouping by weekofyear, dayofweek, etc. becomes much easier.
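As a rough sketch of grouping on other dt fields (the sample frame is an assumption for illustration): `dayofweek` and `month` are used here because they are available across pandas versions. As far as I know, `weekofyear` was deprecated and later removed from the dt accessor in newer releases in favor of `dt.isocalendar().week`, so check your version before relying on it.

```python
import pandas as pd

# Hypothetical sample data.
data = pd.DataFrame({
    "date": pd.to_datetime(["2012-01-01", "2012-06-01", "2015-01-01",
                            "2015-02-01", "2015-03-01"]),
    "a": [9, 5, 1, 2, 3],
})

# Group by calendar year.
by_year = data.groupby(data["date"].dt.year)["a"].sum()

# Group by day of week (Monday=0 ... Sunday=6).
by_weekday = data.groupby(data["date"].dt.dayofweek)["a"].sum()

# Group by (year, month); renaming the keys gives the result
# a MultiIndex with readable level names.
year = data["date"].dt.year.rename("year")
month = data["date"].dt.month.rename("month")
by_year_month = data.groupby([year, month])["a"].sum()

print(by_year)
print(by_weekday)
print(by_year_month)
```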

+29
Sep 25 '15 at 13:55

This should work:

 data.groupby(lambda x: data['date'][x].year) 
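A short sketch of why this works, on made-up data: a function passed to `groupby` is called with each index label, and `data['date'][x]` looks up the date for that label. The string index below is an assumption chosen to make the label lookup visible; it relies on the index labels being unique.

```python
import pandas as pd

# Hypothetical sample data with unique string labels as the index.
data = pd.DataFrame(
    {
        "date": pd.to_datetime(["2012-01-01", "2012-06-01", "2015-01-01"]),
        "a": [9, 5, 1],
    },
    index=["r1", "r2", "r3"],
)

# `x` is an index label ("r1", "r2", ...), not a row or a date.
totals = data.groupby(lambda x: data["date"][x].year)["a"].sum()
print(totals)
```

Because the lookup runs once per row in Python, this form is likely slower than a vectorized key on large frames.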
+11
Jul 09

This may be easier to explain with a sample dataset.

Create sample data

Suppose we have one column of Timestamps, date, and another column that we would like to aggregate, a.

 import pandas as pd

 df = pd.DataFrame({'date': pd.DatetimeIndex(['2012-1-1', '2012-6-1', '2015-1-1', '2015-2-1', '2015-3-1']),
                    'a': [9, 5, 1, 2, 3]}, columns=['date', 'a'])
 df

         date  a
 0 2012-01-01  9
 1 2012-06-01  5
 2 2015-01-01  1
 3 2015-02-01  2
 4 2015-03-01  3

There are several ways to group by year.

  • Use the dt accessor with the year property
  • Put date in the index and use an anonymous function to access the year
  • Use resample method
  • Convert to pandas Period

Use the dt accessor with the year property

If you have a column (not an index) of pandas Timestamps, you can access many additional properties and methods using the dt accessor. For example:

 df['date'].dt.year

 0    2012
 1    2012
 2    2015
 3    2015
 4    2015
 Name: date, dtype: int64

We can use this to form our groups and calculate several aggregations on a specific column:

 df.groupby(df['date'].dt.year)['a'].agg(['sum', 'mean', 'max'])

       sum  mean  max
 date
 2012   14     7    9
 2015    6     2    3
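If you want more descriptive names in the result, one option is named aggregation together with a renamed key. This is only a sketch under the assumption of a reasonably recent pandas (named aggregation appeared in 0.25, if I remember correctly); the output column names `total`, `average`, and `largest` are made up for illustration.

```python
import pandas as pd

# Same sample data as above.
df = pd.DataFrame({
    "date": pd.DatetimeIndex(["2012-1-1", "2012-6-1", "2015-1-1",
                              "2015-2-1", "2015-3-1"]),
    "a": [9, 5, 1, 2, 3],
})

# Rename the key so the result's index is called "year" instead of "date".
year_key = df["date"].dt.year.rename("year")

# Named aggregation: keyword = output column, value = aggregation to apply.
summary = df.groupby(year_key)["a"].agg(total="sum", average="mean", largest="max")
print(summary)
```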



Put date in the index and use an anonymous function to access the year

If you set the date column as the index, it becomes a DatetimeIndex with the same properties and methods that the dt accessor gives to normal columns.

 df1 = df.set_index('date')
 df1.index.year

 Int64Index([2012, 2012, 2015, 2015, 2015], dtype='int64', name='date')

Interestingly, you can pass a function to the groupby method. This function is implicitly called with each value of the DataFrame's index. Thus, we can get the same result as above with the following:

 df1.groupby(lambda x: x.year)['a'].agg(['sum', 'mean', 'max'])

       sum  mean  max
 2012   14     7    9
 2015    6     2    3
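As a side note, the anonymous function is not strictly required once the dates are in the index. A minimal sketch, assuming the same sample data: the `year` attribute of the index can be passed directly as the key, which should give the same groups as the function-based version.

```python
import pandas as pd

# Same sample data as above, with "date" moved into the index.
df = pd.DataFrame({
    "date": pd.DatetimeIndex(["2012-1-1", "2012-6-1", "2015-1-1",
                              "2015-2-1", "2015-3-1"]),
    "a": [9, 5, 1, 2, 3],
})
df1 = df.set_index("date")

# Function-based key: called once per index value (a Timestamp).
by_function = df1.groupby(lambda ts: ts.year)["a"].sum()

# Array-based key: one year per row, computed in a single vectorized step.
by_attribute = df1.groupby(df1.index.year)["a"].sum()

print(by_function)
print(by_attribute)
```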



Use the resample method

If the date column is not in the index, you must specify it with the on parameter. You also need to pass an offset alias as a string (here 'AS', which means year start). Note that resample also returns rows for years that contain no data.

 df.resample('AS', on='date')['a'].agg(['sum', 'mean', 'max'])

              sum  mean  max
 date
 2012-01-01  14.0   7.0  9.0
 2013-01-01   NaN   NaN  NaN
 2014-01-01   NaN   NaN  NaN
 2015-01-01   6.0   2.0  3.0
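The main practical difference from groupby is that resample builds a regular yearly grid, so years without rows still appear in the result. The sketch below illustrates this on the same sample data. It uses the 'YS' alias, which I believe is the equivalent year-start spelling that newer pandas versions prefer over 'AS'; treat that as an assumption and check the offset alias table for your version.

```python
import pandas as pd

# Same sample data as above.
df = pd.DataFrame({
    "date": pd.DatetimeIndex(["2012-1-1", "2012-6-1", "2015-1-1",
                              "2015-2-1", "2015-3-1"]),
    "a": [9, 5, 1, 2, 3],
})

# Yearly bins from 2012 through 2015, including the empty years 2013 and 2014.
resampled = df.resample("YS", on="date")["a"].mean()

# groupby only produces the years that actually occur in the data.
grouped = df.groupby(df["date"].dt.year)["a"].mean()

# Dropping the NaN rows leaves only the years that have data.
resampled_non_empty = resampled.dropna()

print(resampled)
print(grouped)
print(resampled_non_empty)
```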



Convert to pandas Period

You can also convert the date column to pandas Period objects. We must pass the offset alias as a string to determine the length of the Period.

 df['date'].dt.to_period('A')

 0    2012
 1    2012
 2    2015
 3    2015
 4    2015
 Name: date, dtype: object

Then we can use this as the grouping key:

 df.groupby(df['date'].dt.to_period('Y'))['a'].agg(['sum', 'mean', 'max'])

       sum  mean  max
 2012   14     7    9
 2015    6     2    3
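One thing to keep in mind with this approach is that the result is indexed by Period objects rather than integers. A small sketch, assuming the same sample data, of how the result could be converted back to plain years or to timestamps afterwards:

```python
import pandas as pd

# Same sample data as above.
df = pd.DataFrame({
    "date": pd.DatetimeIndex(["2012-1-1", "2012-6-1", "2015-1-1",
                              "2015-2-1", "2015-3-1"]),
    "a": [9, 5, 1, 2, 3],
})

# Group by yearly Period; the result's index holds Period values.
by_period = df.groupby(df["date"].dt.to_period("Y"))["a"].sum()

# Plain integer years taken from each Period in the index.
years = [p.year for p in by_period.index]

# Same values, indexed by the timestamp at the start of each period.
by_timestamp = by_period.to_timestamp()

print(by_period)
print(years)
print(by_timestamp)
```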
+2
Nov 06 '17 at 15:34

This will also work:

 data.groupby(data['date'].dt.year)

0
Oct 08 '17 at 20:39


