Simultaneous groupby and resample on a pandas dataframe?

Question

My pandas dataframe consists of a categorical column JOB_TITLE, a numeric column BASE_SALARY, and a datetime index JOIN_DATE. I would like to aggregate by both the categorical column and downsampled datetimes, as follows:

    # Resample at a 5-year frequency, anchored at the start of the year
    mean_agg = (df
                .groupby('JOB_TITLE')
                .resample('5AS')['BASE_SALARY']
                .mean())

Unfortunately, because the groupby operation occurs before resampling, the resample operation is performed independently for each JOB_TITLE group. This leads to the following series:

| JOB_TITLE         | JOIN_DATE  | BASE_SALARY |
|-------------------|------------|-------------|
| Data Scientist    | 2004-01-01 | 60000       |
|                   | 2009-01-01 | 75000       |
|                   | 2014-01-01 | 90000       |
| Software Engineer | 2001-01-01 | 70000       |
|                   | 2006-01-01 | 85000       |
|                   | 2011-01-01 | 90000       |
|                   | 2016-01-01 | 85000       |

As you can see, the indexes at the JOIN_DATE level for the Data Scientist and Software Engineer groups are not aligned. This creates a problem when you apply unstack to the JOB_TITLE level as follows:

 mean_agg.unstack('JOB_TITLE')

This results in the following dataframe:

| JOB_TITLE  | Data Scientist | Software Engineer |
|------------|----------------|-------------------|
| JOIN_DATE  |                |                   |
| 2001-01-01 | NaN            | 70000             |
| 2004-01-01 | 60000          | NaN               |
| 2006-01-01 | NaN            | 85000             |
| 2009-01-01 | 75000          | NaN               |
| 2011-01-01 | NaN            | 70000             |
| 2014-01-01 | 90000          | NaN               |
| 2016-01-01 | NaN            | 85000             |

How can I avoid running groupby and resample sequentially and instead apply both at the same time, so that the date bins are shared across groups? Thanks!

python pandas group-by time-series dataframe

S. Naribole Mar 18 '17 at 5:06

1 answer

Scott Boston · Accepted Answer · 2017-03-18T06:10:40+0000

Update for pandas 0.21: pd.TimeGrouper is deprecated; use pd.Grouper instead.

    mean_agg = (df
                .groupby(['JOB_TITLE', pd.Grouper(freq='5AS')])['BASE_SALARY']
                .mean())
    mean_agg.unstack('JOB_TITLE')

Instead of calling resample after groupby, try using pd.TimeGrouper:

    mean_agg = (df
                .groupby(['JOB_TITLE', pd.TimeGrouper(freq='5AS')])['BASE_SALARY']
                .mean())
    mean_agg.unstack('JOB_TITLE')

TimeGrouper aligns the bins across the grouped time range, so every JOB_TITLE group shares the same JOIN_DATE bin edges.
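To see the difference between the two approaches, here is a minimal, self-contained sketch. The sample dates and salaries are made up for illustration and are not the asker's data. It also uses the '5YS' alias rather than '5AS', on the assumption that you are on a recent pandas release where 'AS' is deprecated in favor of 'YS'; on older versions '5AS' should behave the same way.

```python
import pandas as pd

# Hypothetical sample data (illustrative values only, not the asker's data).
df = pd.DataFrame(
    {
        "JOB_TITLE": [
            "Software Engineer", "Data Scientist", "Software Engineer",
            "Data Scientist", "Software Engineer", "Data Scientist",
            "Software Engineer",
        ],
        "BASE_SALARY": [70000, 60000, 85000, 75000, 90000, 90000, 85000],
    },
    index=pd.DatetimeIndex(
        ["2001-03-01", "2004-05-01", "2007-06-01", "2009-07-01",
         "2012-02-01", "2014-08-01", "2016-09-01"],
        name="JOIN_DATE",
    ),
).sort_index()

# groupby followed by resample: each group is binned on its own,
# so the 5-year bins start at each group's first year.
per_group = df.groupby("JOB_TITLE").resample("5YS")["BASE_SALARY"].mean()

# groupby with pd.Grouper: the 5-year bins are computed once over the
# whole JOIN_DATE index, so all groups share the same bin edges.
aligned = (
    df.groupby(["JOB_TITLE", pd.Grouper(freq="5YS")])["BASE_SALARY"]
    .mean()
    .unstack("JOB_TITLE")
)

print(per_group)
print(aligned)
```

With pd.Grouper, both job titles land in bins anchored at the earliest year in the whole index, so the unstacked frame only has NaN where a group genuinely has no rows in that 5-year window.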
