Pandas counting consecutive dates in a groupby object

Question

This is an example of a data frame I'm working with:

    import pandas as pd

    d = {
        'item_number': [
            'bdsm1000', 'bdsm1000', 'bdsm1000',
            'ZZRWB18', 'ZZRWB18', 'ZZRWB18', 'ZZRWB18',
            'ZZHP1427BLK',
            'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427',
            'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427',
            'ZZHP1427', 'ZZHP1427', 'ZZHP1427', 'ZZHP1427',
            'ZZHP1414', 'ZZHP1414', 'ZZHP1414',
            'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR',
            'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR',
            'WRM115WNTR', 'WRM115WNTR', 'WRM115WNTR',
            'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE',
            'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE',
            'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE', 'WRM115SCFRE',
        ],
        'Comp_ID': [
            2454, 2454, 2454,
            1395, 1395, 1395, 1395,
            3378,
            1266941, 660867, 43978, 1266941, 660867, 43978,
            1266941, 660867, 43978, 1266941, 660867, 43978,
            43978, 43978, 43978,
            1197347907, 70745, 4737, 1197347907, 4737,
            1197347907, 70745, 4737, 1197347907, 70745,
            4737, 1197347907, 4737,
            1197487704, 1197347907, 70745, 23872, 4737,
            1197347907, 4737, 1197487704, 1197347907, 23872,
            4737, 1197487704, 1197347907, 70745,
        ],
        'date': [
            '2016-11-22', '2016-11-20', '2016-11-19',
            '2016-11-22', '2016-11-20', '2016-11-19', '2016-11-18',
            '2016-11-22',
            '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-20',
            '2016-11-20', '2016-11-20', '2016-11-19', '2016-11-19',
            '2016-11-19', '2016-11-18', '2016-11-18', '2016-11-18',
            '2016-11-22', '2016-11-20', '2016-11-19',
            '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-21', '2016-11-21',
            '2016-11-20', '2016-11-20', '2016-11-20', '2016-11-19', '2016-11-19',
            '2016-11-19', '2016-11-18', '2016-11-18',
            '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-22', '2016-11-22',
            '2016-11-21', '2016-11-21', '2016-11-20', '2016-11-20', '2016-11-20',
            '2016-11-20', '2016-11-19', '2016-11-19', '2016-11-19',
        ],
    }
    df = pd.DataFrame(data=d)
    df.date = pd.to_datetime(df.date)

I would like to count consecutive observations, starting from 2016-11-22 and grouped by Comp_ID and item_number.

Essentially, I want to count how many days in a row there is an observation, counting back from today's date, for each Comp_ID and item_number. (This example was compiled on November 22.) Runs of consecutive observations that occurred days or weeks before today are not relevant. Only sequences such as today, yesterday, the day before yesterday, and so on are relevant.

I got this to work with a smaller sample, but it does not seem to work with a large dataset.

Here is the code for the smaller sample. I need to find consecutive dates with observations across thousands of sellers and items. For some reason, the code below does not work on the large dataset.

    d = {'item_number': ['KIN005','KIN005','KIN005','KIN005','KIN005','A789B','A789B','A789B','G123H','G123H','G123H'],
         'Comp_ID': ['1395','1395','1395','1395','1395','7787','7787','7787','1395','1395','1395'],
         'date': ['2016-11-22','2016-11-21','2016-11-20','2016-11-14','2016-11-13','2016-11-22','2016-11-21','2016-11-12','2016-11-22','2016-11-21','2016-11-08']}
    df = pd.DataFrame(data=d)
    df.date = pd.to_datetime(df.date)

    d = pd.Timedelta(1, 'D')

    df = df.sort_values(['item_number','date','Comp_ID'], ascending=False)

    g = df.groupby(['Comp_ID','item_number'])

    sequence = g['date'].apply(lambda x: x.diff().fillna(0).abs().le(d)).reset_index()
    sequence.set_index('index', inplace=True)
    test = df.join(sequence)
    test.columns = ['Comp_ID','date','item_number','consecutive']
    g = test.groupby(['Comp_ID','item_number'])
    g['consecutive'].apply(lambda x: x.idxmin() - x.idxmax())

This gives the desired result for a smaller data set:

    Comp_ID  item_number
    1395     G123H          2
             KIN005         3
    7787     A789B          2
    Name: consecutive, dtype: int64

python pandas

Yale newman Nov 25 '16 at 18:46

2 answers

Maxu · Answer 1 · 2016-11-27T20:09:43+0000

You can do it as follows:

    today = pd.to_datetime('2016-11-22')

    # sort DF by `date` (descending)
    x = df.sort_values('date', ascending=False)

    g = x.groupby(['Comp_ID','item_number'])

    # compare the # of days to `today` with a consecutive day# in each group
    x[(today - x['date']).dt.days == g.cumcount()].groupby(['Comp_ID','item_number']).size()

Result:

    Comp_ID  item_number
    1395     G123H          2
             KIN005         3
    7787     A789B          2
    dtype: int64

PS thanks to @DataSwede for faster diff calculation

Explanation:

    In [124]: x[(today - x['date']).dt.days == g.cumcount()] \
         ...:     .sort_values(['Comp_ID','item_number','date'], ascending=[1,1,0])
    Out[124]:
      Comp_ID       date item_number
    8    1395 2016-11-22       G123H
    9    1395 2016-11-21       G123H
    0    1395 2016-11-22      KIN005
    1    1395 2016-11-21      KIN005
    2    1395 2016-11-20      KIN005
    5    7787 2016-11-22       A789B
    6    7787 2016-11-21       A789B
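To spell out why the comparison works: once each group is sorted newest first, `cumcount()` numbers its rows 0, 1, 2, ..., and a row belongs to the streak only while its distance from `today` (in days) equals that number. The sketch below wraps the same idea in a function. The name `count_streaks` and the small demo frame are illustrative only, and the two guard steps (dropping repeated dates within a group and ignoring rows dated after `today`) are my own assumptions about real data, not part of the answer above.

```python
import pandas as pd

def count_streaks(df, today, keys=("Comp_ID", "item_number")):
    """Count consecutive daily observations ending at `today`, per group."""
    keys = list(keys)
    # Assumption: keep one row per (group, date), since the position-based
    # comparison below only works when dates are unique within a group.
    x = df.drop_duplicates(keys + ["date"])
    # Assumption: rows dated after `today` cannot be part of the streak.
    x = x[x["date"] <= today].sort_values("date", ascending=False)
    # 0 for the newest row in each group, 1 for the next one, and so on.
    position = x.groupby(keys).cumcount()
    # Whole days between `today` and each observation.
    days_back = (today - x["date"]).dt.days
    # A row is in the streak only while days_back keeps pace with position.
    return x[days_back == position].groupby(keys).size()

# Demo data copied from the smaller sample in the question.
small = pd.DataFrame({
    "item_number": ["KIN005"] * 5 + ["A789B"] * 3 + ["G123H"] * 3,
    "Comp_ID": ["1395"] * 5 + ["7787"] * 3 + ["1395"] * 3,
    "date": ["2016-11-22", "2016-11-21", "2016-11-20", "2016-11-14", "2016-11-13",
             "2016-11-22", "2016-11-21", "2016-11-12",
             "2016-11-22", "2016-11-21", "2016-11-08"],
})
small["date"] = pd.to_datetime(small["date"])

streaks = count_streaks(small, pd.Timestamp("2016-11-22"))
print(streaks)
```

Note that a group with no observation on `today` matches no rows at all, so it is simply absent from the result rather than reported with a count of 0.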

Bruce pucci · Answer 2 · 2016-11-25T22:13:44+0000

First, I suggest writing a generator that yields a series of dates, each one day earlier than the previous one ...

    import datetime
    import pandas as pd

    def gen_prior_date(start_date):
        yield start_date
        while True:
            start_date -= datetime.timedelta(days=1)
            yield start_date

...

    >>> start_date = datetime.date(2016, 11, 22)
    >>> back_in_time = gen_prior_date(start_date)
    >>> next(back_in_time)
    datetime.date(2016, 11, 22)
    >>> next(back_in_time)
    datetime.date(2016, 11, 21)

Now we need a function that we can apply to each group ...

    def count_consec_dates(dates, start_date):
        dates = pd.to_datetime(dates.values).date
        dates_set = set(dates)  # O(1) vs O(n) lookup times
        back_in_time = gen_prior_date(start_date)
        tally = 0
        while next(back_in_time) in dates_set:  # jump out on first miss
            tally += 1
        return tally

The rest is easy ...

    >>> small_data = {'item_number': ['KIN005','KIN005','KIN005','KIN005','KIN005','A789B','A789B','A789B','G123H','G123H','G123H'],
    ...               'Comp_ID': ['1395','1395','1395','1395','1395','7787','7787','7787','1395','1395','1395'],
    ...               'date': ['2016-11-22','2016-11-21','2016-11-20','2016-11-14','2016-11-13','2016-11-22','2016-11-21','2016-11-12','2016-11-22','2016-11-21','2016-11-08']}
    >>> small_df = pd.DataFrame(data=small_data)
    >>> start_date = datetime.date(2016, 11, 22)
    >>> groups = small_df.groupby(['Comp_ID', 'item_number']).date
    >>> groups.apply(lambda x: count_consec_dates(x, start_date))
    Comp_ID  item_number
    1395     G123H          2
             KIN005         3
    7787     A789B          2
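One property of this set-based approach that may matter on real data: a group with no observation on the start date still appears in the result, with a count of 0, because the loop stops at the first miss. The sketch below shows this with a plain `while` loop in place of the generator. The `demo` frame is made up for illustration and is not taken from the question.

```python
import datetime
import pandas as pd

def count_consec_dates(dates, start_date):
    """Count how many consecutive days, walking back from start_date, occur in dates."""
    # Normalise to datetime.date objects so they compare equal to start_date.
    seen = set(pd.to_datetime(dates.values).date)
    tally = 0
    day = start_date
    while day in seen:  # stop at the first missing day
        tally += 1
        day -= datetime.timedelta(days=1)
    return tally

# Hypothetical data: the second group has no observation on the start date.
demo = pd.DataFrame({
    "Comp_ID": ["1395", "1395", "7787", "7787"],
    "item_number": ["G123H", "G123H", "A789B", "A789B"],
    "date": ["2016-11-22", "2016-11-21", "2016-11-20", "2016-11-19"],
})

start = datetime.date(2016, 11, 22)
counts = demo.groupby(["Comp_ID", "item_number"])["date"].apply(
    lambda s: count_consec_dates(s, start)
)
print(counts)
```

Because membership is checked against a set, repeated dates within a group do not inflate the count.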
