I just used the elementary approach (in fact, this is a modified version of the OP approach), no fancy numpy or pandas ops, but linear instead of quadratic complexity (when using the distance approach).
However (like Corey Madden), I believe that the data is sorted relative to the date column. Hope this is correct:
Dataframe -> I use the pandas index here:
import pandas as pd df = pd.DataFrame({'date': ["2007-09-15","2008-06-01","2008-10-25", "2009-05-13","2009-11-07", "2009-11-15", "2011-07-03"], 'close':[123.45, 130.13, 132.01, 118.34, 145.99, 146.73, 171.10]}) df["date"]=pd.to_datetime(df["date"])
The following code block can be easily wrapped in a function and calculated the correct data indices for X = 365:
X = 365 filter_ids = [0] last_day = df.loc[0, "date"] for index, row in df[1:].iterrows(): if (row["date"] - last_day).days > X: filter_ids.append(index) last_day = row["date"]
and the result:
print(df.loc[filter_ids,:]) close date 0 123.45 2007-09-15 2 132.01 2008-10-25 4 145.99 2009-11-07 6 171.10 2011-07-03
note that the indices are shifted by one due to the index starting at zero.
I just wanted to comment on linear or quadratic complexity. My solution has linear time complexity, looking at each row of the data frame exactly once. The Cory maddens solution has quadratic complexity: at each iteration, each line of the data frame is opened. However, if X (day difference) is large, we can drop a huge portion of the end of the data set by doing very few iterations.
To this end, we could consider the following worst-case scenario for X=2 datasets:
df = pd.DataFrame({'date':pd.date_range(start='01.01.1900', end='01.01.2100', freq='D')})
On my machine, the following codes give:
%%timeit X = 2 filter_ids = [0] last_day = df.loc[0, "date"] for index, row in df[1:].iterrows(): if (row["date"] -last_day).days > X: filter_ids.append(index) last_day = row["date"] 1 loop, best of 3: 7.06 s per loop
and
day_diffs = abs(df.iloc[0].date - df.date).dt.days i = 0 days = 2 idx = day_diffs.index[i] good_ids = {idx} while True: try: current_row = day_diffs[idx] day_diffs = day_diffs.iloc[1:] records_not_overlapping = (day_diffs - current_row) > days idx = records_not_overlapping[records_not_overlapping == True].index[0] good_ids.add(idx) except IndexError: break 1 loop, best of 3: 3min 16s per loop