Removing overlapping dates from pandas dataframe

I have a pandas dataframe that looks like this:

    ID  date      close
    1   09/15/07  123.45
    2   06/01/08  130.13
    3   10/25/08  132.01
    4   05/13/09  118.34
    5   11/07/09  145.99
    6   11/15/09  146.73
    7   07/03/11  171.10

I want to remove any overlapping rows.

An overlapping row is defined as any row that falls within X days of another row. For example, if X = 365, then the result should be:

    ID  date      close
    1   09/15/07  123.45
    3   10/25/08  132.01
    5   11/07/09  145.99
    7   07/03/11  171.10

If X = 50, the result should be:

    ID  date      close
    1   09/15/07  123.45
    2   06/01/08  130.13
    3   10/25/08  132.01
    4   05/13/09  118.34
    5   11/07/09  145.99
    7   07/03/11  171.10

I reviewed a few questions here, but did not find the right approach. For example, "Pandas check if dates coincide on several lines" and "Fastest way to exclude certain dates from pandas dataframe" are similar, but not quite what I need.

Currently I have the following ugly code, which works for small X values, but when X gets bigger (e.g., X = 365), it deletes all dates except the first one.

    filter_dates = []
    for index, row in df.iterrows():
        if observation_time == 'D':
            for i in range(1, observation_period):
                filter_dates.append(index.date() + timedelta(days=i))
    df = df[~df.index.isin(filter_dates)]

Any help / pointers would be appreciated!

Clarification:

To solve this problem, you need to look at each row, not just the first row.

5 answers

I just used an elementary approach (in fact, a modified version of the OP's approach): no fancy numpy or pandas ops, but linear instead of quadratic complexity (as with the distance-matrix approach).
Like Cory Madden, I assume the data is sorted by the date column. Hope this is correct:

The dataframe (I use the pandas index here):

    import pandas as pd

    df = pd.DataFrame({'date': ["2007-09-15", "2008-06-01", "2008-10-25",
                                "2009-05-13", "2009-11-07", "2009-11-15",
                                "2011-07-03"],
                       'close': [123.45, 130.13, 132.01, 118.34, 145.99,
                                 146.73, 171.10]})
    df["date"] = pd.to_datetime(df["date"])

The following code block can easily be wrapped in a function; it computes the correct row indices for X = 365:

    X = 365
    filter_ids = [0]
    last_day = df.loc[0, "date"]
    for index, row in df[1:].iterrows():
        if (row["date"] - last_day).days > X:
            filter_ids.append(index)
            last_day = row["date"]

and the result:

    print(df.loc[filter_ids, :])

        close       date
    0  123.45 2007-09-15
    2  132.01 2008-10-25
    4  145.99 2009-11-07
    6  171.10 2011-07-03

Note that the indices are shifted by one relative to the IDs in the question because the index starts at zero.


I just wanted to comment on the linear vs. quadratic complexity. My solution has linear time complexity, looking at each row of the dataframe exactly once. Cory Madden's solution has quadratic complexity: at each iteration, every remaining row of the dataframe is touched. However, if X (the day difference) is large, it can drop a huge portion of the end of the dataset in very few iterations.

To this end, we could consider the following worst-case dataset for X = 2 (about 73,000 daily rows):

    df = pd.DataFrame({'date': pd.date_range(start='01.01.1900', end='01.01.2100', freq='D')})

On my machine, the following snippets give:

    %%timeit
    X = 2
    filter_ids = [0]
    last_day = df.loc[0, "date"]
    for index, row in df[1:].iterrows():
        if (row["date"] - last_day).days > X:
            filter_ids.append(index)
            last_day = row["date"]

    1 loop, best of 3: 7.06 s per loop

and

    day_diffs = abs(df.iloc[0].date - df.date).dt.days
    i = 0
    days = 2
    idx = day_diffs.index[i]
    good_ids = {idx}
    while True:
        try:
            current_row = day_diffs[idx]
            day_diffs = day_diffs.iloc[1:]
            records_not_overlapping = (day_diffs - current_row) > days
            idx = records_not_overlapping[records_not_overlapping == True].index[0]
            good_ids.add(idx)
        except IndexError:
            break

    1 loop, best of 3: 3min 16s per loop

I found another solution for this (you can view the edit history if you want to see the old ones). This is the best solution I've come across. It keeps the first sequential record, but it can be adjusted to keep the record that comes first chronologically instead (provided at the end).

    days = 365  # the X threshold from the question; not defined in the original snippet

    target = df.iloc[0]  # Get the first item in the dataframe
    day_diff = abs(target.date - df.date)  # Differences of all the other dates from the first item
    day_diff = day_diff.reset_index().sort_values(['date', 'index'])  # Reset the index, then sort by date and original index so we maintain the order of the dates
    day_diff.columns = ['old_index', 'date']  # Rename the old index column because of a name clash
    good_ids = day_diff.groupby(day_diff.date.dt.days // days).first().old_index.values  # Group the dates by range, then take the first item from each group
    df.iloc[good_ids]
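To see why this works on the sample data, here is the bucketing spelled out (a sketch; the values are computed from the question's dates):

    # day_diff in days:           [0, 260, 406, 606, 784, 792, 1387]
    # day_diff // 365 (buckets):  [0,   0,   1,   1,   2,   2,    3]
    # first row of each bucket -> good_ids = [0, 2, 4, 6]
    df.iloc[good_ids]   # IDs 1, 3, 5, 7 of the question, as expected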

Once again, I ran several tests to compare against QuickBeam's method. The dataframes used were 600,000 rows sorted randomly, and an ordered dataframe with 73,000 rows:

My method:

    DataFrame    days  time
    600k/random  2     1 loop, best of 3: 5.03 s per loop
    ordered      2     1 loop, best of 3: 564 ms per loop
    600k/random  50    1 loop, best of 3: 5.17 s per loop
    ordered      50    1 loop, best of 3: 583 ms per loop
    600k/random  365   1 loop, best of 3: 5.16 s per loop
    ordered      365   1 loop, best of 3: 577 ms per loop

QuickBeam's method:

    DataFrame    days  time
    600k/random  2     1 loop, best of 3: 52.8 s per loop
    ordered      2     1 loop, best of 3: 4.89 s per loop
    600k/random  50    1 loop, best of 3: 53 s per loop
    ordered      50    1 loop, best of 3: 4.53 s per loop
    600k/random  365   1 loop, best of 3: 53.7 s per loop
    ordered      365   1 loop, best of 3: 4.49 s per loop

So yes, maybe I'm a little competitive ...

Exact functions used for testing:

    def my_filter(df, days):
        target = df.iloc[0]
        day_diff = abs(target.date - df.date)
        day_diff = day_diff.reset_index().sort_values(['date', 'index'])
        day_diff.columns = ['old_index', 'date']
        good_ids = day_diff.groupby(day_diff.date.dt.days // days).first().old_index.values
        return df.iloc[good_ids]

    def quickbeam_filter(df, days):
        filter_ids = [0]
        last_day = df.loc[0, "date"]
        for index, row in df[1:].iterrows():
            if (row["date"] - last_day).days > days:
                filter_ids.append(index)
                last_day = row["date"]
        return df.loc[filter_ids, :]

If you want to keep the date that comes first chronologically within each range, which makes more sense to me, you can use this version:

    def my_filter(df, days):
        target = df.iloc[0]
        day_diff = abs(target.date - df.date)  # Series of Timedeltas
        day_diff = day_diff.sort_values()      # smallest difference first
        good_ids = day_diff.groupby(day_diff.dt.days // days).head(1).index.values
        return df.iloc[good_ids]

You can add a new column to filter the results:

    df['filter'] = df['date'] - df['date'][0]
    df['filter'] = df['filter'].apply(lambda x: x.days)

Then to filter by 365 use this:

    df[df['filter'] % 365 == 0]
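As a side note (not part of the original answer), the subtraction already yields a timedelta series, so the same column can be built without apply by using the .dt.days accessor:

    df['filter'] = (df['date'] - df['date'][0]).dt.days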

My approach was to calculate the distance matrix first:

    import numpy as np  # not shown in the original answer

    distM = np.array([[np.timedelta64(abs(x - y), 'D').astype(int) for y in df.date]
                      for x in df.date])

In your example, it will look like this:

    [[   0  260  406  606  784  792 1387]
     [ 260    0  146  346  524  532 1127]
     [ 406  146    0  200  378  386  981]
     [ 606  346  200    0  178  186  781]
     [ 784  524  378  178    0    8  603]
     [ 792  532  386  186    8    0  595]
     [1387 1127  981  781  603  595    0]]

Since we iterate from the top down, we only need the distances in the upper triangle, so we mask the array: every entry of the upper triangle that is at most 365 is set to a large number M, here 10,000 (because np.triu zeroes the lower triangle, those entries get masked as well):

    distM[np.triu(distM) <= 365] = 10000

Then take the argmin across the masked distance matrix to decide which rows of the dataframe to keep:

    keep = np.unique(np.argmin(distM, axis=1))  # row positions to keep
    df = df.iloc[keep, :]

and all together ...

    distM = np.array([[np.timedelta64(abs(x - y), 'D').astype(int) for y in df.date]
                      for x in df.date])
    distM[np.triu(distM) <= 365] = 10000
    keep = np.unique(np.argmin(distM, axis=1))
    df = df.iloc[keep, :]
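On the example data, the row-wise argmins after masking are [2, 4, 4, 6, 6, 6, 0], so np.unique gives [0, 2, 4, 6], and the kept rows match the X = 365 result above.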

For those looking for the answer that worked for me, here it is (based on @Quickbeam2k1's answer):

    X = 50  # or whatever value you want
    remove_ids = []
    last_day = df.loc[0, "date"]
    for index, row in df[1:].iterrows():
        if np.busday_count(last_day, df.loc[index, "date"]) < X:
            remove_ids.append(index)
        else:
            last_day = df.loc[index, "date"]
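Note that np.busday_count counts business days (weekdays) rather than calendar days, so X means something slightly different here than in the other answers, and numpy needs inputs it can cast to datetime64[D]; pandas Timestamps carry a time component, so depending on your versions an explicit conversion may be required. A minimal sketch with that conversion (the variable names and the final drop are mine, not part of the answer):

    import numpy as np

    # Assumption: df is the question's dataframe, sorted by "date" with a zero-based index.
    dates = df["date"].dt.date   # datetime.date objects cast cleanly to datetime64[D]

    X = 50
    remove_ids = []
    last_day = dates.iloc[0]
    for index in df.index[1:]:
        # np.busday_count excludes weekends from the count
        if np.busday_count(last_day, dates.loc[index]) < X:
            remove_ids.append(index)
        else:
            last_day = dates.loc[index]

    df_filtered = df.drop(remove_ids)   # keep only the non-overlapping rows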

Source: https://habr.com/ru/post/1270743/

