Fastest way to exclude specific dates from pandas dataframe

I work with a large data frame and am looking for an efficient way to remove specific dates. Note that I want to remove every measurement taken on a given date, whatever its time of day.

Pandas has this great feature where you can call:

    df.loc['2016-04-22']

and pull out all the rows from that day. But what if I want to delete all rows from "2016-04-22"?

I need something like this:

    df.loc[~'2016-04-22']

(but it does not work)

Also, what if I want to delete a whole list of dates?

Right now I have the following solution:

    import numpy as np
    import pandas as pd
    from numpy import random

    ### Create a sample data frame
    dates = [pd.Timestamp('2016-04-25 06:48:33'), pd.Timestamp('2016-04-27 15:33:23'),
             pd.Timestamp('2016-04-23 11:23:41'), pd.Timestamp('2016-04-28 12:08:20'),
             pd.Timestamp('2016-04-21 15:03:49'), pd.Timestamp('2016-04-23 08:13:42'),
             pd.Timestamp('2016-04-27 21:18:22'), pd.Timestamp('2016-04-27 18:08:23'),
             pd.Timestamp('2016-04-27 20:48:22'), pd.Timestamp('2016-04-23 14:08:41'),
             pd.Timestamp('2016-04-27 02:53:26'), pd.Timestamp('2016-04-25 21:48:31'),
             pd.Timestamp('2016-04-22 12:13:47'), pd.Timestamp('2016-04-27 01:58:26'),
             pd.Timestamp('2016-04-24 11:48:37'), pd.Timestamp('2016-04-22 08:38:46'),
             pd.Timestamp('2016-04-26 13:58:28'), pd.Timestamp('2016-04-24 15:23:36'),
             pd.Timestamp('2016-04-22 07:53:46'), pd.Timestamp('2016-04-27 23:13:22')]
    values = random.normal(20, 20, 20)
    df = pd.DataFrame(index=dates, data=values, columns=['values']).sort_index()

    ### This is the list of dates I want to remove
    removelist = ['2016-04-22', '2016-04-24']

For each date I want to delete, this for loop collects the matching index entries, removes them from the main data frame's index, and then selects the remaining (i.e. good) dates from the data frame.

    for r in removelist:
        # All index entries that fall on the day r
        elimlist = df.loc[r].index.tolist()
        ind = df.index.tolist()
        # Keep only the entries that are not being eliminated
        culind = [i for i in ind if i not in elimlist]
        df = df.loc[culind]

Is there anything better?

I also tried indexing with a rounded date + 1 day, so something like this:

    df[~((df['Timestamp'] >= pd.Timestamp(r)) & (df['Timestamp'] < pd.Timestamp(r) + pd.Timedelta("1 day")))]

But this becomes very cumbersome, and in the end I would still need a for loop whenever I have to eliminate n specific dates.

There has to be a better way! Right? Maybe?

3 answers

Same idea as @Alexander's answer, but using the DatetimeIndex and numpy.in1d:

    mask = ~np.in1d(df.index.date, pd.to_datetime(removelist).date)
    df = df.loc[mask, :]

Timings:

    %timeit df.loc[~np.in1d(df.index.date, pd.to_datetime(removelist).date), :]
    1000 loops, best of 3: 1.42 ms per loop

    %timeit df[[d.date() not in pd.to_datetime(removelist) for d in df.index]]
    100 loops, best of 3: 3.25 ms per loop
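
A variation on the same mask that stays inside pandas, offered as a sketch rather than as part of the answer above: `DatetimeIndex.normalize()` floors each timestamp to midnight, and `Index.isin` then compares the result against the parsed `removelist`. The small frame and its values are made up for illustration, and no timing is claimed for this version.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's frame (values are illustrative only).
dates = pd.to_datetime([
    "2016-04-21 15:03:49",
    "2016-04-22 07:53:46",
    "2016-04-22 12:13:47",
    "2016-04-23 08:13:42",
    "2016-04-24 11:48:37",
    "2016-04-25 06:48:33",
])
df = pd.DataFrame({"values": np.arange(len(dates), dtype=float)}, index=dates)

removelist = ["2016-04-22", "2016-04-24"]

# normalize() sets every timestamp to 00:00:00, so each row can be compared
# with the midnight timestamps that pd.to_datetime(removelist) produces.
mask = ~df.index.normalize().isin(pd.to_datetime(removelist))
filtered = df.loc[mask]

print(filtered)
```

Because the comparison is done on whole days, every row from a listed date is dropped regardless of its time of day, and dates in `removelist` that never occur in the index are simply ignored.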

You can create a boolean mask using a list comprehension.

    >>> df[[d.date() not in pd.to_datetime(removelist) for d in df.index]]
                            values
    2016-04-21 15:03:49  28.059520
    2016-04-23 08:13:42 -22.376577
    2016-04-23 11:23:41  40.350252
    2016-04-23 14:08:41  14.557856
    2016-04-25 06:48:33  -0.271976
    2016-04-25 21:48:31  20.156240
    2016-04-26 13:58:28  -3.225795
    2016-04-27 01:58:26  51.991293
    2016-04-27 02:53:26  -0.867753
    2016-04-27 15:33:23  31.585201
    2016-04-27 18:08:23  11.639641
    2016-04-27 20:48:22  42.968156
    2016-04-27 21:18:22  27.335995
    2016-04-27 23:13:22  13.120088
    2016-04-28 12:08:20  53.730511
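
A hedged variant of the same list comprehension, in case membership tests against a DatetimeIndex behave differently in your pandas version: it compares `datetime.date` objects against a plain Python set built once up front. The name `remove_dates` and the sample values are assumptions made for this sketch, not part of the answer above.

```python
import pandas as pd

# Illustrative frame; the values are placeholders.
dates = pd.to_datetime([
    "2016-04-21 15:03:49",
    "2016-04-22 08:38:46",
    "2016-04-23 11:23:41",
    "2016-04-24 15:23:36",
    "2016-04-26 13:58:28",
])
df = pd.DataFrame({"values": [10, 20, 30, 40, 50]}, index=dates)

removelist = ["2016-04-22", "2016-04-24"]

# Build the set of datetime.date objects once, so each row needs only
# a single hash lookup.
remove_dates = set(pd.to_datetime(removelist).date)

# One boolean per row: True means "keep this row".
keep = [ts.date() not in remove_dates for ts in df.index]
filtered = df[keep]

print(filtered)
```

The loop still runs in Python, so on a large frame a vectorized mask is likely to be faster, as the timings in the other answer suggest.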

This might also be useful:

 df = df.drop(pd.to_datetime('2016-04-22')) 

This explicitly finds and removes the row whose index equals pd.to_datetime('2016-04-22') and returns the rest. If you want to remove more, you can pass an iterable. I used pd.to_datetime because drop, unlike label-based indexing with ix or loc, does not automatically convert a string that looks like a date into a datetime.

The problem with this approach is that it fails if an element of the passed iterable is not in the index. There is a workaround, but at that point the answers from @Alexander and @root are more elegant.
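
The workaround is not spelled out above. One possibility, given here as an assumption about what was meant, is the `errors="ignore"` option of `DataFrame.drop`, which skips labels that are missing from the index instead of raising a `KeyError`. The sample frame is made up for this sketch.

```python
import pandas as pd

# Illustrative frame with one row exactly at midnight on 2016-04-22.
dates = pd.to_datetime([
    "2016-04-21 15:03:49",
    "2016-04-22 00:00:00",
    "2016-04-22 12:13:47",
    "2016-04-23 08:13:42",
])
df = pd.DataFrame({"values": [1.0, 2.0, 3.0, 4.0]}, index=dates)

# 2016-04-24 does not occur in the index at all.
labels = [pd.Timestamp(d) for d in ["2016-04-22", "2016-04-24"]]

# errors="ignore" drops the labels that exist and silently skips the rest.
dropped = df.drop(labels, errors="ignore")

print(dropped)
```

Note that `drop` matches exact timestamps, so only the midnight row is removed here while the 12:13:47 row from the same day survives. That is why a date-based mask remains the better fit for the original question.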


Source: https://habr.com/ru/post/1270748/

