Pandas: remove all rows within a time interval of another time series index (i.e. time range exclusion)

Suppose I have two data frames:

    #df1
    time
    2016-09-12 13:00:00.017    1.0
    2016-09-12 13:00:03.233    1.0
    2016-09-12 13:00:10.256    1.0
    2016-09-12 13:00:19.605    1.0

    #df2
    time
    2016-09-12 13:00:00.017    1.0
    2016-09-12 13:00:00.233    0.0
    2016-09-12 13:00:01.016    1.0
    2016-09-12 13:00:01.505    0.0
    2016-09-12 13:00:06.017    1.0
    2016-09-12 13:00:07.233    0.0
    2016-09-12 13:00:08.256    1.0
    2016-09-12 13:00:19.705    0.0

I want to remove all rows in df2 that fall within +1 second of any time index in df1, which would give:

    #result
    time
    2016-09-12 13:00:01.505    0.0
    2016-09-12 13:00:06.017    1.0
    2016-09-12 13:00:07.233    0.0
    2016-09-12 13:00:08.256    1.0

What is the most efficient way to do this? I don't see anything in the API for time range exclusions.

+5
3 answers

You can use pd.merge_asof, which was added in 0.19.0 and accepts a tolerance argument so that keys are matched only within the specified time interval.

    import pandas as pd

    # Assuming time is set as the index axis for both df's
    df1.reset_index(inplace=True)
    df2.reset_index(inplace=True)
    # Keep the df2 rows that found no df1 match within the tolerance
    df2.loc[pd.merge_asof(df2, df1, on='time', tolerance=pd.Timedelta('1s')).isnull().any(axis=1)]

Note that matching is performed in the backward direction by default: for each row of the left DataFrame (df2), the match is the last row of the right DataFrame (df1) whose "on" key (here "time") is less than or equal to the left key. The tolerance therefore extends only backward, so a df2 row is matched only if a df1 time lies within the 1 second before it.

To search both forward and backward, pass direction='nearest' in the function call (available from 0.20.0). The tolerance then extends in both directions, giving a matching window of +/- the tolerance.
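As a rough end-to-end sketch of the approach above, the following rebuilds the sample data from the question and compares the two directions. The column name "value" and the helper drop_near are assumptions made for illustration; they do not appear in the original answer.

```python
import pandas as pd

# Sample data from the question; the column name "value" is an assumption.
df1 = pd.DataFrame({
    "time": pd.to_datetime([
        "2016-09-12 13:00:00.017", "2016-09-12 13:00:03.233",
        "2016-09-12 13:00:10.256", "2016-09-12 13:00:19.605",
    ]),
    "value": [1.0, 1.0, 1.0, 1.0],
})
df2 = pd.DataFrame({
    "time": pd.to_datetime([
        "2016-09-12 13:00:00.017", "2016-09-12 13:00:00.233",
        "2016-09-12 13:00:01.016", "2016-09-12 13:00:01.505",
        "2016-09-12 13:00:06.017", "2016-09-12 13:00:07.233",
        "2016-09-12 13:00:08.256", "2016-09-12 13:00:19.705",
    ]),
    "value": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
})


def drop_near(left, right, direction):
    """Hypothetical helper: keep rows of `left` that have no `right`
    time within 1 second in the given search direction."""
    merged = pd.merge_asof(
        left, right, on="time",
        tolerance=pd.Timedelta("1s"),
        direction=direction,
        suffixes=("", "_right"),
    )
    # A missing right-hand value means no match was found within the tolerance.
    return left.loc[merged["value_right"].isna().to_numpy()]


kept_backward = drop_near(df2, df1, "backward")
kept_nearest = drop_near(df2, df1, "nearest")
print(kept_backward)
```

On this sample both directions keep the same four rows, because no df2 row falls within the 1 second before a df1 time. With other data, 'nearest' would also drop such rows, which goes beyond the +1 second window asked for in the question.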

+11

A similar idea to @Nickil Maveli's, but using reindex to build a boolean indexer:

    df2 = df2[df1.reindex(df2.index, method='nearest', tolerance=pd.Timedelta('1s')).isnull()]

Result:

    time
    2016-09-12 13:00:01.505    0.0
    2016-09-12 13:00:06.017    1.0
    2016-09-12 13:00:07.233    0.0
    2016-09-12 13:00:08.256    1.0
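A minimal runnable sketch of the reindex approach, assuming the data are held as Series indexed by time (the names s1 and s2 are illustrative, not from the answer). With a Series, isnull() yields one boolean per row, which is what the indexing step needs.

```python
import pandas as pd

# Sample data from the question, stored as Series with a DatetimeIndex.
s1 = pd.Series(1.0, index=pd.to_datetime([
    "2016-09-12 13:00:00.017", "2016-09-12 13:00:03.233",
    "2016-09-12 13:00:10.256", "2016-09-12 13:00:19.605",
]))
s2 = pd.Series(
    [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    index=pd.to_datetime([
        "2016-09-12 13:00:00.017", "2016-09-12 13:00:00.233",
        "2016-09-12 13:00:01.016", "2016-09-12 13:00:01.505",
        "2016-09-12 13:00:06.017", "2016-09-12 13:00:07.233",
        "2016-09-12 13:00:08.256", "2016-09-12 13:00:19.705",
    ]),
)

# For every s2 timestamp, pick the nearest s1 entry within 1 second;
# timestamps with no such neighbour come back as NaN.
near = s1.reindex(s2.index, method="nearest", tolerance=pd.Timedelta("1s"))

# Keep only the s2 rows that had no s1 neighbour.
kept = s2[near.isna()]
print(kept)
```

Note that method='nearest' looks in both directions. If the data are DataFrames rather than Series, the mask would likely need reducing to one boolean per row first (for example with .all(axis=1)), since indexing with a boolean DataFrame masks cells instead of selecting rows.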
+4

One way to do this is to look up each row by time slicing (assuming the time column is the index of both frames):

    td = pd.to_timedelta(1, unit='s')
    # True for each df2 row that has a df1 time in the 1 second before it
    mask = df2.apply(lambda row: df1[row.name - td:row.name].size > 0, axis=1)
    df2[~mask]
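The slicing approach can be tried on the question's data roughly as follows. This is a sketch under assumptions: the column name "value" is invented here, and .loc is used for the label slice to make the row selection explicit.

```python
import pandas as pd

# Sample data from the question, with time as the index of both frames.
df1 = pd.DataFrame(
    {"value": [1.0, 1.0, 1.0, 1.0]},
    index=pd.to_datetime([
        "2016-09-12 13:00:00.017", "2016-09-12 13:00:03.233",
        "2016-09-12 13:00:10.256", "2016-09-12 13:00:19.605",
    ]),
)
df2 = pd.DataFrame(
    {"value": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]},
    index=pd.to_datetime([
        "2016-09-12 13:00:00.017", "2016-09-12 13:00:00.233",
        "2016-09-12 13:00:01.016", "2016-09-12 13:00:01.505",
        "2016-09-12 13:00:06.017", "2016-09-12 13:00:07.233",
        "2016-09-12 13:00:08.256", "2016-09-12 13:00:19.705",
    ]),
)

td = pd.to_timedelta(1, unit="s")

# For each df2 row, slice df1 over [row time - 1s, row time] (both ends
# inclusive) and flag the row if the slice contains anything.
hit = df2.apply(lambda row: df1.loc[row.name - td:row.name].size > 0, axis=1)

# Rows that were not flagged are the ones to keep.
kept = df2[~hit]
print(kept)
```

Because apply calls the lambda once per row in Python, this is likely to be slower on large frames than the merge_asof or reindex approaches.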
+1

Source: https://habr.com/ru/post/1259516/

