Pandas time-based merge where timestamps do not match exactly

What methods are available to combine columns with timestamps that do not match exactly?

DF1:

    date        start_time           employee_id  session_id
    01/01/2016  01/01/2016 06:03:13  7261824      871631182

DF2:

    date        start_time           employee_id  session_id
    01/01/2016  01/01/2016 06:03:37  7261824      871631182

I could join on ['date', 'employee_id', 'session_id'], but sometimes the same employee has several otherwise-identical sessions on the same date, which produces duplicates. I could drop the rows where this happens, but then I would lose valid sessions.

Is there an efficient way to join on employee_id and session_id when the timestamp in DF1 is within 5 minutes of the timestamp in DF2? When a matching record exists, its timestamp in DF2 will always be slightly later than in DF1, because the event fires at some later point.

 ['employee_id', 'session_id', 'timestamp<5minutes'] 

Edit: I assumed that someone had encountered this problem before.

Here is what I came up with:

  • Take the timestamp in each data frame.
  • Create a column that is the timestamp + 5 minutes (rounded).
  • Create a column that is the timestamp - 5 minutes (rounded).
  • Create a 10-minute interval string to join the files on.

    df1['low_time'] = df1['start_time'] - timedelta(minutes=5)
    df1['high_time'] = df1['start_time'] + timedelta(minutes=5)
    df1['interval_string'] = df1['low_time'].astype(str) + df1['high_time'].astype(str)

Does anyone know how to round these timestamps to the nearest 5-minute mark?

02:59:37 - 5 min = 02:55:00

02:59:37 + 5 min = 03:05:00

    interval_string = '02:55:00-03:05:00'

    pd.merge(df1, df2, how='left', on=['employee_id', 'session_id', 'date', 'interval_string'])

Does anyone know how to round times like this? It seems like it would work: you still match the date, employee, and session, and then look for a time that falls within the same 10-minute interval or range.
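For reference, pandas can do this rounding directly with Series.dt.round — a minimal sketch reproducing the example above (the column and variable names are illustrative, not from the original data):

```python
import pandas as pd

t = pd.Series(pd.to_datetime(["2016-01-01 02:59:37"]))

# round (timestamp ± 5 minutes) to the nearest 5-minute mark
low = (t - pd.Timedelta(minutes=5)).dt.round("5min")
high = (t + pd.Timedelta(minutes=5)).dt.round("5min")

# build the interval string proposed in the question
interval_string = low.dt.strftime("%H:%M:%S") + "-" + high.dt.strftime("%H:%M:%S")
# interval_string.iloc[0] -> '02:55:00-03:05:00'
```

dt.floor("5min") and dt.ceil("5min") are also available if truncating toward a boundary is preferred over rounding to the nearest mark.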

1 answer

Consider the following mini version of your problem:

    from io import StringIO
    from pandas import read_csv, to_datetime

    # how close do sessions have to be to be considered equal? (in minutes)
    threshold = 5

    # datetime column (combination of date + start_time)
    dtc = [['date', 'start_time']]

    # index column (above combination)
    ixc = 'date_start_time'

    df1 = read_csv(StringIO(u'''
    date,start_time,employee_id,session_id
    01/01/2016,02:03:00,7261824,871631182
    01/01/2016,06:03:00,7261824,871631183
    01/01/2016,11:01:00,7261824,871631184
    01/01/2016,14:01:00,7261824,871631185
    '''), parse_dates=dtc)

    df2 = read_csv(StringIO(u'''
    date,start_time,employee_id,session_id
    01/01/2016,02:03:00,7261824,871631182
    01/01/2016,06:05:00,7261824,871631183
    01/01/2016,11:04:00,7261824,871631184
    01/01/2016,14:10:00,7261824,871631185
    '''), parse_dates=dtc)

which gives

    >>> df1
          date_start_time  employee_id  session_id
    0 2016-01-01 02:03:00      7261824   871631182
    1 2016-01-01 06:03:00      7261824   871631183
    2 2016-01-01 11:01:00      7261824   871631184
    3 2016-01-01 14:01:00      7261824   871631185
    >>> df2
          date_start_time  employee_id  session_id
    0 2016-01-01 02:03:00      7261824   871631182
    1 2016-01-01 06:05:00      7261824   871631183
    2 2016-01-01 11:04:00      7261824   871631184
    3 2016-01-01 14:10:00      7261824   871631185

You want df2[0:3] to be treated as duplicates of df1[0:3] during the merge (since each pair is less than 5 minutes apart), but df1[3] and df2[3] to be treated as separate sessions.

Solution 1: Interval matching

This is essentially what you propose in your edit: match the timestamps in both tables to a 10-minute interval centered on the timestamp rounded to the nearest 5 minutes.

Each interval can be represented uniquely by its midpoint, so you can merge the data frames on the timestamp rounded to the nearest 5 minutes. For instance:

    import numpy as np

    # threshold in nanoseconds
    threshold_ns = threshold * 60 * 1e9

    # compute the "interval" to which each session belongs
    df1['interval'] = to_datetime(np.round(df1.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)
    df2['interval'] = to_datetime(np.round(df2.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)

    # join on the interval
    cols = ['interval', 'employee_id', 'session_id']
    print(df1.merge(df2, on=cols, how='outer')[cols])

which prints

                 interval  employee_id  session_id
    0 2016-01-01 02:05:00      7261824   871631182
    1 2016-01-01 06:05:00      7261824   871631183
    2 2016-01-01 11:00:00      7261824   871631184
    3 2016-01-01 14:00:00      7261824   871631185
    4 2016-01-01 11:05:00      7261824   871631184
    5 2016-01-01 14:10:00      7261824   871631185

Note that this is not entirely correct. Sessions df1[2] and df2[2] are not treated as duplicates, even though they are only 3 minutes apart, because they fell on opposite sides of an interval boundary.
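The boundary effect can be checked in isolation by applying the same rounding that Solution 1 uses to those two timestamps (a standalone sketch; to_interval is a hypothetical helper wrapping the rounding step):

```python
import pandas as pd

threshold_ns = 5 * 60 * 1_000_000_000  # 5 minutes in nanoseconds

def to_interval(ts):
    # round a Timestamp to the nearest 5-minute mark, as in Solution 1
    return pd.to_datetime(round(ts.value / threshold_ns) * threshold_ns)

t1 = pd.Timestamp("2016-01-01 11:01:00")
t2 = pd.Timestamp("2016-01-01 11:04:00")

# only 3 minutes apart, yet they land in different buckets:
to_interval(t1)  # 2016-01-01 11:00:00
to_interval(t2)  # 2016-01-01 11:05:00
```

Because bucketing is done independently per row, any fixed grid will split some pairs that are closer than the threshold.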

Solution 2: One-to-one matching

Here is another approach, which relies on each session in df1 having zero or one duplicate in df2.

We replace each timestamp in df1 with the closest timestamp in df2 that matches on employee_id and session_id and is less than 5 minutes away.

    from datetime import timedelta

    # get the closest match in "df2" for a row of "df1" (as long as it is below the threshold)
    def closest(row):
        matches = df2.loc[(df2.employee_id == row.employee_id) &
                          (df2.session_id == row.session_id)]
        deltas = matches.date_start_time - row.date_start_time
        deltas = deltas.loc[deltas <= timedelta(minutes=threshold)]
        try:
            return matches.loc[deltas.idxmin()]
        except ValueError:  # no items
            return row

    # replace timestamps in "df1" with the closest timestamps in "df2"
    df1 = df1.apply(closest, axis=1)

    # join
    cols = ['date_start_time', 'employee_id', 'session_id']
    print(df1.merge(df2, on=cols, how='outer')[cols])

which prints

        date_start_time  employee_id  session_id
    0 2016-01-01 02:03:00      7261824   871631182
    1 2016-01-01 06:05:00      7261824   871631183
    2 2016-01-01 11:04:00      7261824   871631184
    3 2016-01-01 14:01:00      7261824   871631185
    4 2016-01-01 14:10:00      7261824   871631185

This approach is much slower, since every row of df1 triggers a search over all of df2. What I wrote can probably be optimized further, but it will still take a long time on large data sets.
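As a side note: since pandas 0.19, pd.merge_asof performs this kind of nearest-key join natively, with a distance cap (tolerance) and per-group matching (by). A sketch on the same mini data set; the df2_start_time column is a helper added here only to make the match visible:

```python
import pandas as pd

df1 = pd.DataFrame({
    "date_start_time": pd.to_datetime(
        ["2016-01-01 02:03:00", "2016-01-01 06:03:00",
         "2016-01-01 11:01:00", "2016-01-01 14:01:00"]),
    "employee_id": 7261824,
    "session_id": [871631182, 871631183, 871631184, 871631185],
})
df2 = pd.DataFrame({
    "date_start_time": pd.to_datetime(
        ["2016-01-01 02:03:00", "2016-01-01 06:05:00",
         "2016-01-01 11:04:00", "2016-01-01 14:10:00"]),
    "employee_id": 7261824,
    "session_id": [871631182, 871631183, 871631184, 871631185],
})

# keep a copy of df2's timestamp so the match is visible after the join
df2["df2_start_time"] = df2["date_start_time"]

# both frames must be sorted on the "on" key
merged = pd.merge_asof(
    df1.sort_values("date_start_time"),
    df2.sort_values("date_start_time"),
    on="date_start_time",
    by=["employee_id", "session_id"],
    tolerance=pd.Timedelta(minutes=5),
    direction="forward",  # df2 events fire slightly later than df1's
)
# rows 0-2 match; row 3 (14:01 vs 14:10) is 9 minutes apart -> no match (NaT)
```

Unlike the fixed-grid buckets of Solution 1, this measures the distance per pair, so the 11:01/11:04 sessions match, and it runs as a single sorted join rather than a per-row scan.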


Source: https://habr.com/ru/post/1240929/

