Consider the following mini version of your problem:
from io import StringIO
from pandas import read_csv, to_datetime
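# a minimal sketch of the setup (not necessarily the original code), rebuilding
# df1 and df2 from the column names and values in the printout just below
df1 = read_csv(StringIO('''\
date_start_time,employee_id,session_id
2016-01-01 02:03:00,7261824,871631182
2016-01-01 06:03:00,7261824,871631183
2016-01-01 11:01:00,7261824,871631184
2016-01-01 14:01:00,7261824,871631185
'''), parse_dates=['date_start_time'])

df2 = read_csv(StringIO('''\
date_start_time,employee_id,session_id
2016-01-01 02:03:00,7261824,871631182
2016-01-01 06:05:00,7261824,871631183
2016-01-01 11:04:00,7261824,871631184
2016-01-01 14:10:00,7261824,871631185
'''), parse_dates=['date_start_time'])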
which gives
>>> df1
      date_start_time  employee_id  session_id
0 2016-01-01 02:03:00      7261824   871631182
1 2016-01-01 06:03:00      7261824   871631183
2 2016-01-01 11:01:00      7261824   871631184
3 2016-01-01 14:01:00      7261824   871631185
>>> df2
      date_start_time  employee_id  session_id
0 2016-01-01 02:03:00      7261824   871631182
1 2016-01-01 06:05:00      7261824   871631183
2 2016-01-01 11:04:00      7261824   871631184
3 2016-01-01 14:10:00      7261824   871631185
You want to treat the rows of df2[0:3] as duplicates of the rows of df1[0:3] when merging (their timestamps are within 5 minutes of each other), but treat df1[3] and df2[3] as separate sessions (they are 9 minutes apart).
Solution 1: Match on a Rounded Interval
This is essentially what you suggest in your edit: map the timestamps in both tables to intervals by rounding each one to the nearest 5 minutes. Each interval is uniquely identified by its midpoint (the rounded timestamp), so you can merge the data frames on the timestamp rounded to the nearest 5 minutes. For instance:
import numpy as np
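# a minimal sketch of the remaining steps (not necessarily the original code):
# round every timestamp to the nearest 5 minutes via its int64 nanosecond
# value, then merge on the rounded "interval" column
ns5min = 5 * 60 * 10**9  # five minutes in nanoseconds
for df in (df1, df2):
    ns = df.date_start_time.astype(np.int64)
    df['interval'] = to_datetime((np.round(ns / ns5min) * ns5min).astype(np.int64))

# outer merge on the rounded timestamp plus the identifying columns, so that
# sessions without a counterpart in the other frame survive as separate rows
cols = ['interval', 'employee_id', 'session_id']
print(df1.merge(df2, on=cols, how='outer')[cols])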
which prints
             interval  employee_id  session_id
0 2016-01-01 02:05:00      7261824   871631182
1 2016-01-01 06:05:00      7261824   871631183
2 2016-01-01 11:00:00      7261824   871631184
3 2016-01-01 14:00:00      7261824   871631185
4 2016-01-01 11:05:00      7261824   871631184
5 2016-01-01 14:10:00      7261824   871631185
Note that this is not entirely correct: df1[2] and df2[2] are not treated as duplicates even though they are only 3 minutes apart, because they fall on opposite sides of an interval boundary (11:01 rounds to 11:00, while 11:04 rounds to 11:05).
Solution 2: One-to-One Match
Here is another approach, which depends on each session in df1 having zero or one duplicate in df2.
We replace each timestamp in df1 with the closest timestamp in df2 that has the same employee_id and session_id and is less than 5 minutes away.
from datetime import timedelta

threshold = 5  # maximum allowed difference between matching timestamps, in minutes

# get the closest match in "df2" for a row of "df1" (as long as it is within the threshold)
def closest(row):
    matches = df2.loc[(df2.employee_id == row.employee_id) &
                      (df2.session_id == row.session_id)]
    # absolute time difference, so earlier and later matches are treated alike
    deltas = (matches.date_start_time - row.date_start_time).abs()
    deltas = deltas.loc[deltas <= timedelta(minutes=threshold)]
    try:
        return matches.loc[deltas.idxmin()]
    except ValueError:  # no match within the threshold
        return row

# replace timestamps in "df1" with the closest timestamps in "df2"
df1 = df1.apply(closest, axis=1)

# merge on the now-aligned timestamps
cols = ['date_start_time', 'employee_id', 'session_id']
print(df1.merge(df2, on=cols, how='outer')[cols])
which prints
      date_start_time  employee_id  session_id
0 2016-01-01 02:03:00      7261824   871631182
1 2016-01-01 06:05:00      7261824   871631183
2 2016-01-01 11:04:00      7261824   871631184
3 2016-01-01 14:01:00      7261824   871631185
4 2016-01-01 14:10:00      7261824   871631185
This approach is much slower, since every row of df1 triggers a search over all of df2. What I wrote could probably be optimized further, but it will still take a long time on large data sets.
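One possible optimization on newer pandas versions (not part of the answer above, so treat it as a sketch) is to vectorize the closest-match step with pandas.merge_asof, which pairs each row with the nearest timestamp in the other frame for the same employee_id and session_id, subject to a tolerance. The names matched and df2_time below are purely illustrative, and the sketch assumes df1 and df2 as originally constructed (before the apply above overwrote df1):

import pandas as pd

# align each df1 row with the nearest df2 timestamp (same employee and session)
# that is within 5 minutes; unmatched rows get NaT in "df2_time"
matched = pd.merge_asof(
    df1.sort_values('date_start_time'),
    df2.sort_values('date_start_time').rename(columns={'date_start_time': 'df2_time'}),
    left_on='date_start_time', right_on='df2_time',
    by=['employee_id', 'session_id'],
    tolerance=pd.Timedelta(minutes=5),
    direction='nearest')

# adopt df2's timestamp where a match was found, then merge as before
matched['date_start_time'] = matched['df2_time'].fillna(matched['date_start_time'])
cols = ['date_start_time', 'employee_id', 'session_id']
print(matched[cols].merge(df2, on=cols, how='outer')[cols])

The sort_values calls and the rename are only there to satisfy merge_asof's requirements: both frames must be sorted on the merge key, and keeping the two timestamp columns under different names makes it easy to fall back to df1's own timestamp when no match is found.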