Pandas: merge on exact identifier and nearest date

Question

I am trying to merge two Pandas DataFrames on two columns. One column holds a unique identifier that could be used for a simple .merge() of the two DataFrames. The second column, however, would need .merge_asof(), because it has to match on the nearest date rather than an exact date.

There is a similar question here: Pandas Combine the name and the nearest date, but it was asked and answered almost three years ago, and merge_asof() is a much newer addition.

I asked a similar question here a couple of months ago, but that solution only needed merge_asof() without any exact matches.

In the interest of including some code, it would look something like this:

df = pd.merge_asof(df1, df2, left_on=['ID','date_time'], right_on=['ID','date_time'])

where ID would match exactly, but date_time would be a "close match."
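
To make the setup concrete, here is a minimal sketch with made-up data. The values and the val1 / val2 columns are assumptions for illustration, not the real frames. It shows why a plain merge on both columns is not enough: no timestamps coincide exactly, so nothing matches.

```python
import pandas as pd

# Hypothetical sample data; the values and the val1/val2 columns are invented.
df1 = pd.DataFrame({
    "ID": ["a", "a", "b"],
    "date_time": pd.to_datetime(
        ["2017-01-01 10:00", "2017-01-02 10:00", "2017-01-01 09:00"]
    ),
    "val1": [1, 2, 3],
})
df2 = pd.DataFrame({
    "ID": ["a", "b"],
    "date_time": pd.to_datetime(["2017-01-02 10:05", "2017-01-01 09:30"]),
    "val2": [10, 20],
})

# An exact merge on both columns returns no rows, because the
# timestamps in df1 and df2 are close but never identical.
exact = pd.merge(df1, df2, on=["ID", "date_time"])
print(len(exact))  # 0
```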

Any help is greatly appreciated.

python merge pandas

pshep123 Feb 17 '17 at 2:11

1 answer

Parfait · Accepted Answer · 2017-02-18T02:51:51+0000

Consider merging on ID first, then running DataFrame.apply to return, for each row, the latest date_time from the first DataFrame among the rows with the same ID that falls before that row's date_time from the second DataFrame.

import pandas as pd

# INITIAL MERGE (ALL PAIRINGS OF df1 AND df2 ROWS THAT SHARE AN ID)
mdf = pd.merge(df1, df2, on=['ID'])

def f(row):
    # LATEST df1 date_time FOR THIS ID THAT IS STRICTLY EARLIER THAN THE ROW'S df2 date_time
    col = mdf[(mdf['ID'] == row['ID']) &
              (mdf['date_time_x'] < row['date_time_y'])]['date_time_x'].max()
    return col

# FILTER BY MATCHED DATES TO CONDITIONAL MAX
mdf = mdf[mdf['date_time_x'] == mdf.apply(f, axis=1)].reset_index(drop=True)

This assumes you want to keep all df2 rows (i.e., a right join). For the left-join equivalent, swap the _x / _y suffixes.
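
As an alternative, merge_asof can handle the exact-ID part itself through its by parameter, which avoids the row-wise apply. The following is only a sketch under stated assumptions: the sample data, the val1 / val2 columns, and the date_time_df1 helper column are made up, and the direction argument needs pandas 0.20 or later. Note that direction="nearest" picks the closest timestamp on either side, while direction="backward" with allow_exact_matches=False would be closer to the "strictly earlier" rule used above.

```python
import pandas as pd

# Hypothetical sample data; the values and the val1/val2 columns are invented.
df1 = pd.DataFrame({
    "ID": ["a", "a", "b"],
    "date_time": pd.to_datetime(
        ["2017-01-01 10:00", "2017-01-02 10:00", "2017-01-01 09:00"]
    ),
    "val1": [1, 2, 3],
})
df2 = pd.DataFrame({
    "ID": ["a", "b"],
    "date_time": pd.to_datetime(["2017-01-02 10:05", "2017-01-01 09:30"]),
    "val2": [10, 20],
})

# merge_asof requires both frames to be sorted by the "on" key.
left = df2.sort_values("date_time")

# Keep a copy of df1's timestamp (date_time_df1 is a made-up name),
# since only the left frame's "on" column survives the merge.
right = df1.assign(date_time_df1=df1["date_time"]).sort_values("date_time")

# Exact match on ID (by=), nearest match on date_time (on= / direction=).
# Every row of df2 is kept, as in a left join from df2.
result = pd.merge_asof(left, right, on="date_time", by="ID", direction="nearest")
print(result)
```

An optional tolerance=pd.Timedelta(...) argument can cap how far apart two timestamps may be and still count as a match; rows without a match inside that window get NaN in the df1 columns.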
