How to combine two data frames based on the nearest date

Question


I want to combine two data frames based on two columns: Code and Date. Combining them on Code is easy, but Date is more complicated: the dates in df1 and df2 never match exactly, so I want to pick the nearest dates. How can I do this?

df = df1[column_names1].merge(df2[column_names2], on='Code')

+5

python pandas dataframe

Klausos klausos Oct 29 '15 at 18:06


2 answers

Here's an alternative solution:

Merge the two data frames on Code.
Add a date-difference column suited to your needs (I used abs in the example below) and sort the data by this new column.
Group by the rows of the first data frame and, for each group, take the record from the second data frame with the nearest date.

code:

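    # Reset the index after selecting columns so each df1 row keeps an identifier ('index').
    df = df1[column_names1].reset_index().merge(df2[column_names2], on='Code')
    # Date1 and Date2 stand for the date columns that came from df1 and df2.
    df['DateDiff'] = (df['Date1'] - df['Date2']).abs()
    # For each original df1 row, keep the match with the smallest date difference.
    df = df.sort_values('DateDiff').groupby('index').first().reset_index()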
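As a rough illustration, the three steps can be run end to end on a small made-up data set. The sample values and the column names Date1 and Date2 are assumptions chosen for this sketch, not data from the question.

```python
import pandas as pd

# Hypothetical sample data; Date1/Date2 are illustrative column names.
df1 = pd.DataFrame({
    'Code': [0, 0, 1],
    'Date1': pd.to_datetime(['2015-01-16', '2015-01-31', '2015-04-06']),
    'val1': [0.1, 0.2, 0.3],
})
df2 = pd.DataFrame({
    'Code': [1, 1, 0, 0],
    'Date2': pd.to_datetime(['2015-02-03', '2015-04-13',
                             '2015-05-02', '2015-01-20']),
    'val2': [10.0, 20.0, 30.0, 40.0],
})

# Step 1: pair every df1 row with every df2 row that shares its Code.
pairs = df1.reset_index().merge(df2, on='Code')

# Step 2: absolute gap between the two dates.
pairs['DateDiff'] = (pairs['Date1'] - pairs['Date2']).abs()

# Step 3: for each original df1 row, keep the pair with the smallest gap.
nearest = (pairs.sort_values('DateDiff')
                .groupby('index')
                .first()
                .reset_index(drop=True))

print(nearest)
```

Note that the merge in step 1 builds every df1/df2 pair within a Code, so the intermediate frame grows with the product of the group sizes. That is fine for small data but can get heavy for large groups.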

0

Eyal Shulman Sep 18 '16 at 17:07


jakevdp · Accepted Answer · 2015-10-30T14:28:57+0000

I don't think there is a quick one-line way to do this, but I believe the best approach is as follows:

add a column to df1 containing the nearest date from the corresponding group in df2
do a standard merge on these

As your data grows in size, this nearest-date operation can become quite expensive unless you do something more sophisticated. I like to use scikit-learn's NearestNeighbors for this kind of thing.
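To give a feel for what a cheaper nearest-date lookup can look like, here is a minimal sketch that uses a sorted search with NumPy instead of scikit-learn. This is a swapped-in technique for illustration only, and the helper name nearest_dates is hypothetical. The dates in the usage lines are the Code 1 rows of the sample data shown further down.

```python
import numpy as np
import pandas as pd

def nearest_dates(queries, candidates):
    """Hypothetical helper: for each query date, return the closest candidate date."""
    cand = np.sort(pd.DatetimeIndex(candidates).values)  # sorted datetime64 array
    q = pd.DatetimeIndex(queries).values
    # Position at which each query would be inserted to keep cand sorted.
    pos = np.searchsorted(cand, q)
    # Neighbours on either side, clipped so the indices stay inside the array.
    left = cand[np.clip(pos - 1, 0, len(cand) - 1)]
    right = cand[np.clip(pos, 0, len(cand) - 1)]
    # Pick whichever neighbour is closer (ties go to the earlier date).
    use_left = np.abs(q - left) <= np.abs(right - q)
    return pd.DatetimeIndex(np.where(use_left, left, right))

result = nearest_dates(
    ['2015-04-06', '2015-05-09', '2015-06-08'],   # query dates
    ['2015-02-03', '2015-04-13', '2015-06-26'],   # candidate dates
)
print(result)
```

Each lookup is a binary search over the sorted candidates rather than a comparison against every candidate, which is the kind of saving a nearest-neighbor structure also aims for.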

I have put together one approach to this solution that should scale relatively well. First, we can generate some simple data:

    import pandas as pd
    import numpy as np

    dates = pd.date_range('2015', periods=200, freq='D')

    rand = np.random.RandomState(42)
    i1 = np.sort(rand.permutation(np.arange(len(dates)))[:5])
    i2 = np.sort(rand.permutation(np.arange(len(dates)))[:5])

    df1 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
                        'Date': dates[i1],
                        'val1': rand.rand(5)})
    df2 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
                        'Date': dates[i2],
                        'val2': rand.rand(5)})

Let's take a look at them:

    >>> df1
       Code       Date      val1
    0     0 2015-01-16  0.975852
    1     0 2015-01-31  0.516300
    2     1 2015-04-06  0.322956
    3     1 2015-05-09  0.795186
    4     1 2015-06-08  0.270832
    >>> df2
       Code       Date      val2
    0     1 2015-02-03  0.184334
    1     1 2015-04-13  0.080873
    2     0 2015-05-02  0.428314
    3     1 2015-06-26  0.688500
    4     0 2015-06-30  0.058194

Now we'll write an apply function that adds a column of nearest dates to df1 using scikit-learn:

    from sklearn.neighbors import NearestNeighbors

    def find_nearest(group, match, groupname):
        # Restrict the candidates to the rows of `match` in the same group.
        match = match[match[groupname] == group.name]
        # n_neighbors is passed by keyword, which recent scikit-learn releases require.
        nbrs = NearestNeighbors(n_neighbors=1).fit(match['Date'].values[:, None])
        dist, ind = nbrs.kneighbors(group['Date'].values[:, None])

        # Keep the original date as Date1 and overwrite Date with the nearest match.
        group['Date1'] = group['Date']
        group['Date'] = match['Date'].values[ind.ravel()]
        return group

    df1_mod = df1.groupby('Code').apply(find_nearest, df2, 'Code')

    >>> df1_mod
       Code       Date      val1      Date1
    0     0 2015-05-02  0.975852 2015-01-16
    1     0 2015-05-02  0.516300 2015-01-31
    2     1 2015-04-13  0.322956 2015-04-06
    3     1 2015-04-13  0.795186 2015-05-09
    4     1 2015-06-26  0.270832 2015-06-08

Finally, we can combine them with a direct call to pd.merge:

    >>> pd.merge(df1_mod, df2, on=['Code', 'Date'])
       Code       Date      val1      Date1      val2
    0     0 2015-05-02  0.975852 2015-01-16  0.428314
    1     0 2015-05-02  0.516300 2015-01-31  0.428314
    2     1 2015-04-13  0.322956 2015-04-06  0.080873
    3     1 2015-04-13  0.795186 2015-05-09  0.080873
    4     1 2015-06-26  0.270832 2015-06-08  0.688500

Note that rows 0 and 1 both match the same val2; this is expected, given how you described your desired solution.
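For comparison, pandas releases newer than this answer provide pd.merge_asof, which can do a nearest-key join within groups in a single call. The sketch below is not part of the original approach; it rebuilds the sample frames from the values printed above and assumes a pandas version where merge_asof accepts direction='nearest'. The extra Date2 column is added only so the matched date stays visible.

```python
import pandas as pd

# Sample frames rebuilt from the values printed above.
df1 = pd.DataFrame({
    'Code': [0, 0, 1, 1, 1],
    'Date': pd.to_datetime(['2015-01-16', '2015-01-31', '2015-04-06',
                            '2015-05-09', '2015-06-08']),
    'val1': [0.975852, 0.516300, 0.322956, 0.795186, 0.270832],
})
df2 = pd.DataFrame({
    'Code': [1, 1, 0, 1, 0],
    'Date': pd.to_datetime(['2015-02-03', '2015-04-13', '2015-05-02',
                            '2015-06-26', '2015-06-30']),
    'val2': [0.184334, 0.080873, 0.428314, 0.688500, 0.058194],
})

# merge_asof requires both frames to be sorted by the `on` key.
left = df1.sort_values('Date')
# Copy df2's date into Date2 so the matched date survives the merge.
right = df2.assign(Date2=df2['Date']).sort_values('Date')

# by='Code' restricts matches to rows with the same Code;
# direction='nearest' picks the closest date on either side.
merged = pd.merge_asof(left, right, on='Date', by='Code', direction='nearest')

print(merged)
```

On this data the val2 column agrees with the pd.merge result above; here Date holds the original df1 date and Date2 holds the matched df2 date.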
