How to combine two data frames based on the nearest date

I want to combine two data frames based on two columns: Code and Date. Combining them on "Code" is easy, but "Date" is trickier: the dates in df1 and df2 never match exactly. So I want to match each row to the closest date. How can I do this?

df = df1[column_names1].merge(df2[column_names2], on='Code') 
2 answers

I don't think there is a quick, one-liner way to do this, but I believe the best approach is to do it this way:

  • add a column in df1 with the nearest date from the corresponding group in df2

  • perform a standard merge on these

As your data grows in size, this "nearest date" operation can become quite expensive unless you do something more sophisticated. I like to use scikit-learn's NearestNeighbors for this kind of thing.

I have put together one approach to this solution that should scale relatively well. First, we can generate some simple data:

    import pandas as pd
    import numpy as np

    dates = pd.date_range('2015', periods=200, freq='D')

    rand = np.random.RandomState(42)
    i1 = np.sort(rand.permutation(np.arange(len(dates)))[:5])
    i2 = np.sort(rand.permutation(np.arange(len(dates)))[:5])

    df1 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
                        'Date': dates[i1],
                        'val1': rand.rand(5)})
    df2 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
                        'Date': dates[i2],
                        'val2': rand.rand(5)})

Let's take a look at it:

    >>> df1
       Code       Date      val1
    0     0 2015-01-16  0.975852
    1     0 2015-01-31  0.516300
    2     1 2015-04-06  0.322956
    3     1 2015-05-09  0.795186
    4     1 2015-06-08  0.270832

    >>> df2
       Code       Date      val2
    0     1 2015-02-03  0.184334
    1     1 2015-04-13  0.080873
    2     0 2015-05-02  0.428314
    3     1 2015-06-26  0.688500
    4     0 2015-06-30  0.058194

Now let's write an apply function that adds a column of nearest dates to df1 using scikit-learn:

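    from sklearn.neighbors import NearestNeighbors

    def find_nearest(group, match, groupname):
        # keep only the rows of `match` that share this group's code
        match = match[match[groupname] == group.name]
        # scikit-learn expects numeric input, so pass the dates as int64 timestamps
        nbrs = NearestNeighbors(n_neighbors=1).fit(
            match['Date'].values.astype('int64')[:, None])
        dist, ind = nbrs.kneighbors(group['Date'].values.astype('int64')[:, None])

        group = group.copy()
        group['Date1'] = group['Date']                      # original df1 date
        group['Date'] = match['Date'].values[ind.ravel()]   # nearest df2 date
        return group

    # group_keys=False keeps df1's original index on the result
    df1_mod = df1.groupby('Code', group_keys=False).apply(find_nearest, df2, 'Code')

    >>> df1_mod
       Code       Date      val1      Date1
    0     0 2015-05-02  0.975852 2015-01-16
    1     0 2015-05-02  0.516300 2015-01-31
    2     1 2015-04-13  0.322956 2015-04-06
    3     1 2015-04-13  0.795186 2015-05-09
    4     1 2015-06-26  0.270832 2015-06-08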

Finally, we can combine them with a direct call to pd.merge:

    >>> pd.merge(df1_mod, df2, on=['Code', 'Date'])
       Code       Date      val1      Date1      val2
    0     0 2015-05-02  0.975852 2015-01-16  0.428314
    1     0 2015-05-02  0.516300 2015-01-31  0.428314
    2     1 2015-04-13  0.322956 2015-04-06  0.080873
    3     1 2015-04-13  0.795186 2015-05-09  0.080873
    4     1 2015-06-26  0.270832 2015-06-08  0.688500

Note that rows 0 and 1 both match the same val2; this is expected, given how you described your desired solution.
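For comparison, newer pandas versions also provide pd.merge_asof, which can match on the nearest key within groups. The following is only a sketch under that assumption, not part of the approach above; it rebuilds the sample frames from the printed output, and the MatchedDate column name is made up here for illustration.

```python
import pandas as pd

# Sample data copied from the printed df1 and df2 above
df1 = pd.DataFrame({
    "Code": [0, 0, 1, 1, 1],
    "Date": pd.to_datetime(["2015-01-16", "2015-01-31", "2015-04-06",
                            "2015-05-09", "2015-06-08"]),
    "val1": [0.975852, 0.516300, 0.322956, 0.795186, 0.270832],
})
df2 = pd.DataFrame({
    "Code": [1, 1, 0, 1, 0],
    "Date": pd.to_datetime(["2015-02-03", "2015-04-13", "2015-05-02",
                            "2015-06-26", "2015-06-30"]),
    "val2": [0.184334, 0.080873, 0.428314, 0.688500, 0.058194],
})

# merge_asof keeps only the left frame's "on" column, so copy df2's date
# into a separate (hypothetical) column to see which date was matched.
right = df2.assign(MatchedDate=df2["Date"]).sort_values("Date")
left = df1.sort_values("Date")  # both frames must be sorted by the "on" key

# by="Code" restricts matches to rows with the same code;
# direction="nearest" picks the closest date, earlier or later.
merged = pd.merge_asof(left, right, on="Date", by="Code", direction="nearest")
print(merged)
```

On this sample the matched val2 values agree with the merged table above.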


Here's an alternative solution:

  • Merge the two data frames on Code.

  • Add a date difference column according to your needs (I used abs in the example below) and sort the data by the new column.

  • Group by the rows of the first data frame and, for each group, take the row from the second data frame with the nearest date.

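Code:

    # Date1 and Date2 are the date columns of df1 and df2, assumed to be
    # renamed beforehand so that they do not clash in the merge.
    # reset_index() comes after the column selection so 'index' is kept.
    df = df1[column_names1].reset_index().merge(df2[column_names2], on='Code')
    df['DateDiff'] = (df['Date1'] - df['Date2']).abs()
    result = df.sort_values('DateDiff').groupby('index').first().reset_index()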
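A self-contained version of this idea, run on the sample data from the first answer, might look like the sketch below. The `suffixes=("1", "2")` argument is an assumption made here to obtain the Date1 and Date2 column names; the snippet above does not show how those columns are named.

```python
import pandas as pd

# Sample data copied from the printed df1 and df2 in the first answer
df1 = pd.DataFrame({
    "Code": [0, 0, 1, 1, 1],
    "Date": pd.to_datetime(["2015-01-16", "2015-01-31", "2015-04-06",
                            "2015-05-09", "2015-06-08"]),
    "val1": [0.975852, 0.516300, 0.322956, 0.795186, 0.270832],
})
df2 = pd.DataFrame({
    "Code": [1, 1, 0, 1, 0],
    "Date": pd.to_datetime(["2015-02-03", "2015-04-13", "2015-05-02",
                            "2015-06-26", "2015-06-30"]),
    "val2": [0.184334, 0.080873, 0.428314, 0.688500, 0.058194],
})

# Pair every df1 row with every df2 row that has the same Code.
# The clashing "Date" columns become "Date1" (df1) and "Date2" (df2).
pairs = df1.reset_index().merge(df2, on="Code", suffixes=("1", "2"))

# Absolute distance between the two dates of each pair
pairs["DateDiff"] = (pairs["Date1"] - pairs["Date2"]).abs()

# After sorting, the first row per original df1 index is the closest match.
nearest = (
    pairs.sort_values("DateDiff")
         .groupby("index")
         .first()
         .reset_index()
)
print(nearest)
```

Because the merge pairs every row of df1 with every row of df2 sharing its Code, the intermediate frame grows with the product of the group sizes, so this is best suited to small or moderately sized data.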

Source: https://habr.com/ru/post/1234817/

