How To Merge Two Data Frames Based On Nearest Date
Solution 1:
I don't think there's a quick, one-line way to do this kind of thing, but I believe the best approach is to do it this way:
add a column to df1 with the closest date from the appropriate group in df2
call a standard merge on these
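As a minimal sketch of these two steps, assuming small hypothetical frames named `left` and `right` (not the data used further down), a brute-force version compares each row's date against every candidate date in the matching `Code` group and keeps the closest one. The helper name `closest_date` is illustrative only, and this is the naive approach rather than the scikit-learn one described next:

```python
import pandas as pd

# Hypothetical toy frames, for illustration only.
left = pd.DataFrame({'Code': [0, 1],
                     'Date': pd.to_datetime(['2015-01-10', '2015-03-01']),
                     'val1': [1.0, 2.0]})
right = pd.DataFrame({'Code': [0, 0, 1],
                      'Date': pd.to_datetime(['2015-01-01', '2015-01-12', '2015-04-01']),
                      'val2': [10.0, 20.0, 30.0]})

def closest_date(row, candidates):
    # Candidate dates from the rows of `candidates` with the same Code.
    dates = candidates.loc[candidates['Code'] == row['Code'], 'Date']
    # Position of the smallest absolute time difference.
    pos = (dates - row['Date']).abs().to_numpy().argmin()
    return dates.iloc[pos]

# Step 1: keep the original date in Date1 and store the closest date as the merge key.
left_mod = left.copy()
left_mod['Date1'] = left_mod['Date']
left_mod['Date'] = pd.to_datetime(left.apply(closest_date, axis=1, candidates=right))

# Step 2: a standard merge on Code and the matched date.
merged = pd.merge(left_mod, right, on=['Code', 'Date'])
```

Each row of `left` is checked against every candidate in its group, which is fine for a handful of rows but is exactly the part that gets slow on larger data.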
As the size of your data grows, this "closest date" operation can become rather expensive unless you do something sophisticated. I like to use scikit-learn's NearestNeighbors code for this sort of thing.
I've put together one approach to that solution that should scale relatively well. First we can generate some simple data:
import pandas as pd
import numpy as np
dates = pd.date_range('2015', periods=200, freq='D')
rand = np.random.RandomState(42)
i1 = np.sort(rand.permutation(np.arange(len(dates)))[:5])
i2 = np.sort(rand.permutation(np.arange(len(dates)))[:5])
df1 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
                    'Date': dates[i1],
                    'val1': rand.rand(5)})
df2 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
                    'Date': dates[i2],
                    'val2': rand.rand(5)})
Let's check these out:
>>> df1
   Code       Date      val1
0     0 2015-01-16  0.975852
1     0 2015-01-31  0.516300
2     1 2015-04-06  0.322956
3     1 2015-05-09  0.795186
4     1 2015-06-08  0.270832
>>> df2
   Code       Date      val2
0     1 2015-02-03  0.184334
1     1 2015-04-13  0.080873
2     0 2015-05-02  0.428314
3     1 2015-06-26  0.688500
4     0 2015-06-30  0.058194
Now let's write an apply function that adds a column of nearest dates to df1 using scikit-learn:
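from sklearn.neighbors import NearestNeighbors

def find_nearest(group, match, groupname):
    # Keep only the rows of `match` that belong to this group
    match = match[match[groupname] == group.name]
    # Pass the dates as integers so scikit-learn receives a plain numeric array
    nbrs = NearestNeighbors(n_neighbors=1).fit(match['Date'].values.astype('int64')[:, None])
    dist, ind = nbrs.kneighbors(group['Date'].values.astype('int64')[:, None])
    group = group.copy()  # work on a copy rather than the group handed in by apply
    group['Date1'] = group['Date']  # preserve the original date
    group['Date'] = match['Date'].values[ind.ravel()]  # nearest date becomes the merge key
    return group

# group_keys=False keeps the original row index instead of prepending Code to it
df1_mod = df1.groupby('Code', group_keys=False).apply(find_nearest, df2, 'Code')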
>>> df1_mod
   Code       Date      val1      Date1
0     0 2015-05-02  0.975852 2015-01-16
1     0 2015-05-02  0.516300 2015-01-31
2     1 2015-04-13  0.322956 2015-04-06
3     1 2015-04-13  0.795186 2015-05-09
4     1 2015-06-26  0.270832 2015-06-08
Finally, we can merge these together with a straightforward call to pd.merge:
>>> pd.merge(df1_mod, df2, on=['Code', 'Date'])
   Code       Date      val1      Date1      val2
0     0 2015-05-02  0.975852 2015-01-16  0.428314
1     0 2015-05-02  0.516300 2015-01-31  0.428314
2     1 2015-04-13  0.322956 2015-04-06  0.080873
3     1 2015-04-13  0.795186 2015-05-09  0.080873
4     1 2015-06-26  0.270832 2015-06-08  0.688500
Notice that rows 0 and 1 both matched the same val2; this is expected given the way you described your desired solution.
Solution 2:
Here's an alternative solution:
Merge on Code.
Add a date difference column according to your need (I used abs in the example below) and sort the data using the new column.
Group by the records of the first data frame and for each group take a record from the second data frame with the closest date.
Code:
df = df1.reset_index()[column_names1].merge(df2[column_names2], on='Code')
df['DateDiff'] = (df['Date1'] - df['Date2']).abs()
df.sort_values('DateDiff').groupby('index').first().reset_index()
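The snippet above leaves `column_names1` and `column_names2` undefined and expects date columns named `Date1` and `Date2`. A minimal self-contained sketch of the same three steps, assuming hypothetical toy frames and a rename of each `Date` column before merging (both are assumptions, not part of the original answer), could look like this:

```python
import pandas as pd

# Hypothetical toy frames, for illustration only.
df1 = pd.DataFrame({'Code': [0, 1],
                    'Date': pd.to_datetime(['2015-01-10', '2015-03-01']),
                    'val1': [1.0, 2.0]})
df2 = pd.DataFrame({'Code': [0, 0, 1],
                    'Date': pd.to_datetime(['2015-01-01', '2015-01-12', '2015-04-01']),
                    'val2': [10.0, 20.0, 30.0]})

# Step 1: merge on Code, pairing each df1 row with every df2 row of that Code.
# reset_index() keeps the original df1 row label in an 'index' column.
left = df1.reset_index().rename(columns={'Date': 'Date1'})
right = df2.rename(columns={'Date': 'Date2'})
df = left.merge(right, on='Code')

# Step 2: absolute date difference, used as the sort key.
df['DateDiff'] = (df['Date1'] - df['Date2']).abs()

# Step 3: for each original df1 row, keep the pairing with the smallest difference.
result = df.sort_values('DateDiff').groupby('index').first().reset_index()
```

Because the merge materializes every pairing within a `Code`, the intermediate frame grows with the product of the group sizes, so this approach suits small to moderate data.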