Merge pandas dataframe with key duplicates

Question


I have two DataFrames. Both have a key column that can contain duplicates, and the two frames mostly share the same duplicated keys. I would like to merge them on this key so that duplicates are paired up in order: the first occurrence of a key in one frame with the first occurrence in the other, the second with the second, and so on. If one frame has more occurrences of a key than the other, the missing values should be filled with NaN. For instance:

    df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K2', 'K2', 'K3'],
                        'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']},
                       columns=['key', 'A'])
    df2 = pd.DataFrame({'B': ['B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6'],
                        'key': ['K0', 'K1', 'K2', 'K2', 'K3', 'K3', 'K4']},
                       columns=['key', 'B'])

      key   A
    0  K0  A0
    1  K1  A1
    2  K2  A2
    3  K2  A3
    4  K2  A4
    5  K3  A5

      key   B
    0  K0  B0
    1  K1  B1
    2  K2  B2
    3  K2  B3
    4  K3  B4
    5  K3  B5
    6  K4  B6

I am trying to get the following output

       key    A    B
    0   K0   A0   B0
    1   K1   A1   B1
    2   K2   A2   B2
    3   K2   A3   B3
    6   K2   A4  NaN
    8   K3   A5   B4
    9   K3  NaN   B5
    10  K4  NaN   B6

So basically, I would like to treat the duplicated K2 keys as K2_1, K2_2, ... and then do a how='outer' merge on the DataFrames. Any ideas how I can do this?


python merge pandas dataframe

dcmm88 Nov 13 '16 at 15:26


1 answer

piRSquared · Accepted Answer · 2016-11-13T15:30:31+0000

faster

    %%cython
    # using cython in jupyter notebook
    # in another cell run `%load_ext Cython`
    from collections import defaultdict
    import numpy as np

    def cg(x):
        # running count of how many times each key has been seen so far
        cnt = defaultdict(lambda: 0)
        for j in x.tolist():
            cnt[j] += 1
            yield cnt[j]

    def fastcount(x):
        return [i for i in cg(x)]

    # then, in a regular Python cell:
    df1['cc'] = fastcount(df1.key.values)
    df2['cc'] = fastcount(df2.key.values)

    # merges on the shared columns 'key' and 'cc'
    df1.merge(df2, how='outer').drop('cc', axis=1)

quick response; not scalable

    def fastcount(x):
        unq, inv = np.unique(x, return_inverse=True)
        # one row per unique key, one column per input row
        m = np.arange(len(unq))[:, None] == inv
        return (m.cumsum(1) * m).sum(0)

    df1['cc'] = fastcount(df1.key.values)
    df2['cc'] = fastcount(df2.key.values)

    df1.merge(df2, how='outer').drop('cc', axis=1)
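To see why this variant does not scale, here is a small sketch of its intermediate steps on the `key` column of `df1`. The boolean matrix `m` has one row per unique key and one column per input row, so its size grows with (number of unique keys) × (number of rows). The variable names `keys` and `counts` are only illustrative.

```python
import numpy as np

keys = np.array(['K0', 'K1', 'K2', 'K2', 'K2', 'K3'])

unq, inv = np.unique(keys, return_inverse=True)

# m[i, j] is True when input row j holds the i-th unique key.
m = np.arange(len(unq))[:, None] == inv

# Running count along each row, masked so that only matching positions
# survive, then collapsed to a single count per input row.
counts = (m.cumsum(1) * m).sum(0)

print(m.shape)          # (4, 6): 4 unique keys, 6 input rows
print(counts.tolist())  # [1, 1, 1, 2, 3, 1]
```

The counts here start at 1 rather than 0. That is fine for the merge as long as both frames are numbered the same way.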

old answer

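    # number each occurrence of a key within its frame: 0, 1, 2, ...
    df1['cc'] = df1.groupby('key').cumcount()
    df2['cc'] = df2.groupby('key').cumcount()

    # merges on the shared columns 'key' and 'cc'
    df1.merge(df2, how='outer').drop('cc', axis=1)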
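For reference, a minimal self-contained sketch of the `cumcount` approach on the frames from the question. It differs from the snippet above in a few ways that are my own choices, not part of the answer: it passes `on=['key', 'cc']` explicitly, sorts the result so the row order does not depend on the pandas version, and uses `assign` so the input frames are not modified. The names `left`, `right`, and `merged` are illustrative, and the resulting index will not match the one shown in the question.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K2', 'K2', 'K3'],
                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K2', 'K3', 'K3', 'K4'],
                    'B': ['B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6']})

# Number each occurrence of a key within its own frame: 0, 1, 2, ...
left = df1.assign(cc=df1.groupby('key').cumcount())
right = df2.assign(cc=df2.groupby('key').cumcount())

# Outer merge on (key, occurrence number), so the n-th 'K2' on the left
# pairs with the n-th 'K2' on the right; unmatched rows get NaN.
merged = (left.merge(right, on=['key', 'cc'], how='outer')
              .sort_values(['key', 'cc'])
              .drop(columns='cc')
              .reset_index(drop=True))

print(merged)
```

The helper column `cc` plays the role of the K2_1, K2_2, ... suffixes described in the question, without altering the key values themselves.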
