pandas three-way joining multiple dataframes on columns

Question

I have 3 CSV files. Each of them has the first column as the (string) names of people, and all other columns in each data frame are attributes of this person.

How can I "join" all three CSV documents to create a single CSV in which each row has all the attributes for each unique value of the person's string name?

The join() function in pandas specifies that I need a multiindex, but I'm confused about what a hierarchical indexing scheme has to do with making a join based on a single index.

+153

python merge join pandas

lollercoaster May 15 '14 at 2:51

9 answers

You can try this if you have 3 dataframes:

    # Merge multiple dataframes
    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame(np.array([
        ['a', 5, 9],
        ['b', 4, 61],
        ['c', 24, 9]]),
        columns=['name', 'attr11', 'attr12'])
    df2 = pd.DataFrame(np.array([
        ['a', 5, 19],
        ['b', 14, 16],
        ['c', 4, 9]]),
        columns=['name', 'attr21', 'attr22'])
    df3 = pd.DataFrame(np.array([
        ['a', 15, 49],
        ['b', 4, 36],
        ['c', 14, 9]]),
        columns=['name', 'attr31', 'attr32'])

    pd.merge(pd.merge(df1, df2, on='name'), df3, on='name')

Alternatively, as cwharland mentioned:

    df1.merge(df2, on='name').merge(df3, on='name')
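Note that merge defaults to how='inner', so a name that is missing from any one frame drops out of the result. A minimal sketch of the difference, assuming small made-up frames (the column names attr1, attr2, attr3 and the values are illustrative, not from the question):

```python
import pandas as pd

# Illustrative frames: 'c' is missing from df2, 'd' appears only in df3.
df1 = pd.DataFrame({'name': ['a', 'b', 'c'], 'attr1': [5, 4, 24]})
df2 = pd.DataFrame({'name': ['a', 'b'], 'attr2': [19, 16]})
df3 = pd.DataFrame({'name': ['a', 'b', 'd'], 'attr3': [49, 36, 9]})

# Default inner join: only names present in all three frames survive.
inner = df1.merge(df2, on='name').merge(df3, on='name')

# Outer join: every name is kept; attributes that are missing become NaN.
outer = (df1.merge(df2, on='name', how='outer')
            .merge(df3, on='name', how='outer'))
```

Which one you want depends on whether every person is guaranteed to appear in every file.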

+91

Zero May 15 '14 at 7:04

This is the ideal situation for the `join` method.

The join method is built exactly for these types of situations. You can join any number of DataFrames together with it. The calling DataFrame joins with the index of the collection of passed DataFrames. To work with multiple DataFrames, you must put the joining columns in the index.

The code looks something like this:

    filenames = ['fn1', 'fn2', 'fn3', 'fn4']  # ... and so on
    # index_col is a placeholder for the column(s) to join on
    dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames]
    dfs[0].join(dfs[1:])

Using @Zero's data, you can do this:

    df1 = pd.DataFrame(np.array([
        ['a', 5, 9],
        ['b', 4, 61],
        ['c', 24, 9]]),
        columns=['name', 'attr11', 'attr12'])
    df2 = pd.DataFrame(np.array([
        ['a', 5, 19],
        ['b', 14, 16],
        ['c', 4, 9]]),
        columns=['name', 'attr21', 'attr22'])
    df3 = pd.DataFrame(np.array([
        ['a', 15, 49],
        ['b', 4, 36],
        ['c', 14, 9]]),
        columns=['name', 'attr31', 'attr32'])

    dfs = [df1, df2, df3]
    dfs = [df.set_index('name') for df in dfs]
    dfs[0].join(dfs[1:])

         attr11 attr12 attr21 attr22 attr31 attr32
    name
    a         5      9      5     19     15     49
    b         4     61     14     16      4     36
    c        24      9      4      9     14      9
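Applied to the question's CSV files, the same idea might look like the sketch below. The in-memory io.StringIO objects stand in for real file paths, and the column names (age, city, team) are made up for illustration; index_col=0 assumes the name is the first column of each file, as the question states.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for the three CSV files.
csv_a = io.StringIO("name,age\nalice,30\nbob,25\n")
csv_b = io.StringIO("name,city\nalice,Oslo\nbob,Lima\n")
csv_c = io.StringIO("name,team\nalice,red\nbob,blue\n")

# index_col=0 makes the first column (the person's name) the index.
frames = [pd.read_csv(f, index_col=0) for f in (csv_a, csv_b, csv_c)]

# The first frame is the caller; the rest are passed as a list.
combined = frames[0].join(frames[1:])

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = combined.to_csv()
```

Passing a file path to to_csv instead would write the combined CSV to disk.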

+53

Ted Petrou Nov 06 '17 at 22:04

This can also be done as follows for a list of dataframes df_list:

    df = df_list[0]
    for df_ in df_list[1:]:
        df = df.merge(df_, on='join_col_name')

or, if the dataframes are in a generator object (for example, to reduce memory consumption):

    df = next(df_list)
    for df_ in df_list:
        df = df.merge(df_, on='join_col_name')
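A runnable sketch of the generator variant, assuming the frames come from CSV sources that share a 'name' column. The helper read_frames and the in-memory sources are hypothetical, introduced here only for illustration:

```python
import io
import pandas as pd

def read_frames(sources):
    """Yield one dataframe at a time instead of loading them all up front."""
    for src in sources:
        yield pd.read_csv(src)

# Hypothetical in-memory stand-ins for CSV files sharing a 'name' column.
sources = [
    io.StringIO("name,x\na,1\nb,2\n"),
    io.StringIO("name,y\na,3\nb,4\n"),
    io.StringIO("name,z\na,5\nb,6\n"),
]

frames = read_frames(sources)
merged = next(frames)      # the first frame seeds the result
for frame in frames:       # the loop resumes from the second frame
    merged = merged.merge(frame, on='name')
```

Because next() consumes the first item, the for loop never sees it twice, which is why no slicing is needed here.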

+17

AlexG Oct 25 '16 at 10:01

In Python 3.6.3 with pandas 0.22.0, you can also use concat, as long as you set the columns you want to join on as the index:

    pd.concat(
        (iDF.set_index('name') for iDF in [df1, df2, df3]),
        axis=1, join='inner'
    ).reset_index()

where df1, df2, and df3 are defined as in John Galt's answer:

    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame(np.array([
        ['a', 5, 9],
        ['b', 4, 61],
        ['c', 24, 9]]),
        columns=['name', 'attr11', 'attr12']
    )
    df2 = pd.DataFrame(np.array([
        ['a', 5, 19],
        ['b', 14, 16],
        ['c', 4, 9]]),
        columns=['name', 'attr21', 'attr22']
    )
    df3 = pd.DataFrame(np.array([
        ['a', 15, 49],
        ['b', 4, 36],
        ['c', 14, 9]]),
        columns=['name', 'attr31', 'attr32']
    )
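One caveat about sample frames built this way: np.array coerces rows that mix strings and integers to a single string dtype, so the attribute columns end up holding strings such as '5' rather than numbers. A small sketch of the difference, with a dict-of-columns construction shown as one possible alternative (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Built via np.array: the mixed rows are coerced to a single string dtype.
from_array = pd.DataFrame(
    np.array([['a', 5, 9], ['b', 4, 61], ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])

# Built from a dict of columns: each column keeps its own dtype.
from_dict = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'attr11': [5, 4, 24],
    'attr12': [9, 61, 9],
})

first_from_array = from_array['attr11'].iloc[0]   # a string
first_from_dict = from_dict['attr11'].iloc[0]     # an integer
```

The joins themselves behave the same either way; the difference only matters once you do arithmetic on the attribute columns.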

+8

Igor Fobia Aug 09 '18 at 15:42

join does not require a MultiIndex to work. You just need to set the index column correctly before performing the join (for example, with df.set_index('Name')).

The join operation is performed on the index by default. In your case, you just have to specify that the Name column corresponds to your index. Below is an example.

A tutorial can be helpful.

    # Simple example where the dataframes' index is the name on which to perform the join
    import pandas as pd
    import numpy as np

    name = ['Sophia', 'Emma', 'Isabella', 'Olivia', 'Ava', 'Emily', 'Abigail', 'Mia']
    df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A', 'B', 'C'], index=name)
    df2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=name)
    df3 = pd.DataFrame(np.random.randn(8, 2), columns=['E', 'F'], index=name)
    df = df1.join(df2)
    df = df.join(df3)

    # If you have a 'Name' column that is not the index of your dataframe,
    # you can set this column to be the index
    # 1) Create a column 'Name' based on the previous index
    df1['Name'] = df1.index
    # 2) Set the index from column 'Name'
    df1 = df1.set_index('Name')

    # If the indexes differ, you may have to play with the parameter how
    gf1 = pd.DataFrame(np.random.randn(8, 3), columns=['A', 'B', 'C'], index=range(8))
    gf2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=range(2, 10))
    gf3 = pd.DataFrame(np.random.randn(8, 2), columns=['E', 'F'], index=range(4, 12))
    gf = gf1.join(gf2, how='outer')
    gf = gf.join(gf3, how='outer')

+4

Guillaume Jacquenot May 15 '14 at 7:26

Here is a method for merging a dictionary of dataframes while keeping the column names in sync with the dictionary keys. It also fills in missing values if needed.

This is the function that merges a dict of dataframes:

    def MergeDfDict(dfDict, onCols, how='outer', naFill=None):
        # list() so that the keys can be indexed on Python 3
        keys = list(dfDict.keys())
        for i in range(len(keys)):
            key = keys[i]
            df0 = dfDict[key]
            cols = list(df0.columns)
            valueCols = list(filter(lambda x: x not in (onCols), cols))
            df0 = df0[onCols + valueCols]
            # suffix each value column with its dictionary key
            df0.columns = onCols + [(s + '_' + key) for s in valueCols]
            if (i == 0):
                outDf = df0
            else:
                outDf = pd.merge(outDf, df0, how=how, on=onCols)
        if (naFill is not None):
            outDf = outDf.fillna(naFill)
        return(outDf)

OK, let's generate some data and test this:

    def GenDf(size):
        df = pd.DataFrame({
            'categ1': np.random.choice(a=['a', 'b', 'c', 'd', 'e'], size=size, replace=True),
            'categ2': np.random.choice(a=['A', 'B'], size=size, replace=True),
            'col1': np.random.uniform(low=0.0, high=100.0, size=size),
            'col2': np.random.uniform(low=0.0, high=100.0, size=size)
        })
        df = df.sort_values(['categ2', 'categ1', 'col1', 'col2'])
        return(df)

    size = 5
    dfDict = {'US': GenDf(size), 'IN': GenDf(size), 'GER': GenDf(size)}
    MergeDfDict(dfDict=dfDict, onCols=['categ1', 'categ2'], how='outer', naFill=0)

+4

rz1317 Apr 18 '17 at 22:07

There is another solution from the pandas documentation (which I do not see here), using .append:

    >>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
    >>> df
       A  B
    0  1  2
    1  3  4
    >>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
    >>> df2
       A  B
    0  5  6
    1  7  8
    >>> df.append(df2, ignore_index=True)
       A  B
    0  1  2
    1  3  4
    2  5  6
    3  7  8

ignore_index=True is used to ignore the index of the appended dataframe, replacing it with the next index available in the source one.

If the column names differ, NaN will be introduced.
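DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same row-wise stacking is written with pd.concat. Keep in mind that this stacks rows on top of each other; it does not line up attributes by name the way merge and join do. A minimal sketch using the same two frames:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# Row-wise stacking; ignore_index=True renumbers the rows from 0.
stacked = pd.concat([df, df2], ignore_index=True)
```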

+2

Sylhare Apr 05 '18 at 15:15

A simple solution:

If the column names are the same:

    df1.merge(df2, on='col_name').merge(df3, on='col_name')

If the column names are different:

    (df1.merge(df2, left_on='col_name1', right_on='col_name2')
        .merge(df3, left_on='col_name1', right_on='col_name3')
        .drop(columns=['col_name2', 'col_name3'])
        .rename(columns={'col_name1': 'col_name'}))
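A runnable sketch of the different-names case, assuming three made-up frames whose key columns are called person, resident, and member (all names and values here are illustrative):

```python
import pandas as pd

# Illustrative frames whose key columns have different names.
people = pd.DataFrame({'person': ['a', 'b'], 'age': [30, 25]})
cities = pd.DataFrame({'resident': ['a', 'b'], 'city': ['Oslo', 'Lima']})
teams = pd.DataFrame({'member': ['a', 'b'], 'team': ['red', 'blue']})

combined = (
    people
    .merge(cities, left_on='person', right_on='resident')
    .merge(teams, left_on='person', right_on='member')
    .drop(columns=['resident', 'member'])   # redundant copies of the key
    .rename(columns={'person': 'name'})
)
```

With left_on/right_on, both key columns are kept in the result, which is why the drop step is needed.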

0

Gil Baggio May 14 '19 at 9:30

Kit · Accepted Answer · 2015-05-28 17:08

Assumed imports:

 import pandas as pd

John Galt's answer is basically a reduce operation. If I had more than a handful of dataframes, I'd put them in a list like this (generated via list comprehensions, loops, or whatnot):

 dfs = [df0, df1, df2, dfN]

Assuming they have some kind of common column, like name in your example, I would do the following:

 df_final = reduce(lambda left,right: pd.merge(left,right,on='name'), dfs)

Thus, your code should work with any number of data frames that you want to combine.

Edit August 1, 2016: For those using Python 3: reduce has been moved into functools. So to use this function, you'll first need to import it from that module:

 from functools import reduce
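Putting the pieces together, a self-contained sketch might look like this. The three frames and the helper merge_on_name are illustrative assumptions; any list of dataframes sharing a 'name' column would do:

```python
from functools import reduce

import pandas as pd

# Illustrative frames sharing a 'name' column.
dfs = [
    pd.DataFrame({'name': ['a', 'b', 'c'], 'attr1': [5, 4, 24]}),
    pd.DataFrame({'name': ['a', 'b', 'c'], 'attr2': [19, 16, 9]}),
    pd.DataFrame({'name': ['a', 'b', 'c'], 'attr3': [49, 36, 9]}),
]

def merge_on_name(left, right):
    """Merge two frames on the shared 'name' column (inner join by default)."""
    return pd.merge(left, right, on='name')

# reduce folds the list pairwise: ((df0 + df1) + df2) + ...
df_final = reduce(merge_on_name, dfs)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df_final.to_csv(index=False)
```

A named function is used here instead of a lambda only for readability; the behavior is the same.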
