pandas three-way joining multiple dataframes on columns

I have 3 CSV files. Each of them has the first column as the (string) names of people, and all other columns in each data frame are attributes of this person.

How can I "combine" all three CSV documents to create a single CSV in which each row has all the attributes for each unique person name?

The join() function in pandas says that I need a multi-index, but I'm confused about what a hierarchical indexing scheme has to do with making a join based on a single index.

+153
python merge join pandas
May 15 '14 at 2:51
9 answers

Assumed imports:

 import pandas as pd 

John Galt's answer is basically a reduce operation. If I had more than a handful of dataframes, I would put them in a list like this (generated via list comprehensions, loops, or whatnot):

 dfs = [df0, df1, df2, dfN] 

Assuming they have some kind of common column, like name in your example, I would do the following:

 df_final = reduce(lambda left,right: pd.merge(left,right,on='name'), dfs) 

Thus, your code should work with any number of data frames that you want to combine.

Edit August 1, 2016: For those using Python 3: reduce has been moved into functools. So to use this function, you first need to import it from that module:

 from functools import reduce 
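To make the reduce approach concrete, here is a minimal runnable sketch. The three sample frames and their column names (age, city, score) are made up for illustration; only the shared 'name' column comes from the question.

```python
from functools import reduce

import pandas as pd

# Hypothetical sample frames that all share a 'name' column.
df0 = pd.DataFrame({'name': ['a', 'b', 'c'], 'age': [30, 41, 25]})
df1 = pd.DataFrame({'name': ['a', 'b', 'c'], 'city': ['Oslo', 'Lima', 'Pune']})
df2 = pd.DataFrame({'name': ['a', 'b', 'c'], 'score': [0.5, 0.7, 0.9]})

dfs = [df0, df1, df2]

# reduce folds the list pairwise: (df0 merge df1) merge df2
df_final = reduce(lambda left, right: pd.merge(left, right, on='name'), dfs)
print(df_final)
```

Because reduce only ever sees two frames at a time, the same line works unchanged if the list grows.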
+405
May 28 '15 at 17:08

You can try this if you have 3 dataframes:

 # Merge multiple dataframes
 import numpy as np
 import pandas as pd

 df1 = pd.DataFrame(np.array([
     ['a', 5, 9],
     ['b', 4, 61],
     ['c', 24, 9]]),
     columns=['name', 'attr11', 'attr12'])
 df2 = pd.DataFrame(np.array([
     ['a', 5, 19],
     ['b', 14, 16],
     ['c', 4, 9]]),
     columns=['name', 'attr21', 'attr22'])
 df3 = pd.DataFrame(np.array([
     ['a', 15, 49],
     ['b', 4, 36],
     ['c', 14, 9]]),
     columns=['name', 'attr31', 'attr32'])

 pd.merge(pd.merge(df1, df2, on='name'), df3, on='name')

Alternatively, as cwharland mentioned:

 df1.merge(df2,on='name').merge(df3,on='name') 
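Both forms above rely on merge's default how='inner', which keeps only names present in every frame. If the goal is one row per unique name even when a person is missing from one of the files, how='outer' may be closer to what the question asks for. A small sketch with made-up data (the attr1, attr2, attr3 columns are hypothetical):

```python
import pandas as pd

# Hypothetical frames where no single name appears in all three.
df1 = pd.DataFrame({'name': ['a', 'b'], 'attr1': [1, 2]})
df2 = pd.DataFrame({'name': ['b', 'c'], 'attr2': [3, 4]})
df3 = pd.DataFrame({'name': ['a', 'c'], 'attr3': [5, 6]})

# Default inner merge: only names found in every frame survive.
inner = df1.merge(df2, on='name').merge(df3, on='name')

# Outer merge: every name is kept, missing attributes become NaN.
outer = (df1.merge(df2, on='name', how='outer')
            .merge(df3, on='name', how='outer'))

print(len(inner))
print(outer)
```

Here the inner result is empty, while the outer result has one row each for a, b and c with NaN where an attribute is unknown.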
+91
May 15 '14 at 7:04

This is an ideal situation for the join method.

The join method is built exactly for these types of situations. You can join any number of DataFrames together with it. The calling DataFrame is joined on its index with the indexes of the DataFrames passed in. To work with multiple DataFrames, you must put the joining columns in the index.

The code looks something like this:

 filenames = ['fn1', 'fn2', 'fn3', 'fn4', ....]
 dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames]
 dfs[0].join(dfs[1:])

Using @zero's data, you can do this:

 df1 = pd.DataFrame(np.array([
     ['a', 5, 9],
     ['b', 4, 61],
     ['c', 24, 9]]),
     columns=['name', 'attr11', 'attr12'])
 df2 = pd.DataFrame(np.array([
     ['a', 5, 19],
     ['b', 14, 16],
     ['c', 4, 9]]),
     columns=['name', 'attr21', 'attr22'])
 df3 = pd.DataFrame(np.array([
     ['a', 15, 49],
     ['b', 4, 36],
     ['c', 14, 9]]),
     columns=['name', 'attr31', 'attr32'])

 dfs = [df1, df2, df3]
 dfs = [df.set_index('name') for df in dfs]
 dfs[0].join(dfs[1:])

      attr11 attr12 attr21 attr22 attr31 attr32
 name
 a         5      9      5     19     15     49
 b         4     61     14     16      4     36
 c        24      9      4      9     14      9
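Since the question starts from CSV files and wants a CSV back, the whole round trip might look like the sketch below. It uses in-memory strings via io.StringIO as stand-ins for real files, so the CSV contents here are assumptions that mirror the sample data; with real files you would pass paths to read_csv and to_csv instead.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for three files on disk.
csv_texts = [
    "name,attr11,attr12\na,5,9\nb,4,61\nc,24,9\n",
    "name,attr21,attr22\na,5,19\nb,14,16\nc,4,9\n",
    "name,attr31,attr32\na,15,49\nb,4,36\nc,14,9\n",
]

# index_col=0 makes the first column (the names) the index of each frame.
dfs = [pd.read_csv(io.StringIO(text), index_col=0) for text in csv_texts]

# Join the remaining frames onto the first one by index.
combined = dfs[0].join(dfs[1:])

# With no path argument, to_csv returns the CSV as a string.
csv_out = combined.to_csv()
print(csv_out)
```

Reading with index_col=0 avoids the separate set_index step, and the index is written back as the first column of the output.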
+53
Nov 06 '17 at 22:04

This can also be done as follows for a list of dataframes df_list:

 df = df_list[0]
 for df_ in df_list[1:]:
     df = df.merge(df_, on='join_col_name')

or, if the dataframes are in a generator object (e.g. to reduce memory consumption):

 df = next(df_list)
 for df_ in df_list:
     df = df.merge(df_, on='join_col_name')
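As an illustration of the generator variant, here is a self-contained sketch. The frame_source generator and its x, y, z columns are hypothetical; in practice it could yield one pd.read_csv result per file so that only the running merge and the current frame are held in memory.

```python
import pandas as pd


def frame_source():
    """Hypothetical generator yielding one dataframe at a time."""
    yield pd.DataFrame({'join_col_name': [1, 2, 3], 'x': [10, 20, 30]})
    yield pd.DataFrame({'join_col_name': [1, 2, 3], 'y': [0.1, 0.2, 0.3]})
    yield pd.DataFrame({'join_col_name': [1, 2, 3], 'z': ['p', 'q', 'r']})


df_iter = frame_source()
df = next(df_iter)       # consume the first frame as the starting point
for df_ in df_iter:      # the loop continues from the second frame
    df = df.merge(df_, on='join_col_name')

print(df)
```

Note that next() only works on an iterator; for a plain list, the slicing version above it is the one to use.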
+17
Oct 25 '16 at 10:01

In Python 3.6.3 with pandas 0.22.0 you can also use concat, as long as you set the columns you want to join on as the index:

 pd.concat(
     (iDF.set_index('name') for iDF in [df1, df2, df3]),
     axis=1, join='inner'
 ).reset_index()

where df1, df2 and df3 are defined as in John Galt's answer:

 import numpy as np
 import pandas as pd

 df1 = pd.DataFrame(np.array([
     ['a', 5, 9],
     ['b', 4, 61],
     ['c', 24, 9]]),
     columns=['name', 'attr11', 'attr12']
 )
 df2 = pd.DataFrame(np.array([
     ['a', 5, 19],
     ['b', 14, 16],
     ['c', 4, 9]]),
     columns=['name', 'attr21', 'attr22']
 )
 df3 = pd.DataFrame(np.array([
     ['a', 15, 49],
     ['b', 4, 36],
     ['c', 14, 9]]),
     columns=['name', 'attr31', 'attr32']
 )
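One thing to be aware of with this sample data: np.array on rows that mix strings and integers produces an all-string array, so the attr columns end up holding strings rather than numbers. The sketch below shows the difference and one possible fix; the names via_array, via_dict and fixed are only illustrative.

```python
import numpy as np
import pandas as pd

# Mixed rows passed through np.array: every value is coerced to a string.
via_array = pd.DataFrame(
    np.array([['a', 5, 9], ['b', 4, 61]]),
    columns=['name', 'attr11', 'attr12'])

# Building from a dict of columns keeps the integers as integers.
via_dict = pd.DataFrame(
    {'name': ['a', 'b'], 'attr11': [5, 4], 'attr12': [9, 61]})

# If the string columns already exist, pd.to_numeric converts them back.
fixed = pd.to_numeric(via_array['attr11'])

print(via_array['attr11'].tolist())
print(via_dict['attr11'].tolist())
print(fixed.tolist())
```

This does not affect the join itself, since the 'name' key is a string either way, but it matters as soon as you do arithmetic on the merged attributes.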
+8
Aug 09 '18 at 15:42

Join does not require a multi-index to work. You just need to set the index column correctly before performing the join (for example, with df.set_index('Name')).

The join operation works on the index by default. In your case, you just need to make the Name column your index.

A tutorial can be helpful. Below is an example:

 # Simple example where the dataframes' index is the name on which to
 # perform the join operations
 import pandas as pd
 import numpy as np

 name = ['Sophia', 'Emma', 'Isabella', 'Olivia', 'Ava', 'Emily', 'Abigail', 'Mia']
 df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A', 'B', 'C'], index=name)
 df2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=name)
 df3 = pd.DataFrame(np.random.randn(8, 2), columns=['E', 'F'], index=name)
 df = df1.join(df2)
 df = df.join(df3)

 # If you have a 'Name' column that is not the index of your dataframe,
 # you can set this column to be the index
 # 1) Create a column 'Name' based on the previous index
 df1['Name'] = df1.index
 # 2) Set the index from column 'Name'
 df1 = df1.set_index('Name')

 # If the indexes are different, you may have to play with the parameter how
 gf1 = pd.DataFrame(np.random.randn(8, 3), columns=['A', 'B', 'C'], index=range(8))
 gf2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=range(2, 10))
 gf3 = pd.DataFrame(np.random.randn(8, 2), columns=['E', 'F'], index=range(4, 12))

 gf = gf1.join(gf2, how='outer')
 gf = gf.join(gf3, how='outer')
+4
May 15 '14 at 7:26

Here is a method to merge a dictionary of data frames while keeping the column names in sync with the dictionary keys. It also fills in missing values if needed.

This is the function that merges a dict of data frames:

 def MergeDfDict(dfDict, onCols, how='outer', naFill=None):
     # dict views cannot be indexed in Python 3, so materialize the keys
     keys = list(dfDict.keys())
     for i in range(len(keys)):
         key = keys[i]
         df0 = dfDict[key]
         cols = list(df0.columns)
         valueCols = list(filter(lambda x: x not in (onCols), cols))
         df0 = df0[onCols + valueCols]
         # suffix every value column with the dict key it came from
         df0.columns = onCols + [(s + '_' + key) for s in valueCols]

         if (i == 0):
             outDf = df0
         else:
             outDf = pd.merge(outDf, df0, how=how, on=onCols)

     if (naFill is not None):
         outDf = outDf.fillna(naFill)

     return(outDf)

OK, let's generate data and test this:

 def GenDf(size):
     df = pd.DataFrame({
         'categ1': np.random.choice(a=['a', 'b', 'c', 'd', 'e'], size=size, replace=True),
         'categ2': np.random.choice(a=['A', 'B'], size=size, replace=True),
         'col1': np.random.uniform(low=0.0, high=100.0, size=size),
         'col2': np.random.uniform(low=0.0, high=100.0, size=size)
     })
     df = df.sort_values(['categ2', 'categ1', 'col1', 'col2'])
     return(df)

 size = 5
 dfDict = {'US': GenDf(size), 'IN': GenDf(size), 'GER': GenDf(size)}
 MergeDfDict(dfDict=dfDict, onCols=['categ1', 'categ2'], how='outer', naFill=0)
+4
Apr 18 '17 at 22:07

There is another solution from the pandas documentation (which I do not see here), using .append:

 >>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
 >>> df
    A  B
 0  1  2
 1  3  4
 >>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
 >>> df2
    A  B
 0  5  6
 1  7  8
 >>> df.append(df2, ignore_index=True)
    A  B
 0  1  2
 1  3  4
 2  5  6
 3  7  8

ignore_index=True is used to ignore the index of the appended dataframe, replacing it with the next index available in the source one.

If the column names differ, NaN will be introduced.
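Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same row-wise stacking would be written with pd.concat, roughly as sketched below. Also keep in mind that this stacks rows on top of each other; it does not match rows on a key column the way merge and join do.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# Row-wise stacking; ignore_index=True renumbers the result from 0.
stacked = pd.concat([df, df2], ignore_index=True)
print(stacked)
```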

+2
Apr 05 '18 at 15:15

A simple solution:

If the column names are the same:

  df1.merge(df2,on='col_name').merge(df3,on='col_name') 

If the column names are different:

 (df1.merge(df2, left_on='col_name1', right_on='col_name2')
     .merge(df3, left_on='col_name1', right_on='col_name3')
     .drop(columns=['col_name2', 'col_name3'])
     .rename(columns={'col_name1': 'col_name'}))
0
May 14 '19 at 9:30


