Python: pandas merge multiple data frames

I have several different data frames and I need to combine them based on a date column. If I only had two data frames, I could use df1.merge(df2, on='date'). With three data frames, I would use df1.merge(df2.merge(df3, on='date'), on='date'); however, it becomes really complex and unreadable to do this with many data frames.

All data frames share one common column, date, but they do not have the same number of rows or columns, and I need only those rows whose date is common to every data frame.

So I tried to write a recursive function that returns a single data frame with all the data, but it didn't work. How can I combine several data frames?

I tried many approaches and got errors like out of range, KeyError 0/1/2/3 and can not merge DataFrame with instance of type <class 'NoneType'>.

This is the script I wrote:

    dfs = [df1, df2, df3]  # list of dataframes

    def mergefiles(dfs, countfiles, i=0):
        if i == (countfiles - 2):  # it gets to the second to last and merges it with the last
            return

        dfm = dfs[i].merge(mergefiles(dfs[i+1], countfiles, i=i+1), on='date')
        return dfm

    print(mergefiles(dfs, len(dfs)))

Example: df_1:

    May 19, 2017;1,200.00;0.1%
    May 18, 2017;1,100.00;0.1%
    May 17, 2017;1,000.00;0.1%
    May 15, 2017;1,901.00;0.1%

df_2:

    May 20, 2017;2,200.00;1000000;0.2%
    May 18, 2017;2,100.00;1590000;0.2%
    May 16, 2017;2,000.00;1230000;0.2%
    May 15, 2017;2,902.00;1000000;0.2%

df_3:

    May 21, 2017;3,200.00;2000000;0.3%
    May 17, 2017;3,100.00;2590000;0.3%
    May 16, 2017;3,000.00;2230000;0.3%
    May 15, 2017;3,903.00;2000000;0.3%

Expected Merger Result:

 May 15, 2017; 1,901.00;0.1%; 2,902.00;1000000;0.2%; 3,903.00;2000000;0.3% 
+37
8 answers

Below is the easiest and most intuitive way to merge multiple data frames when no complex queries are involved.

Simply merge on DATE and use the OUTER method, so that all the data is kept.

    import pandas as pd
    from functools import reduce

    df1 = pd.read_table('file1.csv', sep=',')
    df2 = pd.read_table('file2.csv', sep=',')
    df3 = pd.read_table('file3.csv', sep=',')

So, basically, load all the files you have as data frames. Then merge them using the merge function together with reduce.

    # compile the list of dataframes you want to merge
    data_frames = [df1, df2, df3]

You can add as many data frames as you want to the list above. That is the good part of this method: no complicated queries.

To keep the values that belong to the same date on the same row, you need to merge on DATE:

    df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'], how='outer'),
                       data_frames)

    # if some dates are missing from some of the frames, you can fill the resulting
    # gaps in the merged dataframe with whatever placeholder you need
    df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'], how='outer'),
                       data_frames).fillna('void')
  • This way, values from the same date end up on the same row.
  • You can fill the values that are missing for some frames in some columns using fillna().

Then write the combined data to a CSV file, if necessary.

    df_merged.to_csv('merged.txt', sep=',', na_rep='.', index=False)

That should give you

DATE VALUE1 VALUE2 VALUE3....
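
For instance, here is a minimal, self-contained sketch of this approach using made-up frames and the illustrative column names DATE, VALUE1, VALUE2, VALUE3 from above (the data is not from the question):

    import pandas as pd
    from functools import reduce

    # three small frames that share only some dates (illustrative data)
    df1 = pd.DataFrame({'DATE': ['2017-05-15', '2017-05-17'], 'VALUE1': [1901.0, 1000.0]})
    df2 = pd.DataFrame({'DATE': ['2017-05-15', '2017-05-16'], 'VALUE2': [2902.0, 2000.0]})
    df3 = pd.DataFrame({'DATE': ['2017-05-15', '2017-05-21'], 'VALUE3': [3903.0, 3200.0]})

    data_frames = [df1, df2, df3]

    # outer merge keeps every date; use how='inner' to keep only dates common to all frames
    df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'], how='outer'),
                       data_frames)
    print(df_merged)
    #          DATE  VALUE1  VALUE2  VALUE3
    # 0  2017-05-15  1901.0  2902.0  3903.0
    # 1  2017-05-17  1000.0     NaN     NaN
    # 2  2017-05-16     NaN  2000.0     NaN
    # 3  2017-05-21     NaN     NaN  3200.0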

+49

It looks like the data has the same columns, so you can:

    df1 = pd.DataFrame(data1)
    df2 = pd.DataFrame(data2)

    merged_df = pd.concat([df1, df2])
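
For illustration, a minimal sketch with made-up single-row frames showing what this does: with identical columns, pd.concat stacks the rows, and ignore_index=True gives a fresh index:

    import pandas as pd

    df1 = pd.DataFrame({'date': ['May 15, 2017'], 'value': [1901.0]})
    df2 = pd.DataFrame({'date': ['May 15, 2017'], 'value': [2902.0]})

    # rows are stacked vertically; this does NOT align the two frames on 'date'
    merged_df = pd.concat([df1, df2], ignore_index=True)
    print(merged_df)
    #            date   value
    # 0  May 15, 2017  1901.0
    # 1  May 15, 2017  2902.0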
+9

functools.reduce and pd.concat are good solutions, but in terms of runtime pd.concat is the best.

    from functools import reduce
    import pandas as pd

    dfs = [df1, df2, df3, ...]
    nan_value = 0

    # solution 1 (fast)
    result_1 = pd.concat(dfs, join='outer', axis=1).fillna(nan_value)

    # solution 2
    result_2 = reduce(lambda left, right: pd.merge(left, right,
                                                   left_index=True, right_index=True,
                                                   how='outer'),
                      dfs).fillna(nan_value)
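
If you want to check the runtime claim on your own data, here is a rough, self-contained sketch with timeit on synthetic date-indexed frames (frame sizes and column names are made up, and results will vary):

    import timeit
    from functools import reduce

    import numpy as np
    import pandas as pd

    # a few synthetic frames indexed by date (both approaches above join on the index)
    idx = pd.date_range('2017-01-01', periods=10_000, freq='D')
    dfs = [pd.DataFrame({f'col{i}': np.random.rand(len(idx))}, index=idx) for i in range(5)]

    t_concat = timeit.timeit(lambda: pd.concat(dfs, join='outer', axis=1), number=10)
    t_reduce = timeit.timeit(
        lambda: reduce(lambda left, right: pd.merge(left, right,
                                                    left_index=True, right_index=True,
                                                    how='outer'), dfs),
        number=10)
    print(f'concat: {t_concat:.3f}s  reduce+merge: {t_reduce:.3f}s')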
+5

There are 2 solutions for this, but they return all columns separately:

    import functools

    import numpy as np
    import pandas as pd

    dfs = [df1, df2, df3]

    df_final = functools.reduce(lambda left, right: pd.merge(left, right, on='date'), dfs)
    print(df_final)

              date     a_x   b_x       a_y      b_y   c_x         a        b   c_y
    0  May 15,2017  900.00  0.2%  1,900.00  1000000  0.2%  2,900.00  2000000  0.2%

    k = np.arange(len(dfs)).astype(str)
    df = pd.concat([x.set_index('date') for x in dfs], axis=1, join='inner', keys=k)
    df.columns = df.columns.map('_'.join)
    print(df)

                    0_a   0_b       1_a      1_b   1_c       2_a      2_b   2_c
    date
    May 15,2017  900.00  0.2%  1,900.00  1000000  0.2%  2,900.00  2000000  0.2%
+3

The pd.concat answer above is correct: pd.concat naturally joins on the index columns if you set the axis option to 1. By default an outer join is used, but you can also specify an inner join. Here is an example:

    import pandas as pd

    x = pd.DataFrame({'a':    [2, 4, 3, 4, 5, 2, 3, 4, 2, 5],
                      'b':    [2, 3, 4, 1, 6, 6, 5, 2, 4, 2],
                      'val':  [1, 4, 4, 3, 6, 4, 3, 6, 5, 7],
                      'val2': [2, 4, 1, 6, 4, 2, 8, 6, 3, 9]})
    x.set_index(['a', 'b'], inplace=True)
    x.sort_index(inplace=True)

    y = x.__deepcopy__()
    y.loc[(14, 14), :] = [3, 1]
    y['other'] = range(0, 11)
    y.sort_values('val', inplace=True)

    z = x.__deepcopy__()
    z.loc[(15, 15), :] = [3, 4]
    z['another'] = range(0, 22, 2)
    z.sort_values('val2', inplace=True)

    pd.concat([x, y, z], axis=1)
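
To make the outer vs. inner difference concrete, here is a tiny self-contained sketch (the frames a and b and their labels are made up for illustration):

    import pandas as pd

    a = pd.DataFrame({'val': [1, 2, 3]}, index=['r1', 'r2', 'r3'])
    b = pd.DataFrame({'other': [10, 20]}, index=['r2', 'r3'])

    print(pd.concat([a, b], axis=1))                # outer: keeps r1, with NaN in 'other'
    print(pd.concat([a, b], axis=1, join='inner'))  # inner: keeps only r2 and r3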
+3

If you are filtering by the dates common to all the data frames, this will return them:

    dfs = [df1, df2, df3]

    # assumes the date sits in the first (unnamed) column, i.e. column label 0
    checker = dfs[-1]
    check = set(checker.loc[:, 0])

    for df in dfs[:-1]:
        check = check.intersection(set(df.loc[:, 0]))

    print(checker[checker.loc[:, 0].isin(check)])
+1

Thanks for your help @jezrael, @zipa and @everestial007; your answers are exactly what I needed. If I wanted to make it recursive, this would also work as intended:

    def mergefiles(dfs=[], on=''):
        """Merge a list of dataframes based on one column"""
        if len(dfs) == 1:
            return "List only has one element."
        elif len(dfs) == 2:
            df1 = dfs[0]
            df2 = dfs[1]
            df = df1.merge(df2, on=on)
            return df

        # Merge the first and second dataframes into a new dataframe
        df1 = dfs[0]
        df2 = dfs[1]
        df = dfs[0].merge(dfs[1], on=on)

        # Create a new list with the merged dataframe
        dfl = []
        dfl.append(df)

        # Join the lists
        dfl = dfl + dfs[2:]
        dfm = mergefiles(dfl, on)
        return dfm
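
For example, assuming the question's frames and that the shared column is literally named date, the call would be:

    dfs = [df1, df2, df3]
    merged = mergefiles(dfs, on='date')  # merge defaults to an inner join, so only common dates remain
    print(merged)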
0

Also have a look at this pandas recipe for three-way joining of multiple data frames on columns:

    filenames = ['fn1', 'fn2', 'fn3', 'fn4', ....]

    # index_col should point at the date column, so the frames are joined on it
    dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames]
    dfs[0].join(dfs[1:])
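
A small self-contained sketch of the same idea with in-memory frames (the data and column names are made up); join accepts a list of frames and is a left join on the index by default:

    import pandas as pd

    left = pd.DataFrame({'v1': [1, 2]},
                        index=pd.Index(['2017-05-15', '2017-05-16'], name='date'))
    right1 = pd.DataFrame({'v2': [10, 20]},
                          index=pd.Index(['2017-05-15', '2017-05-17'], name='date'))
    right2 = pd.DataFrame({'v3': [100]},
                          index=pd.Index(['2017-05-15'], name='date'))

    # left join on the index of 'left'
    print(left.join([right1, right2]))
    #             v1    v2     v3
    # date
    # 2017-05-15   1  10.0  100.0
    # 2017-05-16   2   NaN    NaN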
0
