Pandas: combine multiple data frames and control column names?

I would like to merge nine Pandas data frames into a single data frame by joining on two columns, while controlling the resulting column names. Is this possible?

I have nine data sets. They all have the following columns:

org, name, items, spend

I want to combine them into a single data block with the following columns:

 org, name, items_df1, spend_df1, items_df2, spend_df2, items_df3... 

I read the merge and join documentation. Currently, I can combine two of the datasets as follows:

    aggregate_data = pd.merge(df_presents, df_trees, on=['org', 'name'],
                              suffixes=['_presents', '_trees'])

This works fine; print(list(aggregate_data.columns.values)) shows me the following columns:

    [u'org', u'name', u'spend_presents', u'items_presents', u'spend_trees', u'items_trees', ...]

But how can I do this for nine data frames? merge only seems to take two at a time, and if I chain them one pair at a time, my column names will get very messy.

3 answers

You can use functools.reduce to iteratively apply pd.merge to each of the DataFrames:

 result = functools.reduce(merge, dfs) 

This is equivalent to:

    result = dfs[0]
    for df in dfs[1:]:
        result = merge(result, df)

To pass the argument on=['org', 'name'], you can use functools.partial to define a merge function:

 merge = functools.partial(pd.merge, on=['org', 'name']) 
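
With that partial in place, merge(left, right) is just shorthand for spelling out the on keyword on every call. For example (assuming the nine frames live in a list called dfs, as in the full example further below):

    # merge(a, b) is now equivalent to pd.merge(a, b, on=['org', 'name'])
    first_two = merge(dfs[0], dfs[1])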

Since the suffixes parameter in functools.partial would allow only one fixed choice of suffix, and since we need a different suffix for each pd.merge, I think it is easier to rename the items and spend columns of each DataFrame before calling pd.merge:

    for i, df in enumerate(dfs, start=1):
        df.rename(columns={col: '{}_df{}'.format(col, i) for col in ('items', 'spend')},
                  inplace=True)

For instance,

    import pandas as pd
    import numpy as np
    import functools

    np.random.seed(2015)
    N = 50
    dfs = [pd.DataFrame(np.random.randint(5, size=(N, 4)),
                        columns=['org', 'name', 'items', 'spend'])
           for i in range(9)]

    for i, df in enumerate(dfs, start=1):
        df.rename(columns={col: '{}_df{}'.format(col, i) for col in ('items', 'spend')},
                  inplace=True)

    merge = functools.partial(pd.merge, on=['org', 'name'])
    result = functools.reduce(merge, dfs)
    print(result.head())

gives

       org  name  items_df1  spend_df1  items_df2  spend_df2  items_df3  \
    0    2     4          4          2          3          0          1
    1    2     4          4          2          3          0          1
    2    2     4          4          2          3          0          1
    3    2     4          4          2          3          0          1
    4    2     4          4          2          3          0          1

       spend_df3  items_df4  spend_df4  items_df5  spend_df5  items_df6  \
    0          3          1          0          1          0          4
    1          3          1          0          1          0          4
    2          3          1          0          1          0          4
    3          3          1          0          1          0          4
    4          3          1          0          1          0          4

       spend_df6  items_df7  spend_df7  items_df8  spend_df8  items_df9  spend_df9
    0          3          4          1          3          0          1          2
    1          3          4          1          3          0          0          3
    2          3          4          1          3          0          0          0
    3          3          3          1          3          0          1          2
    4          3          3          1          3          0          0          3

Would a big pd.concat() followed by renaming all the columns work for you? Something like:

    desired_columns = ['items', 'spend']
    big_df = pd.concat([df1, df2[desired_columns], ..., dfN[desired_columns]], axis=1)

    new_columns = ['org', 'name']
    for i in range(num_dataframes):
        new_columns.extend(['items_df%i' % i, 'spend_df%i' % i])
    big_df.columns = new_columns

This should give you columns like:

org, name, items_df0, spend_df0, items_df1, spend_df1, ..., items_df8, spend_df8
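
Here is a minimal runnable sketch of this approach using three small hypothetical frames (df1, df2, df3 with made-up values). Note that pd.concat(axis=1) aligns on the row index rather than joining on keys, so this assumes every frame lists the same org/name pairs in the same row order:

    import pandas as pd

    # Hypothetical stand-ins for the real data sets
    df1 = pd.DataFrame({'org': [1, 2], 'name': ['a', 'b'], 'items': [3, 4], 'spend': [10, 20]})
    df2 = pd.DataFrame({'org': [1, 2], 'name': ['a', 'b'], 'items': [5, 6], 'spend': [30, 40]})
    df3 = pd.DataFrame({'org': [1, 2], 'name': ['a', 'b'], 'items': [7, 8], 'spend': [50, 60]})
    frames = [df1, df2, df3]

    desired_columns = ['items', 'spend']
    # Keep org/name from the first frame, take only items/spend from the rest
    big_df = pd.concat([frames[0]] + [df[desired_columns] for df in frames[1:]], axis=1)

    new_columns = ['org', 'name']
    for i in range(len(frames)):
        new_columns.extend(['items_df%i' % i, 'spend_df%i' % i])
    big_df.columns = new_columns
    print(big_df)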


I wanted this too, but couldn't find a built-in pandas way to do it. Here is my suggestion (and my plan for the next time I need it):

  • Create an empty merge_dict dictionary.
  • Loop over the rows of each of your data frames and add the desired values to the dictionary, with the join index as the key.
  • Create the new index as sorted(merge_dict).
  • Create a new list of data for each column by going through merge_dict.items().
  • Create a new data frame with index=sorted(merge_dict) and the columns built in the previous step.

Basically, this is like a hash join in SQL. It is the most efficient approach I can think of, and it should not take too long. A rough sketch of this plan follows.
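
This sketch is my own illustration of the plan above; the hash_combine name and the assumption of org/name/items/spend columns are not from the original answer:

    import pandas as pd

    def hash_combine(dfs, keys=('org', 'name'), value_cols=('items', 'spend')):
        # Step 1: an empty dictionary keyed by the join index
        merge_dict = {}
        # Step 2: loop over each frame and file its values under the key
        for i, df in enumerate(dfs, start=1):
            for _, row in df.iterrows():
                key = tuple(row[k] for k in keys)
                entry = merge_dict.setdefault(key, {})
                for col in value_cols:
                    entry['%s_df%i' % (col, i)] = row[col]
        # Steps 3-5: sorted index, one value per column, then a new data frame
        # (pandas fills NaN where a key is missing from some frame)
        index = sorted(merge_dict)
        result = pd.DataFrame([merge_dict[k] for k in index],
                              index=pd.MultiIndex.from_tuples(index, names=list(keys)))
        return result.reset_index()

    # combined = hash_combine(dfs)  # dfs being the list of nine frames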

Good luck.

