Combining two pandas data frames results in duplicate columns

I am trying to merge two data frames that share a key column. Several of the other columns also have the same headings (although the number of rows differs), and after merging these shared columns come back "duplicated" under their original headings with the suffixes _x, _y, etc. appended.

Does anyone know how to get pandas to remove duplicate columns in the example below?

This is my Python code:

    import pandas as pd

    holding_df = pd.read_csv('holding.csv')
    invest_df = pd.read_csv('invest.csv')
    merge_df = pd.merge(holding_df, invest_df, on='key', how='left').fillna(0)
    merge_df.to_csv('merged.csv', index=False)

And the CSV files contain the following:

First lines of the left data frame (holding_df)

    key, dept_name, res_name, year, need, holding
    DeptA_ResA_2015, DeptA, ResA, 2015, 1, 1
    DeptA_ResA_2016, DeptA, ResA, 2016, 1, 1
    DeptA_ResA_2017, DeptA, ResA, 2017, 1, 1
    ...

First lines of the right data frame (invest_df)

    key, dept_name, res_name, year, no_of_inv, inv_cost_wo_ice
    DeptA_ResA_2015, DeptA, ResA, 2015, 1, 1000000
    DeptA_ResB_2015, DeptA, ResB, 2015, 2, 6000000
    DeptB_ResB_2015, DeptB, ResB, 2015, 1, 6000000
    ...

Combined result

    key, dept_name_x, res_name_x, year_x, need, holding, dept_name_y, res_name_y, year_y, no_of_inv, inv_cost_wo_ice
    DeptA_ResA_2015, DeptA, ResA, 2015, 1, 1, DeptA, ResA, 2015.0, 1.0, 1000000.0
    DeptA_ResA_2016, DeptA, ResA, 2016, 1, 1, 0, 0, 0.0, 0.0, 0.0
    DeptA_ResA_2017, DeptA, ResA, 2017, 1, 1, 0, 0, 0.0, 0.0, 0.0
    DeptA_ResA_2018, DeptA, ResA, 2018, 1, 1, 0, 0, 0.0, 0.0, 0.0
    DeptA_ResA_2019, DeptA, ResA, 2019, 1, 1, 0, 0, 0.0, 0.0, 0.0
    ...
3 answers

The reason you get the additional columns with the "_x" and "_y" suffixes is that the two data frames share column names that are not part of the merge key, so pandas has to disambiguate the colliding columns. In this case you can drop the extra _y columns and rename the _x columns:

    In [145]:
    # define our drop function
    def drop_y(df):
        # list comprehension of the cols that end with '_y'
        to_drop = [x for x in df if x.endswith('_y')]
        df.drop(to_drop, axis=1, inplace=True)

    drop_y(merged)
    merged

    Out[145]:
                   key dept_name_x res_name_x  year_x  need  holding  \
    0  DeptA_ResA_2015       DeptA       ResA    2015     1        1
    1  DeptA_ResA_2016       DeptA       ResA    2016     1        1
    2  DeptA_ResA_2017       DeptA       ResA    2017     1        1

       no_of_inv  inv_cost_wo_ice
    0          1          1000000
    1          0                0
    2          0                0

    In [146]:
    # func to rename '_x' cols
    def rename_x(df):
        for col in df:
            if col.endswith('_x'):
                # slice off the '_x' suffix (rstrip('_x') would also eat a
                # trailing 'x' that belongs to the column name itself)
                df.rename(columns={col: col[:-len('_x')]}, inplace=True)

    rename_x(merged)
    merged

    Out[146]:
                   key dept_name res_name  year  need  holding  no_of_inv  \
    0  DeptA_ResA_2015     DeptA     ResA  2015     1        1          1
    1  DeptA_ResA_2016     DeptA     ResA  2016     1        1          0
    2  DeptA_ResA_2017     DeptA     ResA  2017     1        1          0

       inv_cost_wo_ice
    0          1000000
    1                0
    2                0
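For completeness, the same drop-and-rename step can be written as a single pass over the columns. This is just a compact variant of the two functions above, under the assumption that the only overlapping columns are the ones carrying the _x/_y suffixes:

    # drop every '_y' column, then strip the '_x' suffix from the survivors
    merged = merged.loc[:, ~merged.columns.str.endswith('_y')]
    merged.columns = [c[:-2] if c.endswith('_x') else c for c in merged.columns]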

EDIT: if you pass all of the common columns to merge on, then it shouldn't create the duplicated columns, as long as the values in those columns actually agree:

 merge_df = pd.merge(holding_df, invest_df, on=['key', 'dept_name', 'res_name', 'year'], how='left').fillna(0) 

I had the same problem with repeated columns after a left join, even when the column data was identical. I looked into it and found that NaN values are treated as unequal, even when both columns contain NaN, in pandas 0.14. As soon as you upgrade to 0.15 the problem disappears, which would explain why it works for you later on: you probably upgraded.
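If you want to see how your own pandas version treats NaN in merge keys, a minimal sketch like the one below makes the behaviour visible (the frame and column names here are made up purely for illustration):

    import numpy as np
    import pandas as pd

    print(pd.__version__)

    left = pd.DataFrame({'key': ['a', np.nan], 'val_left': [1, 2]})
    right = pd.DataFrame({'key': ['a', np.nan], 'val_right': [10, 20]})

    # If NaN keys are matched to each other, the second row comes back with
    # val_right = 20; if they are treated as unequal, it comes back as NaN.
    print(pd.merge(left, right, on='key', how='left'))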


Not exactly an answer, but pd.merge accepts a suffixes argument that lets you decide which suffixes are appended to your overlapping columns:

 merge_df = pd.merge(holding_df, invest_df, on='key', how='left', suffixes=('_holding', '_invest')).fillna(0) 

More meaningful names can be useful if you decide to keep both copies (or want to check why the columns end up duplicated in the first place).
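For example, assuming the merge above, you could compare the two suffixed copies of a shared column to see where they disagree (the column names here simply follow from the suffixes chosen in that snippet):

    # rows where the two copies differ, e.g. left rows with no match on the right
    mismatch = merge_df['dept_name_holding'] != merge_df['dept_name_invest']
    print(merge_df.loc[mismatch, ['key', 'dept_name_holding', 'dept_name_invest']])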

See the documentation for more details.


Source: https://habr.com/ru/post/979167/

