Pandas: create a new dataframe that averages duplicate columns from another dataframe

Say I have a dataframe my_df with duplicate columns, e.g.

    foo  bar  foo  hello
    0    1    1    5
    1    1    2    5
    2    1    3    5

I would like to create another dataframe that averages the duplicate columns:

    foo  bar  hello
    0.5  1    5
    1.5  1    5
    2.5  1    5

How can I do this in Pandas?

So far, I have been able to identify the duplicate column names:

    import collections

    my_columns = my_df.columns
    # Column names that occur more than once
    my_duplicates = [x for x, y in collections.Counter(my_columns).items() if y > 1]
    print(my_duplicates)

I do not know how to get Pandas to average them.

1 answer

You can group by the column index and take the mean:

    In [11]: df.groupby(level=0, axis=1).mean()
    Out[11]:
       bar  foo  hello
    0    1  0.5      5
    1    1  1.5      5
    2    1  2.5      5
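Recent pandas releases deprecate the axis=1 argument of groupby, so the same idea can be sketched by transposing, grouping on the row index, and transposing back. This is a minimal sketch, not the answer's original code: the frame is rebuilt from the question's example, sort=False is an added assumption that keeps the columns in order of first appearance, and the name averaged is purely illustrative.

```python
import pandas as pd

# Example frame from the question, with "foo" appearing twice.
df = pd.DataFrame(
    [[0, 1, 1, 5], [1, 1, 2, 5], [2, 1, 3, 5]],
    columns=["foo", "bar", "foo", "hello"],
)

# Transpose so the column labels become the row index, average rows
# that share a label, then transpose back to the original orientation.
averaged = df.T.groupby(level=0, sort=False).mean().T

print(averaged)
```

Note that the transpose only keeps a numeric dtype when every column is numeric, so this sketch applies to the all-numeric case only.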

A slightly trickier example is when there is a non-numeric column:

    In [21]: df
    Out[21]:
       foo  bar  foo hello
    0    0    1    1     a
    1    1    1    2     a
    2    2    1    3     a

The groupby call above will raise DataError: No numeric types to aggregate. The following is definitely not going to win any performance prizes, but here is a general way to do it in this case:

    In [22]: dupes = df.columns.get_duplicates()

    In [23]: dupes
    Out[23]: ['foo']

    In [24]: pd.DataFrame({d: df[d] for d in df.columns if d not in dupes})
    Out[24]:
       bar hello
    0    1     a
    1    1     a
    2    1     a

    In [25]: pd.concat(df.xs(d, axis=1) for d in dupes).groupby(level=0, axis=1).mean()
    Out[25]:
       foo
    0  0.5
    1  1.5
    2  2.5

    In [26]: pd.concat([Out[24], Out[25]], axis=1)
    Out[26]:
       foo  bar hello
    0  0.5    1     a
    1  1.5    1     a
    2  2.5    1     a
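Newer pandas releases no longer provide Index.get_duplicates, so here is a rough equivalent of the session above written against the current API. It is a sketch under assumptions rather than the original code: it finds duplicate names with Index.duplicated, assumes every duplicated column is numeric, and the names unique_part, averaged_part and result are illustrative.

```python
import pandas as pd

# Example frame from the answer: "foo" is duplicated, "hello" is non-numeric.
df = pd.DataFrame(
    [[0, 1, 1, "a"], [1, 1, 2, "a"], [2, 1, 3, "a"]],
    columns=["foo", "bar", "foo", "hello"],
)

# Names that occur more than once in the column index.
dupes = df.columns[df.columns.duplicated()].unique()

# Columns whose names are unique are kept untouched.
unique_part = df.loc[:, ~df.columns.isin(dupes)]

# For each duplicated name, average its same-named columns row by row
# (assumes those columns hold numeric data).
averaged_part = pd.DataFrame(
    {name: df.loc[:, df.columns == name].mean(axis=1) for name in dupes}
)

# Put the averaged columns first, followed by the untouched ones.
result = pd.concat([averaged_part, unique_part], axis=1)

print(result)
```

Averaging only the duplicated columns leaves the non-numeric hello column out of the aggregation entirely, which is what avoids the DataError.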

I think the takeaway is to avoid duplicate columns ... or perhaps that I don't know what I'm doing.


Source: https://habr.com/ru/post/945519/

