Using groupby and apply to add a column to each group

I have a DataFrame with a MultiIndex on the columns. I want to group by level 1 and apply a function that generates a new column. I want this calculated column to be added to each group, so that my DataFrame gains one new column per group.

I made a small dummy script and function to reproduce what I want to do.

    import pandas as pd
    import numpy as np

    columns = [('A','julian'),('A','geoffrey'),
               ('B','julian'),('B','geoffrey'),
               ('C','julian'),('C','geoffrey')]
    columns = pd.MultiIndex.from_tuples(columns)
    dataframe = pd.DataFrame(data=np.random.rand(10,6),columns=columns)

    def addColumn(inputDF):
        group = inputDF.columns[0][1]
        inputDF['sum', group] = inputDF.sum(axis=1)
        return inputDF

    newColumnsDataframe = dataframe.groupby(level=1, axis=1).apply(addColumn)

The original DataFrame looks like this:

              A                   B                   C
         julian  geoffrey    julian  geoffrey    julian  geoffrey
    0  0.204082  0.073676  0.795725  0.279702  0.258185  0.258112
    1  0.263235  0.096733  0.507324  0.541198  0.525919  0.757652
    2  0.196243  0.028613  0.653408  0.364365  0.174911  0.924733
    3  0.528785  0.831569  0.654160  0.738029  0.940831  0.294473
    4  0.853517  0.263250  0.803087  0.855270  0.701937  0.264698
    5  0.239797  0.069519  0.943544  0.374411  0.189361  0.846647
    6  0.980734  0.290414  0.850097  0.873785  0.903645  0.118713
    7  0.591942  0.088387  0.566298  0.062140  0.568482  0.872064
    8  0.818167  0.061483  0.282050  0.008404  0.449198  0.658370
    9  0.217424  0.427602  0.471933  0.171458  0.390549  0.234426

The resulting DataFrame should look like this (I computed the sums separately and combined the two DataFrames to get this result):

              A         B         C       sum         A         B         C  \
       geoffrey  geoffrey  geoffrey  geoffrey    julian    julian    julian
    0  0.073676  0.279702  0.258112  0.611491  0.204082  0.795725  0.258185
    1  0.096733  0.541198  0.757652  1.395584  0.263235  0.507324  0.525919
    2  0.028613  0.364365  0.924733  1.317710  0.196243  0.653408  0.174911
    3  0.831569  0.738029  0.294473  1.864071  0.528785  0.654160  0.940831
    4  0.263250  0.855270  0.264698  1.383219  0.853517  0.803087  0.701937
    5  0.069519  0.374411  0.846647  1.290578  0.239797  0.943544  0.189361
    6  0.290414  0.873785  0.118713  1.282912  0.980734  0.850097  0.903645
    7  0.088387  0.062140  0.872064  1.022590  0.591942  0.566298  0.568482
    8  0.061483  0.008404  0.658370  0.728257  0.818167  0.282050  0.449198
    9  0.427602  0.171458  0.234426  0.833486  0.217424  0.471933  0.390549

            sum
         julian
    0  1.257992
    1  1.296478
    2  1.024561
    3  2.123776
    4  2.358542
    5  1.372703
    6  2.734476
    7  1.726721
    8  1.549415
    9  1.079906

The approach in the script above is based on what makes sense to me and on what others have written online about how to do this. However, newColumnsDataframe still has only 6 columns, not 8 (the original 6 plus one sum column for each name).

I noticed that when I group by level=0 (that is, by A, B or C) and use transform (but NOT when I use apply at this level), newColumnsDataframe has 9 columns: one sum column is added for each group. See the code below:

    import pandas as pd
    import numpy as np

    columns = [('A','julian'),('A','geoffrey'),
               ('B','julian'),('B','geoffrey'),
               ('C','julian'),('C','geoffrey')]
    columns = pd.MultiIndex.from_tuples(columns)
    dataframe = pd.DataFrame(data=np.random.rand(10,6),columns=columns)

    def addColumn(inputDF):
        group = inputDF.columns[0][1]
        inputDF[group, 'sum'] = inputDF.sum(axis=1)
        return inputDF

    newColumnsDataframe = dataframe.groupby(level=0, axis=1).transform(addColumn)

I always understood that transform works on each column within the group, while apply is applied to the group as a single DataFrame. This result contradicts that. I also noticed that when I group by level=1 and use transform instead, it raises the following error:

 ValueError: Length mismatch: Expected axis has 10 elements, new values have 6 elements 

I am very confused about what is happening. Does anyone know why this works when I use transform and group at level=0? Why does an error occur when I do the same thing but group at level=1? And why does grouping at EITHER level and using apply with this function not add columns to my final DataFrame? Thanks in advance!

(PS: This is not the actual DataFrame or the function I use to add the column, just a simpler illustration)

1 answer

Kind of messy, but a one-liner:

    # Note: sortlevel was removed in pandas 1.0; sort_index(level=1, axis=1) replaces it.
    (df.join(pd.concat({'sum': df.groupby(level=1, axis=1).sum()}, axis=1))
       .sortlevel(level=1, axis=1))

Produces this for me:

              A         B         C       sum         A         B         C  \
       geoffrey  geoffrey  geoffrey  geoffrey    julian    julian    julian
    0  0.073676  0.279702  0.258112  0.611490  0.204082  0.795725  0.258185
    1  0.096733  0.541198  0.757652  1.395583  0.263235  0.507324  0.525919
    2  0.028613  0.364365  0.924733  1.317711  0.196243  0.653408  0.174911
    3  0.831569  0.738029  0.294473  1.864071  0.528785  0.654160  0.940831
    4  0.263250  0.855270  0.264698  1.383218  0.853517  0.803087  0.701937
    5  0.069519  0.374411  0.846647  1.290577  0.239797  0.943544  0.189361
    6  0.290414  0.873785  0.118713  1.282912  0.980734  0.850097  0.903645
    7  0.088387  0.062140  0.872064  1.022591  0.591942  0.566298  0.568482
    8  0.061483  0.008404  0.658370  0.728257  0.818167  0.282050  0.449198
    9  0.427602  0.171458  0.234426  0.833486  0.217424  0.471933  0.390549

            sum
         julian
    0  1.257992
    1  1.296478
    2  1.024562
    3  2.123776
    4  2.358541
    5  1.372702
    6  2.734476
    7  1.726722
    8  1.549415
    9  1.079906

In words: take df, group the columns by person name and sum within each group, attach the two resulting sum columns to the original df, then use sortlevel to sort the columns by level=1 along axis=1.
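For readers on recent pandas versions, the same steps can be sketched without sortlevel and without groupby(..., axis=1), which newer releases deprecate. This is only a sketch under assumptions, not the answer's original code: the small df, the names sums and result, and the use of pd.concat in place of join are all chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (assumed data) with the same column layout as the question.
columns = pd.MultiIndex.from_tuples([
    ('A', 'julian'), ('A', 'geoffrey'),
    ('B', 'julian'), ('B', 'geoffrey'),
    ('C', 'julian'), ('C', 'geoffrey'),
])
df = pd.DataFrame(np.arange(12, dtype=float).reshape(2, 6), columns=columns)

# Group the columns by name (level 1): transpose, group the rows, transpose back.
sums = df.T.groupby(level=1).sum().T

# Wrap the sums under a top-level 'sum' key so the columns become ('sum', name).
sums = pd.concat({'sum': sums}, axis=1)

# Attach the sum columns, then sort the columns by name and, within a name, by letter.
result = pd.concat([df, sums], axis=1).sort_index(axis=1, level=1)
print(result)
```

Transposing before grouping is one way to group columns without the axis argument; for very wide or mixed-dtype frames it may not be the most efficient choice.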

Note that "sum" ends up after column C only because the letter s sorts after C. If you have a column named x, this will not work. Not sure if that matters to you.

Here is the df I used, for reproducibility:
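If the sort-order caveat matters, one possible workaround is to spell out the column order instead of relying on alphabetical sorting. The sketch below assumes a hypothetical frame with a top-level column named x; the names letters, names, order and ordered are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which a top-level label ('x') sorts after 'sum'.
columns = pd.MultiIndex.from_tuples([
    ('A', 'julian'), ('A', 'geoffrey'),
    ('x', 'julian'), ('x', 'geoffrey'),
])
df = pd.DataFrame(np.arange(8, dtype=float).reshape(2, 4), columns=columns)

# Per-name sums, wrapped under a top-level 'sum' key.
sums = pd.concat({'sum': df.T.groupby(level=1).sum().T}, axis=1)
result = pd.concat([df, sums], axis=1)

# Plain sorting places 'sum' between 'A' and 'x'.
sorted_result = result.sort_index(axis=1, level=1)

# Build the desired order explicitly: every original letter, then 'sum', per name.
letters = list(df.columns.get_level_values(0).unique())
names = sorted(df.columns.get_level_values(1).unique())
order = [(letter, name) for name in names for letter in letters + ['sum']]
ordered = result.reindex(columns=pd.MultiIndex.from_tuples(order))
print(ordered)
```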

    df = pd.DataFrame({
        ('C', 'julian'): [0.258185, 0.52591899999999991, 0.17491099999999998,
                          0.94083099999999997, 0.70193700000000003, 0.189361,
                          0.90364500000000003, 0.56848199999999993,
                          0.44919799999999993, 0.39054899999999998],
        ('B', 'geoffrey'): [0.27970200000000001, 0.54119799999999996,
                            0.36436499999999999, 0.73802900000000005,
                            0.85527000000000009, 0.37441099999999999,
                            0.87378500000000003, 0.062140000000000001,
                            0.008404, 0.171458],
        ('A', 'julian'): [0.20408199999999999, 0.263235, 0.196243,
                          0.52878500000000006, 0.85351699999999997,
                          0.23979699999999998, 0.98073399999999999,
                          0.59194199999999997, 0.81816699999999998,
                          0.21742399999999998],
        ('B', 'julian'): [0.79572500000000002, 0.507324, 0.65340799999999999,
                          0.65416000000000007, 0.803087, 0.94354400000000005,
                          0.85009699999999988, 0.56629799999999997,
                          0.28205000000000002, 0.47193299999999999],
        ('A', 'geoffrey'): [0.073676000000000005, 0.096733, 0.028613, 0.831569,
                            0.26324999999999998, 0.069519000000000011,
                            0.29041400000000001, 0.088387000000000007,
                            0.061483000000000003, 0.42760200000000004],
        ('C', 'geoffrey'): [0.25811200000000001, 0.75765199999999999,
                            0.92473300000000003, 0.29447299999999998,
                            0.26469799999999999, 0.84664699999999993,
                            0.11871300000000001, 0.87206399999999995,
                            0.65837000000000001, 0.23442600000000002]},
        columns=pd.MultiIndex.from_tuples([('A','julian'),('A','geoffrey'),
                                           ('B','julian'),('B','geoffrey'),
                                           ('C','julian'),('C','geoffrey')]))

Edit:

Here's a different approach:

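    # Build one ('sum', name) column key per name at level 1.
    sum_columns = [('sum', name) for name in df.columns.levels[1].tolist()]
    # Assign the per-name sums to those new columns.
    df[sum_columns] = df.groupby(axis=1, level=1).sum()
    # Note: sortlevel was removed in pandas 1.0; sort_index(level=1, axis=1) replaces it.
    df = df.sortlevel(level=1, axis=1)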

sum_columns looks like this: [('sum', 'geoffrey'), ('sum', 'julian')].

df[sum_columns] creates a new column "sum" for each name at level 1.

If you want each sum column to sit next to the other columns for the same name, use sortlevel.
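The same idea can be sketched for recent pandas versions by assigning one ('sum', name) column at a time and sorting with sort_index. This is an illustrative sketch only: the small df and the name sums are assumptions, and the per-column loop is a deliberately conservative stand-in for the list-of-tuples assignment shown above.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (assumed data) with the question's column layout.
columns = pd.MultiIndex.from_tuples([
    ('A', 'julian'), ('A', 'geoffrey'),
    ('B', 'julian'), ('B', 'geoffrey'),
    ('C', 'julian'), ('C', 'geoffrey'),
])
df = pd.DataFrame(np.arange(12, dtype=float).reshape(2, 6), columns=columns)

# Per-name sums; the resulting columns are the names found at level 1.
sums = df.T.groupby(level=1).sum().T

# Add one ('sum', name) column per name using tuple keys.
for name in sums.columns:
    df[('sum', name)] = sums[name]

# Place each sum column next to the other columns for the same name.
df = df.sort_index(axis=1, level=1)
print(df)
```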


Source: https://habr.com/ru/post/1245117/

