Pandas: Collapse the first rows in each group by aggregating

I have a dataframe that is grouped by id. There are many groups, and each group has a variable number of rows. The first three rows of every group do not contain interesting data on their own. I would like to "collapse" the first three rows of each group into a single row, as follows:

'id' and 'type' will remain unchanged in the new "collapsed" row.
'grp_idx' will be set to 0 for the aggregated row.
'col_1' will be the sum of the first three rows.
'col_2' will be the sum of the first three rows.
'flag' in the "collapsed" row will be 0 if all values are 0 in the first three rows, and 1 if it is 1 in any of them. (A simple sum is enough for this logic, since the flag is set in at most one row per group; see the short sketch below.)

Here is an example of what a dataframe looks like:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        'id':      [283, 283, 283, 283, 283, 283, 283, 756, 756, 756],
        'type':    ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'X', 'X', 'X'],
        'grp_idx': [1, 2, 3, 4, 5, 6, 7, 1, 2, 3],
        'col_1':   [2, 4, 6, 8, 10, 12, 14, 5, 10, 15],
        'col_2':   [3, 6, 9, 12, 15, 18, 21, 1, 2, 3],
        'flag':    [0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
    })
    print(df)

        id type  grp_idx  col_1  col_2  flag
    0  283    A        1      2      3     0
    1  283    A        2      4      6     0
    2  283    A        3      6      9     0
    3  283    A        4      8     12     0
    4  283    A        5     10     15     0
    5  283    A        6     12     18     0
    6  283    A        7     14     21     1
    7  756    X        1      5      1     0
    8  756    X        2     10      2     0
    9  756    X        3     15      3     1

After processing, I expect the data structure to look like this:

     id type  grp_idx  col_1  col_2  flag
    283    A        0     12     18     0
    283    A        4      8     12     0
    283    A        5     10     15     0
    283    A        6     12     18     0
    283    A        7     14     21     1
    756    X        0     30      6     1

I am not sure how to proceed. I tried to play with

    df.groupby('id').head(3).sum()

but it does not do what I need. Any help, suggestions, or code snippets would be really appreciated.
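For reference, here is roughly what that call is doing (a sketch against the example frame above), which is why it falls short:

    # groupby('id').head(3) returns the first three rows of *each* group as a
    # plain DataFrame; the trailing .sum() then adds up all of those rows at
    # once, so the per-id structure is lost.
    first_three = df.groupby('id').head(3)                  # 6 rows: 3 per id
    totals = first_three[['col_1', 'col_2', 'flag']].sum()  # one overall total, not one per id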

2 answers

You can start by setting grp_idx:

    df["grp_idx"] = np.where(df.groupby("id").cumcount() < 3, 0, df["grp_idx"])

Now id and grp_idx (together with type, which just tags along) give the grouping you need:

    df.groupby(["id", "type", "grp_idx"]).sum().reset_index()

        id type  grp_idx  col_1  col_2  flag
    0  283    A        0     12     18     0
    1  283    A        4      8     12     0
    2  283    A        5     10     15     0
    3  283    A        6     12     18     0
    4  283    A        7     14     21     1
    5  756    X        0     30      6     1

I assumed that type cannot differ for the same id, since you did not give any conditions for that column. I also assumed that df is sorted by id; if not, sort it first so that the grp_idx replacement above picks the right rows.
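If it might not be sorted, a minimal sketch of that preliminary step (same column names as in the question):

    # Make sure rows within each id appear in grp_idx order before applying
    # the cumcount-based replacement above.
    df = df.sort_values(['id', 'grp_idx']).reset_index(drop=True)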


I tried to play with

 df.groupby('id').head(3).sum() 

After calling groupby() you need aggregate() to combine the groups the way you want. Try something like this:

    # function to sum the first 3 rows
    def head_sum(x):
        return x.head(3).sum()

    # function to get the max of the first 3 rows
    def head_max(x):
        return x.head(3).max()

    # We can use a dictionary in aggregate() to call a
    # specific function for each column in the groupby
    column_funcs = {'col_1': head_sum,
                    'col_2': head_sum,
                    'flag': head_max,
                    'id': max,    # all the vals should be the same
                    'type': max}  # are the 'id' and 'type' always matched?

    collapsed = df.groupby('id').aggregate(column_funcs)
    collapsed['grp_idx'] = 0
    new_df = pd.concat([df, collapsed])
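Note that new_df above still keeps the original first three rows of each group next to the collapsed row. A rough sketch of one way to drop them afterwards (the names rest and final are illustrative, and this assumes df is sorted by id):

    # Keep only the rows beyond the first three of each group, then append
    # the collapsed rows built above.
    rest = df[df.groupby('id').cumcount() >= 3]
    final = pd.concat([collapsed, rest]).sort_values(['id', 'grp_idx'])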

See the pandas documentation on the split-apply-combine approach for more details.


Source: https://habr.com/ru/post/1246580/

