Python pandas conditional cumulative sum

Consider my DataFrame df:

    data  data_binary  sum_data
       2            1         1
       5            0         0
       1            1         1
       4            1         2
       3            1         3
      10            0         0
       7            0         0
       3            1         1

I want to calculate the cumulative sum of data_binary within each group of consecutive 1s.

The first group of 1s contains a single 1, so sum_data is just 1. The second group contains three 1s, so sum_data is [1, 2, 3].

I tried using np.where(df['data_binary'] == 1, df['data_binary'].cumsum(), 0), but it returns

 array([1, 0, 2, 3, 4, 0, 0, 5]) 

This is not what I want.
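For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame from the table above and runs the np.where attempt. The variable name attempt is only for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame; sum_data holds the desired result.
df = pd.DataFrame({
    "data": [2, 5, 1, 4, 3, 10, 7, 3],
    "data_binary": [1, 0, 1, 1, 1, 0, 0, 1],
    "sum_data": [1, 0, 1, 2, 3, 0, 0, 1],
})

# The running total never restarts after a 0, so later groups keep counting up.
attempt = np.where(df["data_binary"] == 1, df["data_binary"].cumsum(), 0)
print(attempt)  # [1 0 2 3 4 0 0 5]
```

The zeros land in the right places, but the count carries over from one group of 1s into the next instead of starting again at 1.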

3 answers

You want to take the cumulative sum of data_binary and subtract the most recent cumulative sum at a row where data_binary is zero.

    b = df.data_binary
    c = b.cumsum()
    c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)

    0    1
    1    0
    2    1
    3    2
    4    3
    5    0
    6    0
    7    1
    Name: data_binary, dtype: int64

Explanation

Let's start by looking at each step side by side.

    cols = ['data_binary', 'cumulative_sum', 'nan_non_zero', 'forward_fill', 'final_result']
    print(pd.concat([
        b, c,
        c.mask(b != 0),
        c.mask(b != 0).ffill(),
        c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
    ], axis=1, keys=cols))

       data_binary  cumulative_sum  nan_non_zero  forward_fill  final_result
    0            1               1           NaN           NaN             1
    1            0               1           1.0           1.0             0
    2            1               2           NaN           1.0             1
    3            1               3           NaN           1.0             2
    4            1               4           NaN           1.0             3
    5            0               4           4.0           4.0             0
    6            0               4           4.0           4.0             0
    7            1               5           NaN           4.0             1

The problem with cumulative_sum is that the rows where data_binary is zero do not reset the sum. That is the motivation for this solution. How do we "reset" the sum when data_binary is zero? Easy! I keep the cumulative sum only at the rows where data_binary is zero and forward-fill those values. When I subtract this from the cumulative sum, I effectively reset the sum.
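The same steps can be wrapped in a small helper. This is only a sketch: the function name cumsum_with_reset and the intermediate name baseline are mine, not part of pandas, and I have only checked it against the 0/1 column from the question.

```python
import pandas as pd


def cumsum_with_reset(b: pd.Series) -> pd.Series:
    """Running total of b that restarts after every row where b is zero."""
    c = b.cumsum()
    # Keep the running total only at the zero rows, then carry it forward.
    baseline = c.mask(b != 0).ffill()
    # Rows before the first zero have no baseline yet; fill_value=0 covers them.
    return c.sub(baseline, fill_value=0).astype(int)


b = pd.Series([1, 0, 1, 1, 1, 0, 0, 1], name="data_binary")
result = cumsum_with_reset(b)
print(result.tolist())  # [1, 0, 1, 2, 3, 0, 0, 1]
```

With the question's frame this would be used as df['sum_data'] = cumsum_with_reset(df['data_binary']).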


I think you can use groupby with a grouped cumsum, grouping by a Series. Build that Series by first comparing each value of the column with the shifted column for inequality (ne, i.e. !=), and then create group ids with cumsum. Finally, use mask to set the result to 0 wherever data_binary is 0:

    print (df.data_binary.ne(df.data_binary.shift()).cumsum())
    0    1
    1    2
    2    3
    3    3
    4    3
    5    4
    6    4
    7    5
    Name: data_binary, dtype: int32

    df['sum_data1'] = (df.data_binary.groupby(df.data_binary.ne(df.data_binary.shift()).cumsum())
                                     .cumsum())
    df['sum_data1'] = df['sum_data1'].mask(df.data_binary == 0, 0)
    print (df)
       data  data_binary  sum_data  sum_data1
    0     2            1         1          1
    1     5            0         0          0
    2     1            1         1          1
    3     4            1         2          2
    4     3            1         3          3
    5    10            0         0          0
    6     7            0         0          0
    7     3            1         1          1

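To see the grouping idea in isolation, here is a short sketch on a standalone Series. The name run_id is my own label for the group ids, and the example assumes a strictly 0/1 column as in the question.

```python
import pandas as pd

b = pd.Series([1, 0, 1, 1, 1, 0, 0, 1], name="data_binary")

# A new run starts wherever the value differs from the previous row.
run_id = b.ne(b.shift()).cumsum()

# Cumulative sum restarted at the beginning of every run.
result = b.groupby(run_id).cumsum()

print(run_id.tolist())  # [1, 2, 3, 3, 3, 4, 4, 5]
print(result.tolist())  # [1, 0, 1, 2, 3, 0, 0, 1]
```

For a 0/1 column the runs of zeros already sum to 0, so the final mask step acts as a safeguard rather than changing the values here.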

If you want piRSquared's excellent answer as a single statement:

    df['sum_data'] = df[['data_binary']].apply(
        lambda x: x.cumsum().sub(x.cumsum().mask(x != 0).ffill(), fill_value=0).astype(int),
        axis=0)

Note that the double square brackets on the right-hand side are needed to create a single-column DataFrame instead of a Series, so that apply can be used with the axis argument (which is not available when apply is applied to a Series).


Source: https://habr.com/ru/post/1262155/

