Python pandas conditional cumulative sum

Question

Consider my DataFrame df:

  data  data_binary  sum_data
     2            1         1
     5            0         0
     1            1         1
     4            1         2
     3            1         3
    10            0         0
     7            0         0
     3            1         1

I want to compute the cumulative sum of data_binary within each group of adjacent 1 values, restarting the count after every 0.

The first group of 1s contains a single 1, so sum_data is just 1 there. The second group contains three 1s, so sum_data is [1, 2, 3].

I tried np.where(df['data_binary'] == 1, df['data_binary'].cumsum(), 0), but it returns

 array([1, 0, 2, 3, 4, 0, 0, 5])

This is not what I want.
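For reference, the frame and the attempt above can be reproduced with the minimal sketch below. How df is built is an assumption, since the question only shows the printed table; the values are copied from it.

```python
import numpy as np
import pandas as pd

# Assumed construction of the frame shown in the question.
df = pd.DataFrame({
    "data": [2, 5, 1, 4, 3, 10, 7, 3],
    "data_binary": [1, 0, 1, 1, 1, 0, 0, 1],
})

# The attempt: the cumulative sum never restarts, it is only hidden on the 0 rows.
attempt = np.where(df["data_binary"] == 1, df["data_binary"].cumsum(), 0)
print(attempt)
```

The running total keeps growing across the 0 rows, which is why the second group starts at 2 instead of 1.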

python-3.x pandas dataframe ipython

Grayhash Jan 2 '17 at 2:44

3 answers

piRSquared · Answer 1 · 2017-01-02T07:48:08+0000

You want to take the cumulative sum of data_binary and subtract the most recent value that cumulative sum had at a row where data_binary is zero.

 b = df.data_binary
 c = b.cumsum()
 c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)

 0    1
 1    0
 2    1
 3    2
 4    3
 5    0
 6    0
 7    1
 Name: data_binary, dtype: int64

Explanation

Let's start by looking at each step side by side.

 cols = ['data_binary', 'cumulative_sum', 'nan_non_zero', 'forward_fill', 'final_result']
 print(pd.concat([
     b, c,
     c.mask(b != 0),
     c.mask(b != 0).ffill(),
     c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
 ], axis=1, keys=cols))

    data_binary  cumulative_sum  nan_non_zero  forward_fill  final_result
 0            1               1           NaN           NaN             1
 1            0               1           1.0           1.0             0
 2            1               2           NaN           1.0             1
 3            1               3           NaN           1.0             2
 4            1               4           NaN           1.0             3
 5            0               4           4.0           4.0             0
 6            0               4           4.0           4.0             0
 7            1               5           NaN           4.0             1

The problem with cumulative_sum is that the rows where data_binary is zero do not reset the sum, and that is the motivation for this solution. How do we “reset” the sum when data_binary is zero? Easy! I keep the cumulative sum only at the rows where data_binary is zero and forward-fill those values. Subtracting the forward-filled values from the cumulative sum effectively resets the sum at each zero.
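The steps described above can be wrapped into a small helper. This is only a sketch: the function name cumsum_reset_on_zero and the variable baseline are made up for illustration, and the input is assumed to be a Series of 0s and 1s.

```python
import pandas as pd

def cumsum_reset_on_zero(b: pd.Series) -> pd.Series:
    """Cumulative sum of a 0/1 Series that restarts after every 0 (hypothetical helper)."""
    c = b.cumsum()
    # Keep the running total only at rows where b is zero, then carry it forward.
    baseline = c.mask(b != 0).ffill()
    # Rows before the first zero have no baseline (NaN); fill_value=0 treats it as 0.
    return c.sub(baseline, fill_value=0).astype(int)

df = pd.DataFrame({"data_binary": [1, 0, 1, 1, 1, 0, 0, 1]})
result = cumsum_reset_on_zero(df["data_binary"])
print(result.tolist())  # [1, 0, 1, 2, 3, 0, 0, 1]
```

The fill_value=0 argument matters for the first row: there is no earlier zero to subtract, so the missing baseline is treated as 0 instead of producing NaN.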

jezrael · Answer 2 · 2017-01-02T07:35:20+0000

I think you can use groupby with DataFrameGroupBy.cumsum, grouping by a Series built as follows: first compare each value with the shifted column for inequality ( != ), then create group labels with cumsum . Finally, set the rows where data_binary is 0 to 0 with mask :

 print (df.data_binary.ne(df.data_binary.shift()).cumsum())
 0    1
 1    2
 2    3
 3    3
 4    3
 5    4
 6    4
 7    5
 Name: data_binary, dtype: int32

 df['sum_data1'] = df.data_binary.groupby(df.data_binary.ne(df.data_binary.shift()).cumsum()).cumsum()
 df['sum_data1'] = df['sum_data1'].mask(df.data_binary == 0, 0)
 print (df)
    data  data_binary  sum_data  sum_data1
 0     2            1         1          1
 1     5            0         0          0
 2     1            1         1          1
 3     4            1         2          2
 4     3            1         3          3
 5    10            0         0          0
 6     7            0         0          0
 7     3            1         1          1
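The same idea as a self-contained sketch, with the group labels pulled out into a named variable. The name run_id is made up for illustration, and df is rebuilt here with only the data_binary column.

```python
import pandas as pd

df = pd.DataFrame({"data_binary": [1, 0, 1, 1, 1, 0, 0, 1]})

# A new run starts wherever the value differs from the previous row;
# the cumulative sum of those change flags gives each run its own label.
run_id = df["data_binary"].ne(df["data_binary"].shift()).cumsum()

# Cumulative sum within each run: runs of 1s count up, runs of 0s stay at 0.
df["sum_data"] = df["data_binary"].groupby(run_id).cumsum()
print(df["sum_data"].tolist())  # [1, 0, 1, 2, 3, 0, 0, 1]
```

The final mask step is left out of this sketch because a run of zeros already sums to zero on its own.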

gibbone · Answer 3 · 2019-05-09T09:43:33+0000

If you want piRSquared's excellent answer as a single command:

 df['sum_data'] = df[['data_binary']].apply(
     lambda x: x.cumsum().sub(x.cumsum().mask(x != 0).ffill(), fill_value=0).astype(int),
     axis=0)

Note that the double square brackets on the right-hand side are needed to select a one-column DataFrame instead of a Series, so that apply can be used with the axis argument (which is not available when apply is called on a Series).
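A small sketch of the bracket distinction, for illustration only. The names as_series, as_frame and out are made up here, and df is rebuilt with just the data_binary column.

```python
import pandas as pd

df = pd.DataFrame({"data_binary": [1, 0, 1, 1, 1, 0, 0, 1]})

as_series = df["data_binary"]    # single brackets select a Series
as_frame = df[["data_binary"]]   # double brackets select a one-column DataFrame

# DataFrame.apply with axis=0 passes each column to the function as a whole Series.
out = as_frame.apply(
    lambda x: x.cumsum().sub(x.cumsum().mask(x != 0).ffill(), fill_value=0).astype(int),
    axis=0,
)
print(type(as_series).__name__, type(as_frame).__name__)  # Series DataFrame
print(out["data_binary"].tolist())  # [1, 0, 1, 2, 3, 0, 0, 1]
```

The result out is itself a one-column DataFrame, so the column is selected by name before printing.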
