Rolling grouped cumulative sum

I want to compute a rolling grouped cumulative sum. I can get the result by iterating over the periods, but I would like to know if there is a more reasonable way.

Here's what the raw data looks like:

Per C   V
1   c   3
1   a   4
1   c   1
2   a   6
2   b   5
3   j   7
4   x   6
4   x   5
4   a   9
5   a   2
6   c   3
6   k   6

Here is the desired result:

Per C   V
1   c   4
1   a   4
2   c   4
2   a   10
2   b   5
3   c   4
3   a   10
3   b   5
3   j   7
4   c   4
4   a   19
4   b   5
4   j   7
4   x   11
5   c   4
5   a   21
5   b   5
5   j   7
5   x   11
6   c   7
6   a   21
6   b   5
6   j   7
6   x   11
6   k   6
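
For reference, here is a minimal sketch of the kind of explicit loop the question alludes to (an assumption; the asker's actual code is not shown), built from the sample data above:

import pandas as pd

df = pd.DataFrame({
    'Per': [1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 6, 6],
    'C':   ['c', 'a', 'c', 'a', 'b', 'j', 'x', 'x', 'a', 'a', 'c', 'k'],
    'V':   [3, 4, 1, 6, 5, 7, 6, 5, 9, 2, 3, 6],
})

# Walk the periods in order, keep a running total per C, and emit the
# accumulated totals for every C seen so far at each period.
totals = {}
rows = []
for p in sorted(df['Per'].unique()):
    for _, r in df[df['Per'] == p].iterrows():
        totals[r['C']] = totals.get(r['C'], 0) + r['V']
    rows.extend({'Per': p, 'C': c, 'V': v} for c, v in totals.items())
result = pd.DataFrame(rows)
print(result)
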
3 answers

This is a very interesting problem. Try the approach below and see if it works for you.

(
    # For each period i, take every row with Per <= i, relabel it as period i,
    # concatenate all of those slices, then sum V per (Per, C).
    pd.concat([df.loc[df.Per<=i][['C','V']].assign(Per=i) for i in df.Per.unique()])
    .groupby(by=['Per','C'])
    .sum()
    .reset_index()
)

Out[197]: 
    Per  C   V
0     1  a   4
1     1  c   4
2     2  a  10
3     2  b   5
4     2  c   4
5     3  a  10
6     3  b   5
7     3  c   4
8     3  j   7
9     4  a  19
10    4  b   5
11    4  c   4
12    4  j   7
13    4  x  11
14    5  a  21
15    5  b   5
16    5  c   4
17    5  j   7
18    5  x  11
19    6  a  21
20    6  b   5
21    6  c   7
22    6  j   7
23    6  k   6
24    6  x  11

You can use pivot_table/cumsum:

(df.pivot_table(index='Per', columns='C', values='V', aggfunc='sum')
   .fillna(0)            # zeros let the cumulative sum carry forward in every column
   .cumsum(axis=0)
   .replace(0, np.nan)   # combinations not seen yet go back to NaN...
   .stack().reset_index())   # ...so stack drops them

gives

    Per  C     0
0     1  a   4.0
1     1  c   4.0
2     2  a  10.0
3     2  b   5.0
4     2  c   4.0
5     3  a  10.0
6     3  b   5.0
7     3  c   4.0
8     3  j   7.0
9     4  a  19.0
10    4  b   5.0
11    4  c   4.0
12    4  j   7.0
13    4  x  11.0
14    5  a  21.0
15    5  b   5.0
16    5  c   4.0
17    5  j   7.0
18    5  x  11.0
19    6  a  21.0
20    6  b   5.0
21    6  c   7.0
22    6  j   7.0
23    6  k   6.0
24    6  x  11.0

On the plus side, I think the pivot_table/cumsum approach conveys the meaning of the calculation very well. Given the pivot_table, the calculation is essentially a cumulative sum down each column:

In [131]: df.pivot_table(index='Per', columns='C', values='V', aggfunc='sum')
Out[131]: 
C      a    b    c    j    k     x
Per                               
1    4.0  NaN  4.0  NaN  NaN   NaN
2    6.0  5.0  NaN  NaN  NaN   NaN
3    NaN  NaN  NaN  7.0  NaN   NaN
4    9.0  NaN  NaN  NaN  NaN  11.0
5    2.0  NaN  NaN  NaN  NaN   NaN
6    NaN  NaN  3.0  NaN  6.0   NaN

The fillna(0) ensures that the cumulative sum carries forward through periods where a C value does not appear (a NaN there would be skipped by cumsum and then dropped by stack); entries that are still 0 after cumsum belong to combinations that have not appeared at all yet, so replace(0, np.nan) turns them back into NaN and stack drops them from the final DataFrame.
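
For illustration (this snippet is an addition, not part of the original answer), the intermediate frame after fillna(0).cumsum(axis=0), before the zeros are turned back into NaN and the result is stacked, can be computed from the sample data:

intermediate = (df.pivot_table(index='Per', columns='C', values='V', aggfunc='sum')
                  .fillna(0)
                  .cumsum(axis=0))
print(intermediate)
# Expected values, computed from the sample data above:
# C       a    b    c    j    k     x
# Per
# 1     4.0  0.0  4.0  0.0  0.0   0.0
# 2    10.0  5.0  4.0  0.0  0.0   0.0
# 3    10.0  5.0  4.0  7.0  0.0   0.0
# 4    19.0  5.0  4.0  7.0  0.0  11.0
# 5    21.0  5.0  4.0  7.0  0.0  11.0
# 6    21.0  5.0  7.0  7.0  6.0  11.0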

pivot_table/cumsum is considerably faster than using_concat, though a bit slower than piRSquared's reindex-based approach. On a 1000-row df:

In [169]: %timeit using_reindex2(df)
100 loops, best of 3: 6.86 ms per loop

In [152]: %timeit using_reindex(df)
100 loops, best of 3: 8.36 ms per loop

In [80]: %timeit using_pivot(df)
100 loops, best of 3: 8.58 ms per loop

In [79]: %timeit using_concat(df)
10 loops, best of 3: 84 ms per loop

Here is the setup that was used for the benchmark:

import numpy as np
import pandas as pd

def using_pivot(df):
    return (df.pivot_table(index='P', columns='C', values='V', aggfunc='sum')
             .fillna(0)
             .cumsum(axis=0)
             .replace(0, np.nan)
             .stack().reset_index())

def using_reindex(df):
    """
    /questions/1694428/rolling-grouped-cumulative-sum/4905864#4905864 (piRSquared)
    """
    s = df.set_index(['P', 'C']).V.sum(level=[0, 1])

    return s.reindex(
        pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
        fill_value=0
    ).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()

def using_reindex2(df):
    """
    /questions/1694428/rolling-grouped-cumulative-sum/4905864#4905864 (piRSquared)
    with first line changed
    """
    s = df.groupby(['P', 'C'])['V'].sum()
    return s.reindex(
        pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
        fill_value=0
    ).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()

def using_concat(df):
    """
    https://stackoverflow.com/a/49095139/190597 (Allen)
    """
    return (pd.concat([df.loc[df.P<=i][['C','V']].assign(P=i) 
                       for i in df.P.unique()])
              .groupby(by=['P','C'])
              .sum()
              .reset_index())

def make(nrows):
    df = pd.DataFrame(np.random.randint(50, size=(nrows,3)), columns=list('PCV'))
    return df

df = make(1000)
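
One possible way to reproduce the timings outside IPython (this harness is an assumption, not part of the original answer) is the standard-library timeit module; note that using_reindex relies on the older .sum(level=...) spelling and may need adjusting on recent pandas versions:

import timeit

df = make(1000)
for func in (using_reindex2, using_reindex, using_pivot, using_concat):
    # average wall time over 10 calls, in milliseconds
    per_call = timeit.timeit(lambda: func(df), number=10) / 10
    print('{}: {:.2f} ms per call'.format(func.__name__, per_call * 1000))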

If you set the index to 'Per' and 'C', you can sum over those index levels first. Then reindex the resulting series with the Cartesian product of its index levels, so that every (Per, C) combination is present, filling the new entries with zero.

After that, groupby 'C', take the cumulative sum, and drop the zeros.

# Total V for each (Per, C) pair; .sum(level=[0, 1]) is the older spelling of
# .groupby(level=[0, 1]).sum()
s = df.set_index(['Per', 'C']).V.sum(level=[0, 1])

s.reindex(
    pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
    fill_value=0   # every (Per, C) combination now exists, missing ones as 0
).groupby('C').cumsum().loc[lambda x: x > 0].reset_index()   # drop the zeros

    Per  C   V
0     1  a   4
1     1  c   4
2     2  a  10
3     2  b   5
4     2  c   4
5     3  a  10
6     3  b   5
7     3  c   4
8     3  j   7
9     4  a  19
10    4  b   5
11    4  c   4
12    4  j   7
13    4  x  11
14    5  a  21
15    5  b   5
16    5  c   4
17    5  j   7
18    5  x  11
19    6  a  21
20    6  b   5
21    6  c   7
22    6  j   7
23    6  k   6
24    6  x  11
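
For illustration (an added snippet, not part of the original answer), you can unstack the zero-filled intermediate series to see the full (Per, C) grid that the reindex step produces before the cumulative sum:

full = s.reindex(
    pd.MultiIndex.from_product(s.index.levels, names=s.index.names),
    fill_value=0
)
# 6 x 6 grid of per-period sums, with 0 wherever a C value is absent for that Per
print(full.unstack('C'))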

Source: https://habr.com/ru/post/1694428/

