Pandas: how to group by symbol and take the mean of every n rows

Question

Suppose I have the following df data frame:

        date        symbol_a  symbol_b   ratio  
    0    2017/01/01    AAAA       AA       10
    1    2017/01/02    AAAA       AA       20
    2    2017/01/03    AAAA       AA       30
    3    2017/01/04    AAAA       AA       10
    4    2017/01/05    AAAA       AA       10 
    5    2017/01/06    AAAA       AA       10
    6    2017/01/01    BBBB       BB       10
    7    2017/01/02    BBBB       BB       20
    8    2017/01/03    BBBB       BB       30
    9    2017/01/04    BBBB       BB       10
   10    2017/01/01    CCCC       CC       10
   11    2017/01/02    CCCC       CC       20
   12    2017/01/03    CCCC       CC       30
   13    2017/01/04    CCCC       CC       10
   14    2017/01/05    CCCC       CC       10  
   15    2017/01/06    CCCC       CC        5

I'm interested in the mean of the ratio column (it comes from an earlier data frame that also had value_a and value_b columns, with ratio = value_a / value_b, more or less). I would like to do the following:

Take the mean per symbol_a (symbol_b is effectively the same key) over every n rows. Suppose n = 3.

Usually I would do something like:

df.groupby(['symbol_a','symbol_b'])['ratio'].mean()

However, I would like intermediate means every 3 days (the actual time span is much longer, and I will need every 5).

At first I assumed the number of rows per symbol would always be divisible by n, so I tried something like:

df.groupby([df.index // n, 'symbol_a', 'symbol_b'])['ratio'].mean().reset_index()

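However, the number of rows per symbol is not always a multiple of n: symbol_a "BBBB", for example, has only 4 rows. From that point on, the index-based chunks no longer line up with the symbols, and the means come out wrong.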
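A small sketch of what the index-based attempt does on the sample data may make the problem concrete. The date column is left out for brevity, and the `chunk` name is introduced here only for illustration.

```python
import pandas as pd

# Sample data from the question (date column omitted for brevity)
df = pd.DataFrame({
    "symbol_a": ["AAAA"] * 6 + ["BBBB"] * 4 + ["CCCC"] * 6,
    "symbol_b": ["AA"] * 6 + ["BB"] * 4 + ["CC"] * 6,
    "ratio": [10, 20, 30, 10, 10, 10,
              10, 20, 30, 10,
              10, 20, 30, 10, 10, 5],
})

n = 3

# Chunk number taken from the global row index, ignoring symbol boundaries
chunk = pd.Series(df.index // n, index=df.index, name="chunk")

naive = (
    df.groupby([chunk, "symbol_a", "symbol_b"])["ratio"]
      .mean()
      .reset_index()
)
print(naive)
```

Chunk 3 covers rows 9 to 11, which is the last BBBB row plus the first two CCCC rows, so it is split into two groups and every later CCCC chunk is shifted. The result has 7 rows instead of the expected 6.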

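Ideally, each symbol would be averaged in chunks of n rows, and a trailing chunk with fewer than n rows would be averaged over the rows it has (the number_of_symbols < n case).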

The expected output would be (for n = 3):

      symbol_a  symbol_b   3_mean_ratio
0       AAAA       AA          20
1       AAAA       AA          10
2       BBBB       BB          20
3       BBBB       BB          10
4       CCCC       CC          20
5       CCCC       CC         8.33

Is there a way to do this?

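Edit: ideally the result would go back into the original data frame as an n-days ratio column, with each chunk mean repeated on the rows it covers, like this: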

        date        symbol_a  symbol_b   ratio  n-days-ratio
    0    2017/01/01    AAAA       AA       10      20
    1    2017/01/02    AAAA       AA       20      20
    2    2017/01/03    AAAA       AA       30      20
    3    2017/01/04    AAAA       AA       10      10
    4    2017/01/05    AAAA       AA       10      10
    5    2017/01/06    AAAA       AA       10      10
    6    2017/01/01    BBBB       BB       10      20
    7    2017/01/02    BBBB       BB       20      20
    8    2017/01/03    BBBB       BB       30      20
    9    2017/01/04    BBBB       BB       10      10
   10    2017/01/01    CCCC       CC       10      20
   11    2017/01/02    CCCC       CC       20      20
   12    2017/01/03    CCCC       CC       30      20
   13    2017/01/04    CCCC       CC       10     8.3
   14    2017/01/05    CCCC       CC       10     8.3
   15    2017/01/06    CCCC       CC        5     8.3

+4

python pandas group-by

Tommy May 23 '17 at 2:59

2 Answers

Use cumcount() // 3 to number the rows within each symbol and split them into chunks of 3:

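cols = ['symbol_a', 'symbol_b']
# Row position within each symbol (0, 1, 2, ...), floor-divided into chunks of 3
cc = df.groupby(cols).cumcount() // 3
cols += ['Cumcount']

# Attach the chunk number as a temporary column
d1 = df.assign(Cumcount=cc)

# Mean of ratio per (symbol_a, symbol_b, chunk), then drop the chunk level
d1.groupby(cols).ratio.mean().reset_index('Cumcount', drop=True).reset_index()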

  symbol_a symbol_b      ratio
0     AAAA       AA  20.000000
1     AAAA       AA  10.000000
2     BBBB       BB  20.000000
3     BBBB       BB  10.000000
4     CCCC       CC  20.000000
5     CCCC       CC   8.333333
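For anyone who wants to run this end to end, here is a self-contained sketch of the same cumcount approach on the sample data, with n as a variable. The column name `chunk` and the variable names are illustrative choices, not part of the answer above.

```python
import pandas as pd

# Sample data from the question (date column omitted for brevity)
df = pd.DataFrame({
    "symbol_a": ["AAAA"] * 6 + ["BBBB"] * 4 + ["CCCC"] * 6,
    "symbol_b": ["AA"] * 6 + ["BB"] * 4 + ["CC"] * 6,
    "ratio": [10, 20, 30, 10, 10, 10,
              10, 20, 30, 10,
              10, 20, 30, 10, 10, 5],
})

n = 3
keys = ["symbol_a", "symbol_b"]

# cumcount restarts at 0 for every symbol, so chunks never cross symbols
chunk = df.groupby(keys).cumcount() // n

out = (
    df.assign(chunk=chunk)
      .groupby(keys + ["chunk"])["ratio"]
      .mean()
      .reset_index(level="chunk", drop=True)  # drop the helper level
      .reset_index()                          # symbols back to columns
)
print(out)
```

Because the counter restarts for each symbol, the single leftover BBBB row forms its own chunk instead of being merged with CCCC rows.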

+3

piRSquared May 23 '17 at 3:40

Scott Boston · Accepted Answer · 2017-05-23T03:04:26+0000

To add the n-days-ratio column to the original data frame:

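# Position of each row within its symbol: 0, 1, 2, ...
g = df.groupby('symbol_a').cumcount()
# Broadcast each chunk's mean of ratio back onto the rows of that chunk
df['n-days-ratio'] = df.groupby(['symbol_a', 'symbol_b', g // 3])['ratio'].transform('mean')
df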

Output:

          date symbol_a symbol_b  ratio  n-days-ratio
0   2017/01/01     AAAA       AA     10     20.000000
1   2017/01/02     AAAA       AA     20     20.000000
2   2017/01/03     AAAA       AA     30     20.000000
3   2017/01/04     AAAA       AA     10     10.000000
4   2017/01/05     AAAA       AA     10     10.000000
5   2017/01/06     AAAA       AA     10     10.000000
6   2017/01/01     BBBB       BB     10     20.000000
7   2017/01/02     BBBB       BB     20     20.000000
8   2017/01/03     BBBB       BB     30     20.000000
9   2017/01/04     BBBB       BB     10     10.000000
10  2017/01/01     CCCC       CC     10     20.000000
11  2017/01/02     CCCC       CC     20     20.000000
12  2017/01/03     CCCC       CC     30     20.000000
13  2017/01/04     CCCC       CC     10      8.333333
14  2017/01/05     CCCC       CC     10      8.333333
15  2017/01/06     CCCC       CC      5      8.333333
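As a runnable illustration of the transform step, the sketch below rebuilds the sample data (without the date column) and checks that the new column has one value per original row. The variable name `chunk` is an illustrative choice.

```python
import pandas as pd

# Sample data from the question (date column omitted for brevity)
df = pd.DataFrame({
    "symbol_a": ["AAAA"] * 6 + ["BBBB"] * 4 + ["CCCC"] * 6,
    "symbol_b": ["AA"] * 6 + ["BB"] * 4 + ["CC"] * 6,
    "ratio": [10, 20, 30, 10, 10, 10,
              10, 20, 30, 10,
              10, 20, 30, 10, 10, 5],
})

n = 3

# Chunk number of each row within its own symbol
chunk = df.groupby("symbol_a").cumcount() // n

# transform returns a Series aligned to df.index, so it can be assigned
# directly as a new column; every row receives the mean of its chunk
df["n-days-ratio"] = (
    df.groupby(["symbol_a", "symbol_b", chunk])["ratio"].transform("mean")
)
print(df)
```

The difference from a plain mean() is that transform keeps the original shape: 16 rows in, 16 rows out, which is what the edited question asks for.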

For the aggregated result:

~~g = df.groupby('symbol_a')['ratio'].transform(lambda x: x.astype(bool).cumsum().add(-1))~~

Using piRSquared's cumcount instead:

g = df.groupby('symbol_a').cumcount()

df_out = df.groupby(['symbol_a','symbol_b',g // 3])['ratio'].mean().reset_index(level=2, drop=True).reset_index()

Output:

  symbol_a symbol_b      ratio
0     AAAA       AA  20.000000
1     AAAA       AA  10.000000
2     BBBB       BB  20.000000
3     BBBB       BB  10.000000
4     CCCC       CC  20.000000
5     CCCC       CC   8.333333
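Since the question mentions that the real data needs n = 5, the same idea can be wrapped in a small helper. This is only a sketch: `chunked_mean` and its parameters are hypothetical names, not a pandas API.

```python
import pandas as pd

def chunked_mean(df, keys, value, n):
    """Mean of `value` over consecutive chunks of n rows within each `keys` group."""
    # Row position within each group, floor-divided into chunk numbers
    chunk = df.groupby(keys).cumcount() // n
    return (
        df.groupby([*keys, chunk])[value]
          .mean()
          .reset_index(level=len(keys), drop=True)  # drop the chunk level
          .reset_index()
    )

# Sample data from the question (date column omitted for brevity)
df = pd.DataFrame({
    "symbol_a": ["AAAA"] * 6 + ["BBBB"] * 4 + ["CCCC"] * 6,
    "symbol_b": ["AA"] * 6 + ["BB"] * 4 + ["CC"] * 6,
    "ratio": [10, 20, 30, 10, 10, 10,
              10, 20, 30, 10,
              10, 20, 30, 10, 10, 5],
})

out5 = chunked_mean(df, ["symbol_a", "symbol_b"], "ratio", n=5)
print(out5)
```

Note that cumcount follows the current row order, so if the rows are not already in date order within each symbol, sort them first (for example with df.sort_values(['symbol_a', 'date'])).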
