Vectorizing a pandas DataFrame computation

I have a trivial problem that I have solved using loops, but I'm trying to find out whether I can vectorize some of it to improve performance.

Essentially, I have 2 data frames (df_a and df_b), where each row in df_b is the sum of the corresponding row in df_a and the row above it in df_b. I already have the first row of values in df_b.

df_a = [
  [1, 2, 3, 4],
  [5, 6, 7, 8],
  # ... more rows
]
df_b = [
  [1, 2, 3, 4],
  # ... rows of all 0 values here, so the dimensions match df_a
]

What I'm trying to achieve is that the second row in df_b, for example, is the first row of df_b plus the second row of df_a. So in this case:

df_b.loc[1] = [6,8,10,12]  # second row (label 1 with the default 0-based index)
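
For reference, the loop-based version described above might look roughly like the sketch below. The asker's actual loop is not shown in the post, so the three-row frames and the use of `iloc` here are assumptions made for illustration only.

```python
import pandas as pd

# Small illustrative frames; the values are assumptions, not the asker's data.
df_a = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
df_b = pd.DataFrame(0, index=df_a.index, columns=df_a.columns)
df_b.iloc[0] = [1, 2, 3, 4]  # the known first row

# Each row of df_b is the previous row of df_b plus the same row of df_a.
for i in range(1, len(df_b)):
    df_b.iloc[i] = (df_b.iloc[i - 1] + df_a.iloc[i]).to_numpy()

print(df_b)
```

Every iteration depends on the row written by the previous one, which is why the loop cannot simply be replaced by a single element-wise operation.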

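If you need an explicit loop over the underlying numpy arrays rather than going through pandas, you can JIT-compile it with numba.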

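import pandas as pd
from numba import jit

# Work on the underlying numpy arrays; numba's nopython mode does not support pandas objects.
a = df_a.values
b = df_b.values

@jit(nopython=True)
def fill_b(a, b):
    # Each row of b is the previous row of b plus the same row of a.
    for i in range(1, len(b)):
        b[i] = b[i-1] + a[i]
    return b

df_b = pd.DataFrame(fill_b(a, b))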

#     0   1   2   3
# 0   1   2   3   4
# 1   6   8  10  12
# 2  15  18  21  24
# 3  28  32  36  40
# 4  45  50  55  60

Performance benchmarking, on a 1000 x 1000 frame (timed with IPython's %timeit):

import pandas as pd
import numpy as np
from numba import jit

df_a = pd.DataFrame(np.arange(1,1000001).reshape(1000,1000))

@jit(nopython=True)
def fill_b(a, b):
    for i in range(1, len(b)):
        b[i] = b[i-1] + a[i]
    return b

def jp(df_a):

    a = df_a.values
    b = np.empty(df_a.values.shape)
    b[0] = np.arange(1, 1001)

    return pd.DataFrame(fill_b(a, b))

%timeit df_a.cumsum()  # 16.1 ms
%timeit jp(df_a)       # 6.05 ms
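
As a further point of comparison, the same cumulative sum can be taken on the raw array with `numpy.cumsum` and wrapped back into a DataFrame, with no numba dependency. This is only a sketch on a small 4 x 4 frame; it was not part of the timings above, so it is worth timing on your own data before assuming it beats either option.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.arange(1, 17).reshape(4, 4))

# Cumulative sum down the rows (axis=0) on the underlying array,
# then rebuild the frame with the original index and column labels.
b = np.cumsum(df_a.to_numpy(), axis=0)
df_b = pd.DataFrame(b, index=df_a.index, columns=df_a.columns)

print(df_b)
```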

There is no need to pre-fill df_b; it is just the cumulative sum of df_a:

import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.arange(1,17).reshape(4,4))
df_b = df_a.cumsum()

    0   1   2   3
0   1   2   3   4
1   6   8  10  12
2  15  18  21  24
3  28  32  36  40
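
A plain `cumsum` matches the loop only because, in the question's example, the first row of df_b equals the first row of df_a. If the known first row were different, one possible adjustment is sketched below: unrolling the recurrence gives b[i] = first_row + a[1] + ... + a[i], which is cumsum(a)[i] - a[0] + first_row. The `first_row` values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.arange(1, 17).reshape(4, 4))
first_row = np.array([10, 20, 30, 40])  # hypothetical known first row of df_b

# Drop df_a's own first row from the running total and start from first_row instead.
df_b = df_a.cumsum() - df_a.iloc[0] + first_row

print(df_b)
```

Row 0 of the result is `first_row` itself, and every later row adds the matching row of df_a to the row above it, as in the loop version.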

Source: https://habr.com/ru/post/1694628/