Pandas vectorized function cumsum vs. NumPy

While answering the question "Vectorizing Pandas Dataframe Computing", I noticed an interesting performance issue.

I was under the impression that functions such as df.min(), df.mean(), df.cumsum(), etc., are vectorized. However, I see a huge difference in performance between df.cumsum() and the NumPy alternative.

Given that pandas uses NumPy arrays under the hood, I expected the performance to be closer. I tried to study the source code of df.cumsum(), but found it intractable. Can someone explain why it is so much slower?

Judging from the answer by @HYRY, the problem boils down to why the following two commands give such a huge difference in timings:

import pandas as pd, numpy as np
df_a = pd.DataFrame(np.arange(1,1000*1000+1).reshape(1000,1000))

%timeit pd.DataFrame(np.nancumsum(df_a.values))    #  4.18 ms
%timeit df_a.cumsum()                              # 15.7  ms

(Timings were run by one of the commenters, since my NumPy v1.11 does not have nancumsum.)

2 answers

There are a couple of things worth noting here.

First, df_a.cumsum() defaults to axis=0 (Pandas has no concept of summing the entire DataFrame in one call), while the NumPy call defaults to axis=None. So by specifying an axis for one operation and effectively flattening the other, you are comparing apples to oranges.

That said, there are three calls you could compare:
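To make the axis mismatch concrete, here is a minimal sketch on a small frame. The name small and its 3x4 shape are illustrative assumptions, not part of the question; the point is only that the flattened NumPy call returns a differently shaped result from the Pandas call.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for df_a from the question.
small = pd.DataFrame(np.arange(1, 13).reshape(3, 4))

flat = np.nancumsum(small.values)            # axis=None: flattens to 1-D first
by_col = np.nancumsum(small.values, axis=0)  # same axis DataFrame.cumsum() uses
pandas_result = small.cumsum()               # axis=0 by default

print(flat.shape)      # (12,)
print(by_col.shape)    # (3, 4)
print(np.array_equal(by_col, pandas_result.values))  # True
```

Only the axis=0 variant computes the same thing as df_a.cumsum(), so that is the fair NumPy counterpart to time.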

>>> np.cumsum(df_a, axis=0)
>>> df_a.cumsum()
>>> val.cumsum(axis=0)  # val = df_a.values

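Here val is the underlying NumPy array, and the time spent fetching .values is not counted in the runtime.

If you are working in an IPython shell, try profiling with %prun: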

>>> %prun -q -T pdcumsum.txt df_a.cumsum()

>>> val = df_a.values
>>> %prun -q -T ndarraycumsum.txt val.cumsum(axis=0)

>>> %prun -q -T df_npcumsum.txt np.cumsum(df_a, axis=0)

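The -T option saves the output to a text file, which is handy for viewing all three side by side. Here is what you end up with: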

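  • df_a.cumsum(): 186 function calls, 0.022 seconds. Of that, 0.013 seconds is spent in numpy.ndarray.cumsum(). (My guess is that if there are no NaNs, nancumsum() is not needed, but do not quote me on that.) Another portion goes to copying the array.
  • val.cumsum(axis=0): 5 function calls, 0.020 seconds. No copy is made (although this is not an in-place operation).
  • np.cumsum(df_a, axis=0): 204 function calls, 0.026 seconds. Passing a Pandas object to a top-level NumPy function appears to end up invoking the equivalent method on the Pandas object, which goes through a fair amount of overhead and then calls back into NumPy.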

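Note that, unlike with %timeit, you are making only 1 call here, as you would with %time, so I would not rely too heavily on relative timing differences from %prun; comparing the internal function calls is probably the more useful part. In this case, once the same axis is specified for each call, the timing differences are not that drastic, even though Pandas makes far more calls than NumPy. In other words, the runtime of all three calls is dominated by np.ndarray.cumsum(), and the ancillary Pandas calls do not take much time. There are other cases where the ancillary Pandas calls do eat up much more of the runtime, but this does not appear to be one of them.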
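Outside IPython, roughly the same call-count comparison can be sketched with the standard-library cProfile and pstats modules. The helper name count_calls is hypothetical, and the exact counts depend on the installed Pandas and NumPy versions, so they will not necessarily match the figures quoted in this answer.

```python
import cProfile
import pstats

import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.arange(1, 1000 * 1000 + 1).reshape(1000, 1000))
val = df_a.values

def count_calls(func):
    """Hypothetical helper: profile a single call of func and
    return the total number of function calls recorded."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    return pstats.Stats(profiler).total_calls

calls = {
    "df_a.cumsum()": count_calls(lambda: df_a.cumsum()),
    "val.cumsum(axis=0)": count_calls(lambda: val.cumsum(axis=0)),
    "np.cumsum(df_a, axis=0)": count_calls(lambda: np.cumsum(df_a, axis=0)),
}
for label, n in calls.items():
    print(f"{label}: {n} function calls")
```

As with %prun, each expression runs once, so the call counts are more informative than any timing taken this way.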

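The bigger picture, as Wes McKinney has acknowledged, is that even fairly simple Pandas operations, from indexing to summary statistics, may pass through several layers of scaffolding before reaching the lowest tier of computation.

The trade-off, you could argue, is flexibility and added functionality.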

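One last detail: within NumPy, you can avoid a little overhead by calling the instance method ndarray.cumsum() rather than the top-level function np.cumsum(), because the latter just ends up routing to the former. But, as the saying goes, premature optimization is the root of all evil.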
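A quick way to check that ndarray.cumsum() and np.cumsum() agree, and to see how small the gap between them is, is a sketch like the following. It uses the standard-library timeit module instead of %timeit; the variable names and the repeat count of 20 are arbitrary choices for illustration.

```python
import timeit

import numpy as np

val = np.arange(1, 1000 * 1000 + 1).reshape(1000, 1000)

# Both spellings compute the same column-wise cumulative sum.
same = np.array_equal(np.cumsum(val, axis=0), val.cumsum(axis=0))
print(same)

# Total seconds for 20 calls of each spelling.
t_func = timeit.timeit(lambda: np.cumsum(val, axis=0), number=20)
t_method = timeit.timeit(lambda: val.cumsum(axis=0), number=20)
print(f"np.cumsum(val, axis=0): {t_func:.4f} s")
print(f"val.cumsum(axis=0):     {t_method:.4f} s")
```

On an array this large, any dispatch overhead is likely to be lost in the noise, since the summation itself dominates.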


For reference:

>>> pd.__version__, np.__version__
('0.22.0', '1.14.0')

Pandas handles NaN values; you can see what that handling costs with:

a = np.random.randn(1000000)
%timeit np.nancumsum(a)
%timeit np.cumsum(a)

Output:

9.02 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.37 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
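The behavioral difference behind those timings can be seen on a tiny array. The sketch below uses made-up three-element input and relies on the default skipna=True of Series.cumsum().

```python
import numpy as np
import pandas as pd

a = np.array([1.0, np.nan, 2.0])

plain = np.cumsum(a)          # NaN propagates to every later element
nan_safe = np.nancumsum(a)    # NaN is treated as zero
series = pd.Series(a).cumsum()  # skipna=True: NaN stays in place, sum continues

print(plain)              # [ 1. nan nan]
print(nan_safe)           # [1. 1. 3.]
print(series.tolist())    # [1.0, nan, 3.0]
```

So np.cumsum() is the cheapest of the three only because it does no NaN bookkeeping at all.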

Source: https://habr.com/ru/post/1694622/

