Rolling max with pandas on large datasets is very slow

I have a pandas DataFrame df that has a DatetimeIndex spanning about 2 years, 2 columns, and over 30 million rows of float64 data. I quickly noticed a sharp difference in performance between df.rolling('1d').mean() and df.rolling('1d').max():

>>> n=100000; import timeit; r=df[:n].rolling('1d'); timeit.timeit(lambda: r.max(), number=1)
2.5886592870228924
>>> n=100000; import timeit; r=df[:n].rolling('1d'); timeit.timeit(lambda: r.mean(), number=1)
0.011829487979412079
>>> n=1000000; import timeit; r=df[:n].rolling('1d'); timeit.timeit(lambda: r.max(), number=1)
53.8340517100296
>>> n=1000000; import timeit; r=df[:n].rolling('1d'); timeit.timeit(lambda: r.mean(), number=1)
0.06093513499945402

As you can see, df.rolling('1d').mean() is several hundred times faster than df.rolling('1d').max(). I would expect mean to be somewhat faster, since to compute the rolling maximum pandas presumably has to keep track of the order of all the values in the window at each step. However, it is easy to see how to implement that with at most an extra log factor, so I expected a much smaller difference. If this is the best that can be done, using df.rolling('1d').max() on the full dataset will hurt, since it looks like it would take several hours each time.
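To make the complexity argument concrete, here is a minimal sketch of a sliding-window maximum that uses a monotonic deque, which is amortized O(1) per row and so even better than the log factor mentioned above. It is an illustration only, not pandas' actual implementation. The name `rolling_max_time` is hypothetical, and the sketch assumes sorted timestamps, no NaN values, and a half-open window (t - window, t], which is my understanding of the default for offset windows such as '1d'.

```python
from collections import deque


def rolling_max_time(times, values, window):
    """Rolling max over the half-open window (t - window, t].

    Assumes `times` is sorted ascending, `window` is positive,
    and `values` contains no NaN.
    """
    result = []
    candidates = deque()  # indices whose values strictly decrease left to right
    for i, (t, v) in enumerate(zip(times, values)):
        # Older values that are <= v can never be the maximum again.
        while candidates and values[candidates[-1]] <= v:
            candidates.pop()
        candidates.append(i)
        # Drop candidates that have left the window; index i always stays.
        while times[candidates[0]] <= t - window:
            candidates.popleft()
        result.append(values[candidates[0]])
    return result


if __name__ == "__main__":
    # Timestamps in arbitrary units, window of 3 units.
    print(rolling_max_time([0, 1, 2, 5, 6, 10], [3, 1, 2, 0, 4, 1], 3))
    # → [3, 3, 3, 0, 4, 1]
```

Each index is appended and removed at most once, so the whole pass is linear in the number of rows regardless of how many rows fall inside one window.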

I have run into performance issues with pandas before (Series.iloc indexing), so I'm curious whether this is a pandas problem or whether there is a faster way to do this.

 

Edit

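The problem seems to have been on the pandas side. Rolling max over the full dataset now takes about 2.35 seconds; thanks to hexgnu.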

>>> runtime(lambda: df.rolling('1d').max())
2.3093386580003425
>>> n=100000; import timeit; r=df[:n].rolling('1d'); timeit.timeit(lambda: r.max(), number=1)
0.015023122999991756
>>> n=1000000; import timeit; r=df[:n].rolling('1d'); timeit.timeit(lambda: r.max(), number=1)
0.08013121400290402
>>> n=10000000; import timeit; r=df[:n].rolling('1d'); timeit.timeit(lambda: r.max(), number=1)
0.6795377829985227
>>> import timeit; r=df.rolling('1d'); timeit.timeit(lambda: r.max(), number=1)
2.3540661859951797
>>> len(df)
32819278
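For anyone who wants to compare on their own machine or pandas version, here is a minimal reproduction sketch. The data is synthetic: the 2-second spacing, the column names, and the helper names `make_frame` and `time_rolling` are assumptions of mine, not taken from my real dataset, so absolute timings will differ from the ones above.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows, seed=0):
    """Build a synthetic two-column float64 frame with a DatetimeIndex."""
    rng = np.random.default_rng(seed)
    # Assumed spacing: one row every 2 seconds.
    index = pd.date_range("2015-01-01", periods=n_rows, freq=pd.Timedelta(seconds=2))
    data = rng.standard_normal((n_rows, 2))
    return pd.DataFrame(data, index=index, columns=["a", "b"])


def time_rolling(frame, method, window="1D"):
    """Time a single call of frame.rolling(window).<method>() in seconds."""
    roller = frame.rolling(window)
    return timeit.timeit(lambda: getattr(roller, method)(), number=1)


if __name__ == "__main__":
    print("pandas", pd.__version__)
    frame = make_frame(100_000)
    for method in ("mean", "max"):
        print(method, time_rolling(frame, method))
```

Printing `pd.__version__` alongside the timings makes it easier to tell whether a large gap between mean and max comes from the installed pandas version.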

Pandas rolling max [...]

See: the min_max implementation in pandas


Source: https://habr.com/ru/post/1693108/

