I have a pandas DataFrame df that has a DatetimeIndex spanning about 2 years, 2 columns, and over 30 million rows of float64 data. I quickly noticed a sharp difference in performance between df.rolling('1d').mean() and df.rolling('1d').max():
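(For context, a frame of roughly this shape can be generated as follows; the data here is synthetic and only an assumption so the timings below can be reproduced, my real data is not this regular.)

import numpy as np
import pandas as pd

# ~2 years at a 2-second spacing gives a bit over 31 million rows,
# roughly the size of my real frame (2 float64 columns, DatetimeIndex)
idx = pd.date_range('2015-01-01', '2017-01-01', freq='2s')
df = pd.DataFrame(np.random.randn(len(idx), 2), columns=['a', 'b'], index=idx)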
>>> n=100000; import timeit; r=df[:n].rolling('1d'); timeit.timeit(lambda: r.max(), number=1)
2.5886592870228924
>>> n=100000; import timeit; r=df[:n].rolling('1d'); timeit.timeit(lambda: r.mean(), number=1)
0.011829487979412079
>>> n=1000000; import timeit; r=df[:n].rolling('1d'); timeit.timeit(lambda: r.max(), number=1)
53.8340517100296
>>> n=1000000; import timeit; r=df[:n].rolling('1d'); timeit.timeit(lambda: r.mean(), number=1)
0.06093513499945402
As you can see, df.rolling('1d').mean() is several hundred times faster than df.rolling('1d').max(). I would expect mean to be somewhat faster, since to compute the rolling maximum pandas presumably has to keep track of the ordering of all the values in the window at each step. However, it's easy to see how to implement this with at most a logarithmic-factor overhead (sketched below), so I expected a much smaller difference. If this is the best that can be done, using df.rolling('1d').max() on the complete dataset is going to hurt, since it looks like it will take several hours each time.
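For illustration, here is a rough pure-Python sketch of the kind of algorithm I have in mind (a monotonic deque; the function name and signature are mine, and this is not a claim about what pandas actually does internally):

from collections import deque

def rolling_max_time_window(times, values, window_ns=24 * 3600 * 10**9):
    # times: sorted int64 nanosecond timestamps; values: floats
    # approximates pandas' default closed='right' window (t - 1d, t]
    out = []
    dq = deque()  # (time, value) pairs, values kept in decreasing order
    for t, v in zip(times, values):
        # evict entries that have fallen out of the trailing window
        while dq and dq[0][0] <= t - window_ns:
            dq.popleft()
        # evict entries smaller than the incoming value; they can never be the max again
        while dq and dq[-1][1] <= v:
            dq.pop()
        dq.append((t, v))
        out.append(dq[0][1])  # front of the deque is the current window max
    return out

# e.g.: rolling_max_time_window(df.index.astype('int64'), df.iloc[:, 0].to_numpy())

Each (timestamp, value) pair is appended and removed at most once, so the whole pass is amortized O(n) regardless of how many rows fall inside the window, which is why I expected rolling max to cost at most a small constant or log factor more than rolling mean.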
I've hit performance issues with pandas before (e.g. Series.iloc indexing), so I'm curious whether this is a pandas problem or whether there is a faster way to do this.
Edit
This was indeed a pandas issue. After following hexgnu's suggestion, the rolling max over the full dataset now takes about 2.35 seconds:
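(If anyone wants to compare, the installed pandas release can be checked like this:)

import pandas as pd
print(pd.__version__)  # pandas release the timings below were measured on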
>>> import timeit; timeit.timeit(lambda: df.rolling('1d').max(), number=1)
2.3093386580003425
>>> n=100000; import timeit; r=df[:n].rolling('1d'); timeit.timeit(lambda: r.max(), number=1)
0.015023122999991756
>>> n=1000000; import timeit; r=df[:n].rolling('1d'); timeit.timeit(lambda: r.max(), number=1)
0.08013121400290402
>>> n=10000000; import timeit; r=df[:n].rolling('1d'); timeit.timeit(lambda: r.max(), number=1)
0.6795377829985227
>>> import timeit; r=df.rolling('1d'); timeit.timeit(lambda: r.max(), number=1)
2.3540661859951797
>>> len(df)
32819278