I have a pandas DataFrame df that has a DatetimeIndex spanning about 2 years, 2 columns, and over 30 million rows of float64 data. I quickly noticed a sharp difference in performance between df.rolling('1d').mean() and df.rolling('1d').max():
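(For context, a frame of roughly this shape can be generated as follows; the data here is synthetic and only an assumption so the timings below can be reproduced, my real data is not this regular.)

import numpy as np
import pandas as pd

# ~2 years at a 2-second spacing gives a bit over 31 million rows,
# roughly the size of my real frame (2 float64 columns, DatetimeIndex)
idx = pd.date_range('2015-01-01', '2017-01-01', freq='2s')
df = pd.DataFrame(np.random.randn(len(idx), 2), columns=['a', 'b'], index=idx)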
>>> n=100000; import timeit; r=df[:n].rolling('1d'); timeit.timeit(lambda: r.max(), number=1)
2.5886592870228924
>>> n=100000; import timeit; r=df[:n].rolling('1d'); timeit.timeit(lambda: r.mean(), number=1)
0.011829487979412079
>>> n=1000000; import timeit; r=df[:n].rolling('1d'); timeit.timeit(lambda: r.max(), number=1)
53.8340517100296
>>> n=1000000; import timeit; r=df[:n].rolling('1d'); timeit.timeit(lambda: r.mean(), number=1)
0.06093513499945402
As you can see, df.rolling('1d').mean() is several hundred times faster than df.rolling('1d').max(). I would expect mean to be somewhat faster, since to compute the rolling maximum pandas presumably has to keep track of the ordering of all the values in the window at each step. However, it's easy to see how to implement this with at most a logarithmic-factor overhead (sketched below), so I expected a much smaller difference. If this is the best that can be done, using df.rolling('1d').max() on the complete dataset is going to hurt, since it looks like it will take several hours each time.
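For illustration, here is a rough pure-Python sketch of the kind of algorithm I have in mind (a monotonic deque; the function name and signature are mine, and this is not a claim about what pandas actually does internally):

from collections import deque

def rolling_max_time_window(times, values, window_ns=24 * 3600 * 10**9):
    # times: sorted int64 nanosecond timestamps; values: floats
    # approximates pandas' default closed='right' window (t - 1d, t]
    out = []
    dq = deque()  # (time, value) pairs, values kept in decreasing order
    for t, v in zip(times, values):
        # evict entries that have fallen out of the trailing window
        while dq and dq[0][0] <= t - window_ns:
            dq.popleft()
        # evict entries smaller than the incoming value; they can never be the max again
        while dq and dq[-1][1] <= v:
            dq.pop()
        dq.append((t, v))
        out.append(dq[0][1])  # front of the deque is the current window max
    return out

# e.g.: rolling_max_time_window(df.index.astype('int64'), df.iloc[:, 0].to_numpy())

Each (timestamp, value) pair is appended and removed at most once, so the whole pass is amortized O(n) regardless of how many rows fall inside the window, which is why I expected rolling max to cost at most a small constant or log factor more than rolling mean.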
I've hit performance issues with pandas before (e.g. Series.iloc indexing), so I'm curious whether this is a pandas problem or whether there is a faster way to do this.
Edit
This was indeed a pandas issue. After following hexgnu's suggestion, the rolling max over the full dataset now takes about 2.35 seconds:
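(If anyone wants to compare, the installed pandas release can be checked like this:)

import pandas as pd
print(pd.__version__)  # pandas release the timings below were measured on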
>>> import timeit; timeit.timeit(lambda: df.rolling('1d').max(), number=1)
2.3093386580003425
>>> n=100000; import timeit; r=df[:n].rolling('1d'); timeit.timeit(lambda: r.max(), number=1)
0.015023122999991756
>>> n=1000000; import timeit; r=df[:n].rolling('1d'); timeit.timeit(lambda: r.max(), number=1)
0.08013121400290402
>>> n=10000000; import timeit; r=df[:n].rolling('1d'); timeit.timeit(lambda: r.max(), number=1)
0.6795377829985227
>>> import timeit; r=df.rolling('1d'); timeit.timeit(lambda: r.max(), number=1)
2.3540661859951797
>>> len(df)
32819278