Pandas: achieving the speed of the "built-in" methods (e.g. mean, std) for group operations

I work with a relatively large DataFrame (~4M rows x 11 cols, numeric dtypes).

I need to perform operations based on groupby, in particular transform and aggregate. I work with on the order of 1M groups.

On my machine (i7 2600K, 8 GB RAM, Fedora 20 x64), I noticed that it is practically impossible to perform any groupby operation other than the "built-in" ones.

e.g.

  df.groupby('key').Acol.mean()

takes a split second and

  df.groupby('key').Acol.aggregate(pd.Series.mean)

may take minutes, and memory consumption explodes.

The same happens whether I pass a lambda, a pd.Series method, or any other Python-level function.

Q: "" ?

- Am I doing something wrong, or is this gap expected? Is writing the function in cython the only option?


- I need this for both aggregate and transform.

, "" ( - ?)

Note:

In case it matters: the df is read once from an hdf5 store, and after that everything happens on the in-memory df, so reading the 'hdf5' file is not part of the timings above.
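To make the transform case concrete, this is the contrast I mean (a sketch with synthetic data of the same shape; 'key' and 'Acol' as in the snippets above):

  import numpy as np
  import pandas as pd

  df = pd.DataFrame({'key': np.arange(4_000_000) // 4,
                     'Acol': np.random.randn(4_000_000)})
  g = df.groupby('key').Acol

  fast = g.transform('mean')                 # "built-in" spelling: fast
  # slow = g.transform(lambda x: x.mean())   # one python call per group: very slow here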


There are two kinds of groupby operations:

  • those that are cythonized and run at C speed,
  • those that cannot be done in cython and fall back to python (these are slow).

The "built-in" methods belong to the first kind. When you pass an arbitrary function (apply/aggregate), pandas cannot know what the function does, so it has to call it once per group, in python. An illustration:

In [28]: df = DataFrame(np.random.randn(4000000,11))

In [29]: df.groupby(df.index//4).ngroups
Out[29]: 1000000

In [30]: %timeit df.groupby(df.index//4).mean()
1 loops, best of 3: 412 ms per loop

In [31]: %timeit -n 1 df.groupby(df.index//4).apply(lambda x: x.mean())
1 loops, best of 3: 1min 22s per loop
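Note that passing the method name as a string to .agg dispatches to the same cythonized implementation, so these stay fast even with 1M groups (a sketch reusing the df above):

# string names resolve to the built-in, cythonized methods
df.groupby(df.index // 4).agg('mean')
df.groupby(df.index // 4).agg(['mean', 'std'])   # several reductions, still the fast path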

Your .aggregate(pd.Series.mean) is essentially the same as the .apply(lambda x: x.mean()) above, and goes through the same slow path.

So stick to the built-in methods whenever possible, especially with this many groups.

If what you need is not built in, try to compose it out of built-in operations:

For example, for max-min, instead of:

df.groupby(...).apply(lambda x: x.max()-x.min())

do:

g = df.groupby(...)
g.max()-g.min()
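The same composition trick works for transform; for example, to demean within each group, a sketch (assuming a frame with a 'key' column and a numeric 'Acol' column, as in the question):

g = df.groupby('key')
# fast: one cythonized group-mean, broadcast back to the original rows
demeaned = df['Acol'] - g['Acol'].transform('mean')
# equivalent, but calls python once per group:
# demeaned = g['Acol'].transform(lambda x: x - x.mean())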

Source: https://habr.com/ru/post/1545962/

