I work with a relatively large DataFrame (~4M rows x 11 cols, numeric dtypes).
I need to perform manipulations based on groupby, in particular transform and aggregate. I work with roughly O(1M) groups.
On my machine (i7 2600K, 8GB RAM, Fedora 20 x64), I noticed that it is almost impossible to perform any groupby manipulation other than the "built-in" ones.
E.g.
df.groupby('key').Acol.mean()
takes a split second and
df.groupby('key').Acol.aggregate(pd.Series.mean)
may take minutes, and memory consumption explodes.
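For reference, here is a minimal sketch that reproduces the gap on synthetic data (the column and key names are made up for illustration; the sizes mirror my case):

    import numpy as np
    import pandas as pd

    # Synthetic stand-in for my data: ~4M rows, ~1M groups, numeric dtypes.
    n_rows, n_groups = 4_000_000, 1_000_000
    df = pd.DataFrame({
        'key': np.random.randint(0, n_groups, size=n_rows),
        'Acol': np.random.randn(n_rows),
    })

    # Fast: dispatched to the cythonized built-in mean.
    fast = df.groupby('key').Acol.mean()

    # Slow: the callable is applied to each of the ~1M groups in Python.
    slow = df.groupby('key').Acol.aggregate(pd.Series.mean)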
The same happens with a lambda, a pd.Series method, or any other user-supplied function.
Q: Is this "expected" behaviour?
- If so, is there a way to speed it up? Would cython help?
- I am mainly interested in aggregate and transform.
I would also like to understand where this "behaviour" comes from (am I doing something wrong?).
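To make the question concrete, this is roughly the contrast I mean for both operations (a sketch, continuing from the synthetic df above; 'mean' stands in for any built-in reduction):

    g = df.groupby('key')

    # Built-in (cythonized) paths -- fast even with ~1M groups:
    agg_fast = g.Acol.agg('mean')         # same result as g.Acol.mean()
    trf_fast = g.Acol.transform('mean')   # broadcasts each group's mean back to its rows

    # Generic callables -- evaluated group by group in Python, much slower:
    agg_slow = g.Acol.agg(lambda s: s.mean())
    trf_slow = g.Acol.transform(lambda s: s.mean())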
Note: the df is stored in hdf5 (about 4 GB); reading the 'hdf5' file back into a df is not a problem.
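For completeness, a minimal sketch of how the df is loaded from hdf5 (the file path and key below are placeholders, not my real ones):

    import pandas as pd

    # Placeholder path/key; the real hdf5 store is ~4 GB and loads fine.
    df = pd.read_hdf('data.h5', key='df')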