How to reduce pandas data size

I am trying to shrink pandas data in order to coarsen its granularity. As an example, I want to reduce this data frame:

 1  2  3  4
 2  4  3  3
 2  2  1  3
 3  1  3  2

to this (downsampling to obtain a 2 × 2 data frame using the average value):

 2.25  3.25
 2.00  2.25
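For reference, the example frame above can be built like this:

 import pandas as pd

 df = pd.DataFrame([[1, 2, 3, 4],
                    [2, 4, 3, 3],
                    [2, 2, 1, 3],
                    [3, 1, 3, 2]])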

Is there a built-in way or an efficient way to do this, or do I need to write it myself?

thanks

+4
2 answers

One option is to use groupby twice. Once over the index:

 In [11]: df.groupby(lambda x: x/2).mean()
 Out[11]:
      0    1  2    3
 0  1.5  3.0  3  3.5
 1  2.5  1.5  2  2.5

and once over the columns:

 In [12]: df.groupby(lambda x: x/2).mean().groupby(lambda y: y/2, axis=1).mean()
 Out[12]:
       0     1
 0  2.25  3.25
 1  2.00  2.25
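(Aside: lambda x: x/2 relies on Python 2 integer division. On Python 3, and on recent pandas where groupby(..., axis=1) is deprecated, a rough equivalent is to floor-divide the labels and group the transpose instead; this is a sketch against the current API, not the version used in the session above:)

 import pandas as pd

 df = pd.DataFrame([[1, 2, 3, 4], [2, 4, 3, 3], [2, 2, 1, 3], [3, 1, 3, 2]])

 # floor-divide so the group labels stay integral, and group the
 # transpose instead of passing axis=1 to groupby
 res = df.groupby(lambda x: x // 2).mean().T.groupby(lambda y: y // 2).mean().T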

Note: a solution that computes the mean only once may be preferable... one option is to stack, group, take the mean, and unstack, but at the moment this is a little fiddly.
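(For the curious, here is one hedged sketch of that stack/group/mean/unstack idea, assuming the integer row and column labels of the example; it computes the mean only once:)

 import pandas as pd

 df = pd.DataFrame([[1, 2, 3, 4], [2, 4, 3, 3], [2, 2, 1, 3], [3, 1, 3, 2]])

 s = df.stack()  # Series indexed by (row label, column label)
 res = (s.groupby([s.index.get_level_values(0) // 2,
                   s.index.get_level_values(1) // 2])
         .mean()
         .unstack())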

This seems significantly faster than Viktor's solution:

 In [21]: df = pd.DataFrame(np.random.randn(100, 100))

 In [22]: %timeit df.groupby(lambda x: x/2).mean().groupby(lambda y: y/2, axis=1).mean()
 1000 loops, best of 3: 1.64 ms per loop

 In [23]: %timeit viktor()
 1 loops, best of 3: 822 ms per loop

In fact, Viktor's solution crashes my (underpowered) laptop for larger DataFrames:

 In [31]: df = pd.DataFrame(np.random.randn(1000, 1000))

 In [32]: %timeit df.groupby(lambda x: x/2).mean().groupby(lambda y: y/2, axis=1).mean()
 10 loops, best of 3: 42.9 ms per loop

 In [33]: %timeit viktor()  # crashes

As Viktor points out, this does not preserve a non-integer index; if you need to keep it, you can save the index and columns in temporary variables and put them back afterwards:

 df_index, df_cols = df.index, df.columns
 df.index, df.columns = np.arange(len(df.index)), np.arange(len(df.columns))

 res = df.groupby(...

 res.index, res.columns = df_index[::2], df_cols[::2]
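(A complete sketch of that trick, with the elided groupby call filled in from above; the string labels are hypothetical stand-ins for a non-integer index, and floor division / grouping the transpose are used so it also runs on Python 3 with recent pandas:)

 import numpy as np
 import pandas as pd

 df = pd.DataFrame(np.random.randn(4, 4),
                   index=list('abcd'), columns=list('wxyz'))

 # stash the labels, downsample positionally, then put the labels back
 df_index, df_cols = df.index, df.columns
 df.index, df.columns = np.arange(len(df.index)), np.arange(len(df.columns))
 res = df.groupby(lambda x: x // 2).mean().T.groupby(lambda y: y // 2).mean().T
 res.index, res.columns = df_index[::2], df_cols[::2]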
+6

You can use the rolling_mean function applied twice, first along the columns and then along the rows, and then slice out the results:

 rbs = 2  # row block size
 cbs = 2  # column block size

 pd.rolling_mean(pd.rolling_mean(df.T, cbs, center=True)[cbs-1::cbs].T, rbs)[rbs-1::rbs]

This gives the same result you want, except that the index will be different (but you can fix that with .reset_index(drop=True)):

       1     3
 1  2.25  3.25
 3  2.00  2.25

Timing info:

 In [11]: df = pd.DataFrame(np.random.randn(100, 100))

 In [12]: %%timeit
          pd.rolling_mean(pd.rolling_mean(df.T, 2, center=True)[1::2].T, 2)[1::2]
 100 loops, best of 3: 4.75 ms per loop

 In [13]: %%timeit
          df.groupby(lambda x: x/2).mean().groupby(lambda y: y/2, axis=1).mean()
 100 loops, best of 3: 932 µs per loop

So it is about 5 times slower than the groupby, not 800x :)
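(Footnote for readers on current pandas: pd.rolling_mean has since been removed. A hedged equivalent with the DataFrame.rolling API drops center=True; with the window label at the right edge, the [size-1::size] slice already picks exactly one mean per non-overlapping block:)

 import pandas as pd

 df = pd.DataFrame([[1, 2, 3, 4], [2, 4, 3, 3], [2, 2, 1, 3], [3, 1, 3, 2]])

 rbs, cbs = 2, 2  # row / column block sizes
 out = (df.T.rolling(cbs).mean().iloc[cbs - 1::cbs]   # average column pairs
          .T.rolling(rbs).mean().iloc[rbs - 1::rbs])  # then average row pairs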

+2

Source: https://habr.com/ru/post/1502381/