Best way to combine data and save table names and column names with Pandas

Question

Suppose I have a dataset like the following

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'x1':['a','a','b','b'], 'x2':[True, True, True, False], 'x3':[1,1,1,1]})
    df
      x1     x2  x3
    0  a   True   1
    1  a   True   1
    2  b   True   1
    3  b  False   1

I often want to perform a group-aggregate operation, where I group by several columns and apply several functions to one column. I also usually don't want a result with a multi-level (MultiIndex) index or columns. Achieving this takes me three lines of code, which seems excessive.

For instance:

    bg = df.groupby(['x1', 'x2']).agg({'x3': {'my_sum':np.sum, 'my_mean':np.mean}})
    bg.columns = bg.columns.droplevel(0)
    bg.reset_index()

Is there a better way? Not to nitpick, but I'm coming from an R / data.table background, where something like this is a nice one-liner:

 df[, list(my_sum=sum(x3), my_mean=mean(x3)), by=list(x1, x2)]

python pandas

Ben Jan 6 '16 at 3:49

2 answers

How about this:

    In [81]: bg = df.groupby(['x1', 'x2'], as_index=False)['x3'].agg({'my_sum':np.sum, 'my_mean':np.mean})

    In [82]: print(bg)
      x1     x2  my_sum  my_mean
    0  a   True       2        1
    1  b  False       1        1
    2  b   True       1        1
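For readers on newer pandas releases: as far as I know, passing a dict of new-name: function to agg on a selected column was deprecated and later removed, so the call above may raise an error there. A minimal sketch of the same one-liner using named aggregation, assuming pandas 0.25 or later (this is a substitution for the syntax shown in the answer, not the answerer's own code):

```python
import pandas as pd

df = pd.DataFrame({
    'x1': ['a', 'a', 'b', 'b'],
    'x2': [True, True, True, False],
    'x3': [1, 1, 1, 1],
})

# Named aggregation: output_name=(source column, aggregation function).
# as_index=False keeps the group keys x1 and x2 as ordinary columns,
# so the result has flat columns and a plain integer index.
bg = df.groupby(['x1', 'x2'], as_index=False).agg(
    my_sum=('x3', 'sum'),
    my_mean=('x3', 'mean'),
)
print(bg)
```

The column order follows the keyword order, so my_sum comes before my_mean here.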

Happy001 Jan 6 '16 at 4:16

Anton Protopopov · Accepted Answer · 2016-01-06T11:12:19+0000

You can use @Happy001's answer, but instead of as_index=False you can add reset_index to the end:

    In [1331]: df.groupby(['x1', 'x2'])['x3'].agg({'my_sum':np.sum, 'my_mean':np.mean}).reset_index()
    Out[1331]:
      x1     x2  my_mean  my_sum
    0  a   True        1       2
    1  b  False        1       1
    2  b   True        1       1

In a quick benchmark, the reset_index version runs slightly faster:

    In [1333]: %timeit df.groupby(['x1', 'x2'], as_index=False)['x3'].agg({'my_sum':np.sum, 'my_mean':np.mean})
    100 loops, best of 3: 3.18 ms per loop

    In [1334]: %timeit df.groupby(['x1', 'x2'])['x3'].agg({'my_sum':np.sum, 'my_mean':np.mean}).reset_index()
    100 loops, best of 3: 2.82 ms per loop
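Timings like these depend on the machine, the pandas version, and the size of the data, so it may be worth re-measuring locally. A rough sketch with the standard timeit module, outside IPython; the helper names with_as_index_false and with_reset_index are made up for illustration, and the named-aggregation syntax (an assumption: pandas 0.25 or later) stands in for the dict form used above:

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    'x1': ['a', 'a', 'b', 'b'],
    'x2': [True, True, True, False],
    'x3': [1, 1, 1, 1],
})


def with_as_index_false():
    # Variant 1: keep the group keys as columns via as_index=False.
    return df.groupby(['x1', 'x2'], as_index=False).agg(
        my_sum=('x3', 'sum'), my_mean=('x3', 'mean'))


def with_reset_index():
    # Variant 2: group normally, then move the keys back with reset_index().
    return df.groupby(['x1', 'x2']).agg(
        my_sum=('x3', 'sum'), my_mean=('x3', 'mean')).reset_index()


# Total seconds for 200 calls of each variant; no particular result is implied.
t_as_index = timeit.timeit(with_as_index_false, number=200)
t_reset = timeit.timeit(with_reset_index, number=200)
print('as_index=False: %.4f s' % t_as_index)
print('reset_index():  %.4f s' % t_reset)
```

On a four-row frame the difference is mostly fixed overhead, so a larger, more realistic DataFrame would give a more meaningful comparison.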

You can also do the same as your solution in one line. Transpose the DataFrame, call reset_index(level=0, drop=True) to drop the x3 level (level 0), then transpose back and call reset_index again to get the desired result:

    In [1374]: df.groupby(['x1', 'x2']).agg({'x3': {'my_sum':np.sum, 'my_mean':np.mean}}).T.reset_index(level=0, drop=True).T.reset_index()
    Out[1374]:
      x1     x2  my_mean  my_sum
    0  a   True        1       2
    1  b  False        1       1
    2  b   True        1       1
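The same flattening can probably be done without transposing at all. A minimal sketch, assuming a pandas version where DataFrame.droplevel is available (0.24 or later, as far as I know), and using a list of function names plus rename in place of the nested dict, which newer versions no longer accept:

```python
import pandas as pd

df = pd.DataFrame({
    'x1': ['a', 'a', 'b', 'b'],
    'x2': [True, True, True, False],
    'x3': [1, 1, 1, 1],
})

# A list of functions under one column yields two-level columns:
# ('x3', 'sum') and ('x3', 'mean').
wide = df.groupby(['x1', 'x2']).agg({'x3': ['sum', 'mean']})

# Drop the outer 'x3' level directly on the column axis, rename the
# remaining labels, then turn the group keys back into columns.
flat = (
    wide.droplevel(0, axis=1)
        .rename(columns={'sum': 'my_sum', 'mean': 'my_mean'})
        .reset_index()
)
print(flat)
```

This keeps the whole operation in one chained expression and avoids the two transposes.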

But it runs slower:

    In [1375]: %timeit df.groupby(['x1', 'x2']).agg({'x3': {'my_sum':np.sum, 'my_mean':np.mean}}).T.reset_index(level=0, drop=True).T.reset_index()
    100 loops, best of 3: 5.13 ms per loop
