Best way to group-aggregate and keep flat column names with pandas

Suppose I have a dataset like the following:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'x1': ['a', 'a', 'b', 'b'], 'x2': [True, True, True, False], 'x3': [1, 1, 1, 1]})
    df
      x1     x2  x3
    0  a   True   1
    1  a   True   1
    2  b   True   1
    3  b  False   1

I often want to perform a group-aggregate operation, where I group by several columns and apply several functions to one column. I also usually don't want a MultiIndex or multi-level columns in the result. Achieving this takes me three lines of code, which seems excessive.

For instance:

    bg = df.groupby(['x1', 'x2']).agg({'x3': {'my_sum': np.sum, 'my_mean': np.mean}})
    bg.columns = bg.columns.droplevel(0)
    bg.reset_index()

Is there a better way? Not to nitpick, but I come from an R/data.table background, where something like this is a nice one-liner:

 df[, list(my_sum=sum(x3), my_mean=mean(x3)), by=list(x1, x2)] 
2 answers

You can use @Happy01's answer, but instead of as_index=False you can append reset_index at the end:

    In [1331]: df.groupby(['x1', 'x2'])['x3'].agg({'my_sum': np.sum, 'my_mean': np.mean}).reset_index()
    Out[1331]:
      x1     x2  my_mean  my_sum
    0  a   True        1       2
    1  b  False        1       1
    2  b   True        1       1

In a quick benchmark, the reset_index version is slightly faster:

    In [1333]: %timeit df.groupby(['x1', 'x2'], as_index=False)['x3'].agg({'my_sum': np.sum, 'my_mean': np.mean})
    100 loops, best of 3: 3.18 ms per loop

    In [1334]: %timeit df.groupby(['x1', 'x2'])['x3'].agg({'my_sum': np.sum, 'my_mean': np.mean}).reset_index()
    100 loops, best of 3: 2.82 ms per loop
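The timings above come from a four-row frame, so they mostly measure fixed overhead. A minimal sketch for repeating the comparison on a larger frame follows. The frame size, the random data, and the names `big`, `with_as_index_false`, and `with_reset_index` are assumptions for illustration, and it uses keyword-style named aggregation so that it runs on current pandas versions. No timings are quoted here because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger frame; the size and value ranges are arbitrary choices.
rng = np.random.default_rng(0)
n = 100_000
big = pd.DataFrame({
    'x1': rng.choice(['a', 'b', 'c'], size=n),
    'x2': rng.choice([True, False], size=n),
    'x3': rng.integers(0, 10, size=n),
})


def with_as_index_false():
    # Group keys stay as regular columns because of as_index=False.
    return big.groupby(['x1', 'x2'], as_index=False).agg(
        my_sum=('x3', 'sum'), my_mean=('x3', 'mean'))


def with_reset_index():
    # Group keys start out as the index and are moved back to columns.
    return big.groupby(['x1', 'x2']).agg(
        my_sum=('x3', 'sum'), my_mean=('x3', 'mean')).reset_index()


res_a = with_as_index_false()
res_b = with_reset_index()

# Time each variant; the absolute numbers will vary between environments.
t_a = timeit.timeit(with_as_index_false, number=20)
t_b = timeit.timeit(with_reset_index, number=20)
print(f'as_index=False: {t_a:.3f} s, reset_index: {t_b:.3f} s (20 runs each)')
```

Both variants should return the same table, so the choice between them is mostly a matter of taste unless the difference turns out to matter on your data.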

You can also do the same as your solution, but in one line. Transpose the DataFrame, use reset_index to drop the x3 column level (level 0), then transpose back and call reset_index again to get the desired result:

    In [1374]: df.groupby(['x1', 'x2']).agg({'x3': {'my_sum': np.sum, 'my_mean': np.mean}}).T.reset_index(level=0, drop=True).T.reset_index()
    Out[1374]:
      x1     x2  my_mean  my_sum
    0  a   True        1       2
    1  b  False        1       1
    2  b   True        1       1

But it is slower:

    In [1375]: %timeit df.groupby(['x1', 'x2']).agg({'x3': {'my_sum': np.sum, 'my_mean': np.mean}}).T.reset_index(level=0, drop=True).T.reset_index()
    100 loops, best of 3: 5.13 ms per loop
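As a possible alternative to the double transpose, `DataFrame.droplevel` (available since pandas 0.24) can drop the outer column level inside the method chain. This is only a sketch, not part of the original answer: it assumes a recent pandas, uses the list-of-functions form of `agg`, and renames the resulting `sum` and `mean` columns explicitly.

```python
import pandas as pd

df = pd.DataFrame({
    'x1': ['a', 'a', 'b', 'b'],
    'x2': [True, True, True, False],
    'x3': [1, 1, 1, 1],
})

flat = (
    df.groupby(['x1', 'x2'])
      .agg({'x3': ['sum', 'mean']})   # two-level columns: ('x3', 'sum'), ('x3', 'mean')
      .droplevel(0, axis=1)           # drop the outer 'x3' level, leaving 'sum' and 'mean'
      .rename(columns={'sum': 'my_sum', 'mean': 'my_mean'})
      .reset_index()                  # move x1 and x2 from the index back to columns
)
print(flat)
```

This keeps everything in one expression without transposing the data twice.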

How about this:

    In [81]: bg = df.groupby(['x1', 'x2'], as_index=False)['x3'].agg({'my_sum': np.sum, 'my_mean': np.mean})

    In [82]: print(bg)
      x1     x2  my_sum  my_mean
    0  a   True       2        1
    1  b  False       1        1
    2  b   True       1        1
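Note for readers on newer pandas: passing a dict of new names to `agg` in order to rename the outputs was deprecated in pandas 0.20 and removed in 1.0, where it raises a `SpecificationError`. Since pandas 0.25 the same result can be written with named aggregation, where each keyword is the output column name and its value is a `(column, function)` tuple. A minimal sketch, assuming pandas 0.25 or later:

```python
import pandas as pd

df = pd.DataFrame({
    'x1': ['a', 'a', 'b', 'b'],
    'x2': [True, True, True, False],
    'x3': [1, 1, 1, 1],
})

# Named aggregation: keyword = (source column, aggregation function).
# as_index=False keeps x1 and x2 as ordinary columns, so the result is flat.
bg = (
    df.groupby(['x1', 'x2'], as_index=False)
      .agg(my_sum=('x3', 'sum'), my_mean=('x3', 'mean'))
)
print(bg)
```

This is about as close to the data.table one-liner as pandas gets: one call, flat column names, and no MultiIndex to clean up afterwards.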

Source: https://habr.com/ru/post/1239955/

