How to use groupby transform across multiple columns

I have a large DataFrame that I group by one to n columns, and I want to apply a function to these groups that works across two columns (e.g. foo and bar).

Here's an example frame:

    import numpy as np
    import pandas as pd

    foo_function = lambda x: np.sum(x.a + x.b)

    df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                       'b': [1, 2, 3, 4, 5, 6],
                       'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                       'd': ['z', 'z', 'z', 'o', 'o', 'o']})

    # works with apply, but I want transform:
    df.groupby(['c', 'd'])[['a', 'b']].apply(foo_function)

    # transform doesn't work!
    df.groupby(['c', 'd'])[['a', 'b']].transform(foo_function)
    # TypeError: cannot concatenate a non-NDFrame object

But transform apparently cannot combine multiple columns, because it looks at each column separately (unlike apply). What is the next best alternative in terms of speed and elegance? For example, I could use apply and then create df['new_col'] with pd.match, but that would sometimes require matching on several groupby columns (col1 and col2), which seems really hacky and would take a fair amount of code.

-> Is there a function like groupby().transform that can use functions that work on multiple columns? If this does not exist, what is the best hack?
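For reference, the apply-then-match-back workaround mentioned above could be sketched roughly as follows. This is only a sketch: it uses merge on the group keys rather than pd.match, and the names per_group and merged are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})

def foo_function(group):
    # Combine both columns and reduce each group to a single number.
    return (group['a'] + group['b']).sum()

# One value per (c, d) group, turned into a small lookup table.
per_group = (
    df.groupby(['c', 'd'])[['a', 'b']]
      .apply(foo_function)
      .rename('new_col')
      .reset_index()
)

# Broadcast the group value back to every row by joining on the group keys.
merged = df.merge(per_group, on=['c', 'd'], how='left')
print(merged)
```

It works, but the extra join on every grouping column is exactly the overhead the question is trying to avoid.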

+17
python pandas
Dec 05 '15 at 0:22
2 answers

If you need to do a groupby computation across multiple columns, do the multi-column computation first, and then the groupby:

    df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                       'b': [1, 2, 3, 4, 5, 6],
                       'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                       'd': ['z', 'z', 'z', 'o', 'o', 'o']})

    df['e'] = df['a'] + df['b']
    df['e'] = df.groupby(['c', 'd'])['e'].transform('sum')
    print(df)

gives

       a  b  c  d   e
    0  1  1  q  z  12
    1  2  2  q  z  12
    2  3  3  q  z  12
    3  4  4  q  o   8
    4  5  5  w  o  22
    5  6  6  w  o  22
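If overwriting a helper column feels awkward, a variant of the same idea is to group the combined Series directly by the key columns. This is a minimal sketch; row_total, group_total and result are illustrative names, not part of the approach above.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})

# Multi-column computation first: one value per row.
row_total = df['a'] + df['b']

# Then the groupby: the key columns are passed as Series, so no temporary
# column has to be written to df.
group_total = row_total.groupby([df['c'], df['d']]).transform('sum')

# assign returns a new frame and leaves df untouched.
result = df.assign(e=group_total)
print(result)
```

Because transform('sum') uses a built-in aggregation, this should stay fast on large frames.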



Original answer:

The error message

    TypeError: cannot concatenate a non-NDFrame object

suggests that, in order to concatenate, foo_function must return an NDFrame (e.g. a Series or DataFrame). If you return a Series, then:

    In [99]: df.groupby(['c', 'd']).transform(lambda x: pd.Series(np.sum(x['a']+x['b'])))
    Out[99]:
        a   b
    0  12  12
    1  12  12
    2  12  12
    3   8   8
    4  22  22
    5  22  22
+13
Dec 05 '15 at 2:40

The way I read the question, you want to be able to do something arbitrary with the values from both columns. You just need to make sure that you return a DataFrame of the same size as the one passed in. I think the best way is to simply create a new column, for example:

    df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                       'b': [1, 2, 3, 4, 5, 6],
                       'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                       'd': ['z', 'z', 'z', 'o', 'o', 'o']})
    df['e'] = 0

    def f(x):
        y = (x['a'] + x['b']) / sum(x['b'])
        return pd.DataFrame({'e': y, 'a': x['a'], 'b': x['b']})

    df.groupby(['c', 'd']).transform(f)

returns:

       a  b         e
    0  1  1  0.333333
    1  2  2  0.666667
    2  3  3  1.000000
    3  4  4  2.000000
    4  5  5  0.909091
    5  6  6  1.090909

If you have a very complicated DataFrame, you can select just the columns you need (for example, df.groupby(['c'])[['a', 'b', 'e']].transform(f)).

This certainly looks inelegant to me, but it is still much faster than apply on large datasets.
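For this particular example, the same per-row values can also be obtained without a custom function, by broadcasting the group sum of b with a built-in transform and dividing. This is a sketch that assumes the computation can be split into a row-wise part and a group-wise aggregate; group_b_sum is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})

# Group-wise part: the sum of b within each (c, d) group, repeated on every row.
group_b_sum = df.groupby(['c', 'd'])['b'].transform('sum')

# Row-wise part: combine a and b, then divide by the broadcast group sum.
df['e'] = (df['a'] + df['b']) / group_b_sum
print(df)
```

Truly arbitrary functions will not decompose this way, but when they do, this form avoids calling Python code once per group.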

Another alternative is to use set_index to move all the columns you need into the index, and then pass only one column to transform.
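One possible reading of the set_index idea, sketched under the assumption that the function can take its inputs from the index levels: a and b are moved into the index, a placeholder column e is handed to transform, and the function reads a and b back from the group's index. The names indexed and ratio are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})

# Move the value columns into the index so a single-column transform can see them.
indexed = df.set_index(['a', 'b'])
indexed['e'] = 0.0  # placeholder column that is actually passed to transform

def ratio(group):
    # group is the 'e' Series of one (c, d) group; a and b live in its index.
    a = group.index.get_level_values('a').to_numpy()
    b = group.index.get_level_values('b').to_numpy()
    return (a + b) / b.sum()

result = indexed.groupby(['c', 'd'])['e'].transform(ratio)

# transform keeps the original row order, so the values can be copied back by position.
df['e'] = result.to_numpy()
print(df)
```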

+2
May 23 '16 at 23:09


