How to use groupby transform across multiple columns

Question


I have a large dataframe that I group by one to n columns, and I want to apply a function to these groups that works across two columns (e.g. foo and bar).

Here's an example frame:

    import numpy as np
    import pandas as pd

    foo_function = lambda x: np.sum(x.a + x.b)

    df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                       'b': [1, 2, 3, 4, 5, 6],
                       'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                       'd': ['z', 'z', 'z', 'o', 'o', 'o']})

    # works with apply, but I want transform:
    df.groupby(['c', 'd'])[['a', 'b']].apply(foo_function)

    # transform doesn't work!
    df.groupby(['c', 'd'])[['a', 'b']].transform(foo_function)
    # TypeError: cannot concatenate a non-NDFrame object

But transform, apparently, cannot combine several columns because it looks at each column separately (unlike apply). What is the next best alternative in terms of speed / elegance? For example, I could use apply and then create df['new_col'] with pd.match, but that would sometimes require matching on several groupby columns (col1 and col2), which seems really hacky and takes a fair amount of code.

-> Is there a function like groupby().transform that can use functions that work on multiple columns? If this does not exist, what is the best hack?

+17

python pandas

Hillary Sanders Dec 05 '15 at 0:22


2 answers

As I read the question, you want to be able to do something arbitrary with the values from both columns. You just need to make sure that you return a dataframe of the same size as the one passed in. I think the best way is to simply create a new column, for example:

    df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                       'b': [1, 2, 3, 4, 5, 6],
                       'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                       'd': ['z', 'z', 'z', 'o', 'o', 'o']})
    df['e'] = 0

    def f(x):
        y = (x['a'] + x['b']) / sum(x['b'])
        return pd.DataFrame({'e': y, 'a': x['a'], 'b': x['b']})

    df.groupby(['c', 'd']).transform(f)

This produces:

       a  b         e
    0  1  1  0.333333
    1  2  2  0.666667
    2  3  3  1.000000
    3  4  4  2.000000
    4  5  5  0.909091
    5  6  6  1.090909

If you have a very complex dataframe, you can select just the columns you need (for example, df.groupby(['c'])[['a','b','e']].transform(f)).

This of course looks very inelegant to me, but it is still much faster than apply on large datasets.

Another alternative is to use set_index to capture all the columns you need and then pass only one column to transform .
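A minimal sketch of the set_index idea, assuming the same example frame as above. The helper name group_total and the choice of 'a' as the single data column are illustrative assumptions, not part of the original answer; the second column is read back from the index inside the function.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})

# Move the group keys and the extra column 'b' into the index,
# keeping the original row labels as the first index level.
indexed = df.set_index(['c', 'd', 'b'], append=True)

def group_total(a):
    # 'a' is a single-column group (a Series); 'b' travels along in its index.
    b = a.index.get_level_values('b')
    return (a.to_numpy() + b.to_numpy()).sum()

# transform now only sees one column, but the function still uses both.
result = indexed.groupby(level=['c', 'd'])['a'].transform(group_total)

# Row order is preserved, so the values can be assigned back positionally.
df['e'] = result.to_numpy()
print(df)
```

The scalar returned for each group is broadcast to every row of that group, so column e ends up with one group total per row.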

+2

Victor Chubukov May 23 '16 at 23:09


unutbu · Accepted Answer · 2015-12-05 02:40

If you need to perform a groupby computation across multiple columns, do the multi-column computation first, and then the groupby:

    df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                       'b': [1, 2, 3, 4, 5, 6],
                       'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                       'd': ['z', 'z', 'z', 'o', 'o', 'o']})

    # combine the columns row-wise first...
    df['e'] = df['a'] + df['b']
    # ...then broadcast the per-group sum back to every row
    df['e'] = df.groupby(['c', 'd'])['e'].transform('sum')
    print(df)

gives

       a  b  c  d   e
    0  1  1  q  z  12
    1  2  2  q  z  12
    2  3  3  q  z  12
    3  4  4  q  o   8
    4  5  5  w  o  22
    5  6  6  w  o  22
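The same pattern seems to extend to expressions that mix row-wise values with a group aggregate. As a sketch, assuming the (a + b) / sum(b) formula from the other answer is the target, the group-level part can be computed with a single-column transform and combined with plain column arithmetic:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})

# Per-group sum of 'b', broadcast back to the original rows.
group_b_sum = df.groupby(['c', 'd'])['b'].transform('sum')

# Row-wise numerator divided by the group-level denominator.
df['e'] = (df['a'] + df['b']) / group_b_sum
print(df)
```

Because transform('sum') uses a built-in aggregation, this avoids calling a Python function once per group.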

Original answer:

Error message:

 TypeError: cannot concatenate a non-NDFrame object

suggests that, in order to concatenate the results, foo_function must return an NDFrame (e.g. a Series or DataFrame). If you return a Series, then:

    In [99]: df.groupby(['c', 'd']).transform(lambda x: pd.Series(np.sum(x['a']+x['b'])))
    Out[99]:
        a   b
    0  12  12
    1  12  12
    2  12  12
    3   8   8
    4  22  22
    5  22  22
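For a function that cannot be split into a row-wise step and a built-in aggregation, the apply-then-match route mentioned in the question can be sketched with a join on the group keys. This is an illustration only: the variable names keys, per_group and out are assumptions, and new_col is taken from the question's wording.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [1, 2, 3, 4, 5, 6],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['z', 'z', 'z', 'o', 'o', 'o']})

keys = ['c', 'd']

# One scalar per group, computed from both columns at once.
per_group = (df.groupby(keys)[['a', 'b']]
               .apply(lambda g: np.sum(g['a'] + g['b']))
               .rename('new_col'))

# Broadcast the per-group values back to the rows by joining on the keys.
out = df.join(per_group, on=keys)
print(out)
```

The join handles any number of grouping columns, so no manual matching on col1 and col2 is needed.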
