Apply vs transform on a group object

Consider the following dataframe:

         A      B         C         D
    0  foo    one  0.162003  0.087469
    1  bar    one -1.156319 -1.526272
    2  foo    two  0.833892 -1.666304
    3  bar  three -2.026673 -0.322057
    4  foo    two  0.411452 -0.954371
    5  bar    two  0.765878 -0.095968
    6  foo    one -0.654890  0.678091
    7  foo  three -1.789842 -1.130922

The following commands work:

    > df.groupby('A').apply(lambda x: (x['C'] - x['D']))
    > df.groupby('A').apply(lambda x: (x['C'] - x['D']).mean())

but neither of the following works:

    > df.groupby('A').transform(lambda x: (x['C'] - x['D']))
    ValueError: could not broadcast input array from shape (5) into shape (5,3)

    > df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
    TypeError: cannot concatenate a non-NDFrame object

Why? The example in the documentation seems to suggest that calling transform on a group allows row-wise processing:

    # Note that the following suggests row-wise operation (x.mean is the column mean)
    zscore = lambda x: (x - x.mean()) / x.std()
    transformed = ts.groupby(key).transform(zscore)

In other words, I thought that transform is essentially a specific type of apply (one that does not aggregate). Where am I wrong?

For reference, below is the construction of the original dataframe above:

    import pandas as pd
    from numpy.random import randn

    df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                       'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                       'C' : randn(8), 'D' : randn(8)})
+44
python pandas
Dec 17 '14 at 2:27
2 answers

Since I was similarly confused by .transform vs. .apply, I looked around and found a few answers that shed some light on the issue. One of them was very helpful.

My takeaway so far is that .transform works on (or deals with) Series (columns) in isolation from each other. This means that in your last two calls:

    df.groupby('A').transform(lambda x: (x['C'] - x['D']))
    df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())

You asked .transform to take values from two columns, but it does not actually "see" both of them at the same time (so to speak). transform looks at the dataframe columns one at a time and returns a Series (or a group of Series) made of scalars that are repeated len(input_column) times.

So the scalar that .transform uses to build the Series is the result of some reduction function applied to an input Series (and only to ONE series/column at a time).

Consider this example (on your frame):

    # Note that it does not reference anything outside of 'x', and for transform 'x' is one column.
    zscore = lambda x: (x - x.mean()) / x.std()
    df.groupby('A').transform(zscore)

will give:

           C      D
    0  0.989  0.128
    1 -0.478  0.489
    2  0.889 -0.589
    3 -0.671 -1.150
    4  0.034 -0.285
    5  1.149  0.662
    6 -1.404 -0.907
    7 -0.509  1.653

This is exactly the same as if you had used it on only one column at a time:

    df.groupby('A')['C'].transform(zscore)

getting:

    0    0.989
    1   -0.478
    2    0.889
    3   -0.671
    4    0.034
    5    1.149
    6   -1.404
    7   -0.509

Note that .apply in the last example ( df.groupby('A')['C'].apply(zscore) ) would work in exactly the same way, but it fails if you try to use it on a dataframe:

    df.groupby('A').apply(zscore)

gives an error:

    ValueError: operands could not be broadcast together with shapes (6,) (2,)
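Coming back to the two failing calls from the question: if the goal is a per-group value derived from both columns and repeated on every row, one possible workaround is to combine the columns first and only then transform the resulting single Series. The sketch below assumes a small hand-made frame (not the random data above), and the names diff and mean_diff are only illustrative:

```python
import pandas as pd

# Small illustrative frame (an assumption, not the random data from the question).
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [0.5, 1.0, 0.5, 1.0],
})

# Combine the two columns first, so transform only ever sees one Series.
diff = df['C'] - df['D']

# Group the combined Series by column A and repeat each group's mean
# on every row that belongs to that group.
mean_diff = diff.groupby(df['A']).transform('mean')
print(mean_diff.tolist())
```

The result has the same index as the original frame, so it can be assigned straight back as a new column.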

So where else is .transform useful? The simplest case is assigning the results of a reduction function back to the original dataframe.

    df['sum_C'] = df.groupby('A')['C'].transform(sum)
    df.sort_values('A')  # to clearly see the scalar ('sum') applies to the whole column of the group

getting:

         A      B      C      D  sum_C
    1  bar    one  1.998  0.593  3.973
    3  bar  three  1.287 -0.639  3.973
    5  bar    two  0.687 -1.027  3.973
    4  foo    two  0.205  1.274  4.373
    2  foo    two  0.128  0.924  4.373
    6  foo    one  2.113 -0.516  4.373
    7  foo  three  0.657 -1.179  4.373
    0  foo    one  1.270  0.201  4.373

Trying the same with .apply would give NaNs in sum_C, because .apply returns a reduced Series (one value per group) that cannot be aligned back to the original rows:

    df.groupby('A')['C'].apply(sum)

giving:

    A
    bar    3.973
    foo    4.373

There are also cases where .transform is used to filter the data:
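The alignment problem can be reproduced on a tiny frame. This is only a sketch with made-up values; the column names from_reduced and from_transform are illustrative and not part of the original example:

```python
import pandas as pd

# Tiny made-up frame, used only to illustrate index alignment.
df = pd.DataFrame({'A': ['foo', 'bar', 'foo'], 'C': [1.0, 2.0, 3.0]})

# apply with a reducing function: one value per group, indexed by the group keys.
reduced = df.groupby('A')['C'].apply(lambda s: s.sum())

# transform: one value per original row, indexed like df.
broadcast = df.groupby('A')['C'].transform('sum')

# Column assignment aligns on index labels. The labels 'bar' and 'foo'
# never match the row labels 0, 1, 2, so every entry becomes NaN.
df['from_reduced'] = reduced
df['from_transform'] = broadcast
print(df)
```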

    df[df.groupby(['B'])['D'].transform(sum) < -1]

         A      B      C      D
    3  bar  three  1.287 -0.639
    7  foo  three  0.657 -1.179
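For comparison, the same kind of selection can also be written with the groupby filter method, which keeps or drops whole groups based on a boolean returned per group. A minimal sketch on made-up values (the names via_mask and via_filter are illustrative):

```python
import pandas as pd

# Made-up values; only the group sums of D matter here.
df = pd.DataFrame({
    'B': ['one', 'one', 'two', 'three', 'three'],
    'D': [0.5, 0.2, -0.3, -0.8, -0.6],
})

# Boolean mask built with transform: each row carries its group's sum of D.
via_mask = df[df.groupby('B')['D'].transform('sum') < -1]

# Same selection with filter: the function returns one boolean per group.
via_filter = df.groupby('B').filter(lambda g: g['D'].sum() < -1)

print(via_mask.index.tolist(), via_filter.index.tolist())
```

The transform version has the advantage that the mask can be combined with other row-level conditions before indexing.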

Hope this adds a bit more clarity.

+71
Jan 14 '15 at 20:34

The two main differences between apply and transform

There are two main differences between the transform and apply groupby methods.

  • apply implicitly passes all the columns for each group as a DataFrame to the user-defined function, while transform passes each column for each group as a Series to the user-defined function.
  • The user-defined function passed to apply can return a scalar, a Series, or a DataFrame (or a numpy array or even a list). The user-defined function passed to transform must return a sequence (a one-dimensional Series, array, or list) of the same length as the group.

So, transform works on just one Series at a time, while apply works on the entire DataFrame at once.
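Both differences can be seen from the results alone. The following is a minimal sketch using a small State/a/b frame; the names gap and centered are illustrative, and the columns are selected explicitly so that the grouping column is not handed to the functions:

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3],
                   'b': [6, 10, 3, 11]})

# apply: the function receives a DataFrame per group, so it can combine
# columns and may return a single scalar per group.
gap = df.groupby('State')[['a', 'b']].apply(lambda g: (g['b'] - g['a']).sum())

# transform: the function is written for one column of one group and returns
# a result of the same length, so the output keeps the original shape.
centered = df.groupby('State')[['a', 'b']].transform(lambda s: s - s.mean())

print(gap)
print(centered)
```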

Inspecting the user-defined function

It can help quite a bit to inspect the input to the user-defined function passed to apply or transform .

Examples

Let me create some sample data and inspect the groups so you can see what I'm talking about:

    df = pd.DataFrame({'State':['Texas', 'Texas', 'Florida', 'Florida'],
                       'a':[4,5,1,3], 'b':[6,10,3,11]})
    df

Let us create a simple user-defined function that prints the type of the implicitly passed object and then raises an error so that execution stops.

    def inspect(x):
        print(type(x))
        raise  # a bare raise with no active exception triggers a RuntimeError

Now pass this function to both the groupby apply and transform methods to see which object is passed to it:

    df.groupby('State').apply(inspect)

    <class 'pandas.core.frame.DataFrame'>
    <class 'pandas.core.frame.DataFrame'>
    RuntimeError

As you can see, a DataFrame is passed into the inspect function. You might be wondering why the type, DataFrame, was printed twice. Pandas runs the first group twice. It does this to determine whether there is a fast way to complete the computation. This is a minor detail that you should not worry about.

Now do the same with transform:

    df.groupby('State').transform(inspect)

    <class 'pandas.core.series.Series'>
    <class 'pandas.core.series.Series'>
    RuntimeError

It is passed a Series, a completely different pandas object.

So, transform is only allowed to work with a single Series at a time. It cannot act on two columns at the same time. If we try to subtract column a from b inside our user-defined function, we get an error with transform . See below:

    def subtract_two(x):
        return x['a'] - x['b']

    df.groupby('State').transform(subtract_two)

    KeyError: ('a', 'occurred at index a')

We get a KeyError because pandas tries to find the label a in the index of the Series, where it does not exist. You can perform this operation with apply , since it has the entire DataFrame:

    df.groupby('State').apply(subtract_two)

    State
    Florida  2   -2
             3   -8
    Texas    0   -2
             1   -5
    dtype: int64

The output is a Series, and it is a little confusing because the original index is kept, but we have access to all the columns.




Displaying the passed pandas object

It can help even more to display the entire pandas object inside the user-defined function, so that you can see exactly what you are operating on. You can use print statements, but I like to use the display function from the IPython.display module so that DataFrames are nicely rendered as HTML in a Jupyter notebook:

    from IPython.display import display

    def subtract_two(x):
        display(x)
        return x['a'] - x['b']




Transform must return a one-dimensional sequence of the same size as the group

The other difference is that transform must return a one-dimensional sequence of the same size as the group. In this particular case, each group has two rows, so transform must return a sequence of two rows. If it does not, an error is raised:

    def return_three(x):
        return np.array([1, 2, 3])

    df.groupby('State').transform(return_three)

    ValueError: transform must return a scalar value for each group

The error message does not actually describe the problem. You must return a sequence of the same length as the group. Thus, this function will work:

    def rand_group_len(x):
        return np.random.rand(len(x))

    df.groupby('State').transform(rand_group_len)

              a         b
    0  0.962070  0.151440
    1  0.440956  0.782176
    2  0.642218  0.483257
    3  0.056047  0.238208



Returning a single scalar object also works for transform

If you return just a single scalar from your user-defined function, then transform will use it for each of the rows in the group:

    def group_sum(x):
        return x.sum()

    df.groupby('State').transform(group_sum)

       a   b
    0  9  16
    1  9  16
    2  4  14
    3  4  14
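For common reductions such as this one, transform also accepts the name of the reduction as a string, which gives the same broadcast result without a user-defined function. A short sketch (the names by_func and by_name are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3],
                   'b': [6, 10, 3, 11]})

# User-defined function returning one scalar per column and group.
by_func = df.groupby('State').transform(lambda s: s.sum())

# The same reduction requested by name.
by_name = df.groupby('State').transform('sum')

print(by_func)
print(by_name)
```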
+6
Nov 06 '17 at 18:03


