The two main differences between apply and transform
There are two main differences between the transform and apply groupby methods.
Input: apply implicitly passes all the columns for each group as a DataFrame to the user-defined function, while transform passes each column for each group individually as a Series to the user-defined function.
Output: The user-defined function passed to apply can return a scalar, a Series, or a DataFrame (or a numpy array or even a list). The user-defined function passed to transform must return a sequence (a one-dimensional Series, array, or list) of the same length as the group.
So, transform works on just one Series at a time, while apply works on the entire DataFrame at once.
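As a rough sketch of how the two differences look in practice, the snippet below lets apply reduce each group to one number using two columns, while transform returns a same-length result for each column. The sample data mirrors the frame used in the examples further down; the explicit [['a', 'b']] column selection and the two lambdas are choices made for this illustration only.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3],
                   'b': [6, 10, 3, 11]})

# Assumption for this sketch: select the value columns explicitly so the
# grouping column is not part of what the function receives.
grouped = df.groupby('State')[['a', 'b']]

# apply: the function sees a DataFrame per group, so it can combine
# columns and return a single scalar per group.
spread = grouped.apply(lambda g: (g['b'] - g['a']).sum())

# transform: the function is written against one column at a time and
# returns a result with the same length as the group.
demeaned = grouped.transform(lambda s: s - s.mean())

print(spread)    # one value per State
print(demeaned)  # one row per original row, same index as df
```

The apply result is indexed by the group labels, whereas the transform result keeps the index of the original frame.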
Inspecting the user-defined function
It can help quite a bit to inspect the input that apply or transform passes to your user-defined function.
Examples
Let me create some sample data and inspect the groups so you can see what I'm talking about:
    import pandas as pd

    df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                       'a': [4, 5, 1, 3],
                       'b': [6, 10, 3, 11]})
    df
Let's create a simple user-defined function that prints the type of the implicitly passed object and then raises an error so that execution stops.
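    def inspect(x):
        print(type(x))
        # A bare raise outside an except block raises RuntimeError,
        # which is enough to halt execution here.
        raise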
Now pass this function to both the groupby apply and transform methods to see which object is passed to it:
    df.groupby('State').apply(inspect)

    <class 'pandas.core.frame.DataFrame'>
    <class 'pandas.core.frame.DataFrame'>
    RuntimeError
As you can see, a DataFrame is passed to the inspect function. You might be wondering why the DataFrame type is printed twice: pandas runs the first group twice to determine whether there is a fast way to complete the computation. This is a minor detail that you shouldn't worry about, and newer pandas versions may call the function only once here.
Now do the same with transform:
    df.groupby('State').transform(inspect)

    <class 'pandas.core.series.Series'>
    <class 'pandas.core.series.Series'>
    RuntimeError
This time a Series is passed, which is a completely different pandas object.
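If you would rather not abort with an exception, a non-raising variant of the same check might look like the sketch below. The helpers record_apply and record_transform are hypothetical names introduced here: they log the type name of whatever they receive and return something valid so the call completes. Note that, depending on the version, pandas may also try the transform function once on a whole group while it picks an internal code path, so the transform log can contain more than just Series entries; the per-column calls are the ones to look at.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3],
                   'b': [6, 10, 3, 11]})
grouped = df.groupby('State')[['a', 'b']]

seen_by_apply = []
seen_by_transform = []

def record_apply(x):
    # Hypothetical helper: remember the type name, return a scalar per group.
    seen_by_apply.append(type(x).__name__)
    return len(x)

def record_transform(x):
    # Hypothetical helper: remember the type name, return the input
    # unchanged so the result has the same length as the group.
    seen_by_transform.append(type(x).__name__)
    return x

grouped.apply(record_apply)
grouped.transform(record_transform)

print(seen_by_apply)
print(seen_by_transform)
```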
So, transform is only allowed to work with a single Series at a time. It is impossible for it to act on two columns at the same time. If we try to subtract column b from column a inside our user-defined function, we get an error with transform. See below:
    def subtract_two(x):
        return x['a'] - x['b']

    df.groupby('State').transform(subtract_two)

    KeyError: ('a', 'occurred at index a')
We get a KeyError because pandas tries to look up the label a in the index of the Series, where it does not exist. You can perform this operation with apply, since it receives the entire DataFrame:
    df.groupby('State').apply(subtract_two)

    State
    Florida  2   -2
             3   -8
    Texas    0   -2
             1   -5
    dtype: int64
The output is a Series, and it is a little confusing because the original index is kept as a second level beneath the group label, but we do have access to all the columns.
Displaying the passed pandas object
It can help even more to display the entire pandas object inside the user-defined function, so you can see exactly what you are operating on. You can use print statements, but I like to use the display function from the IPython.display module so that DataFrames are nicely rendered as HTML in a Jupyter notebook:
    from IPython.display import display

    def subtract_two(x):
        display(x)
        return x['a'] - x['b']
transform must return a one-dimensional sequence the same size as the group
The other difference is that transform must return a one-dimensional sequence the same size as the group. In this particular case, each group has two rows, so transform must return a sequence of two elements. If it does not, an error is raised:
    import numpy as np

    def return_three(x):
        return np.array([1, 2, 3])

    df.groupby('State').transform(return_three)

    ValueError: transform must return a scalar value for each group
The error message does not really describe the problem: the function must return a sequence of the same length as the group. So a function like this will work:
    def rand_group_len(x):
        return np.random.rand(len(x))

    df.groupby('State').transform(rand_group_len)

              a         b
    0  0.962070  0.151440
    1  0.440956  0.782176
    2  0.642218  0.483257
    3  0.056047  0.238208
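For a deterministic variant of the same idea, the sketch below returns a NumPy array with exactly one entry per row of the group, numbering the rows within each group. The function name position_in_group is hypothetical and only serves this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3],
                   'b': [6, 10, 3, 11]})

def position_in_group(x):
    # One entry per row of the group, so the length requirement is met.
    return np.arange(len(x))

positions = df.groupby('State')[['a', 'b']].transform(position_in_group)
print(positions)
```

Because the returned array always has len(x) elements, the result lines up with the original rows regardless of how many rows each group contains.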
Returning a single scalar also works for transform
If you return just a single scalar from your user-defined function, transform will use it for every row in the group:
    def group_sum(x):
        return x.sum()

    df.groupby('State').transform(group_sum)

       a   b
    0  9  16
    1  9  16
    2  4  14
    3  4  14
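One place this broadcasting is handy is computing each row's share of its group total. The following is a minimal sketch under that assumption; the names totals and share are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3],
                   'b': [6, 10, 3, 11]})

def group_sum(x):
    return x.sum()

# The scalar sum of each group is repeated for every row of that group,
# so totals has the same shape and index as df[['a', 'b']].
totals = df.groupby('State')[['a', 'b']].transform(group_sum)

# Element-wise division lines up row by row because the indexes match.
share = df[['a', 'b']] / totals
print(share)
```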