Using sorted index + fillna()
You are right: your code takes 3.18 s to run, while the code provided by @piRSquared takes 2.78 s.
Code example:
%%timeit
df2 = df1.groupby("group").transform(lambda x: x.fillna(x.mean()))
Output: 1 loop, best of 3: 3.18 s per loop
Improvement by @piRSquared:
%%timeit
df[['value']].fillna(df.groupby('group').transform('mean'))
Output: 1 loop, best of 3: 2.78 s per loop
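To make the two baselines above easier to try, here is a minimal sketch on synthetic data. The frame layout (a `group` column with ids 0 to 99 and a `value` column containing some NaNs), the row count and the seed are all assumptions, since the original data is not shown here. It only checks that both approaches fill in the same values; it does not reproduce the timings.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 100 groups, roughly 10% missing values.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "group": rng.integers(0, 100, size=n),
    "value": rng.normal(size=n),
})
df.loc[rng.random(n) < 0.1, "value"] = np.nan

# Baseline 1: fill inside each group with a Python-level lambda.
filled_lambda = df.groupby("group").transform(lambda x: x.fillna(x.mean()))

# Baseline 2: compute every group mean once, then fill in a single call.
group_means = df.groupby("group").transform("mean")
filled_vectorized = df[["value"]].fillna(group_means)

# Both results should agree up to floating-point noise.
same = np.allclose(filled_lambda["value"], filled_vectorized["value"])
print(same)
```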
A slightly more efficient way (using a sorted index and fillna()):
You can set the group column as the index of the data frame and sort it.
df = df.set_index('group').sort_index()
Now that you have a sorted index, it is very cheap to access the block of rows belonging to a given group using df.loc[x,:].
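As a small illustration of the lookup, here is a sketch on a tiny made-up frame (the values are hypothetical and chosen only for readability). Once the index is sorted, selecting a label returns all rows of that group as one block.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame, with the group ids deliberately out of order.
df = pd.DataFrame({
    "group": [2, 0, 1, 0, 2, 1],
    "value": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0],
})

df = df.set_index("group").sort_index()

# After sorting, the index is monotonic, so equal labels sit next to each other.
print(df.index.is_monotonic_increasing)

# All rows of group 0, selected by label.
block = df.loc[0, :]
print(block)
```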
Since you need to assign a mean value for each group, you need all the unique group identifiers. In this example you can use range (since the groups run from 0 to 99), but more generally you can use:
groups = np.unique(df.index)
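For a concrete look at the group ids, here is a minimal sketch on a made-up frame. Passing the index itself to `np.unique` yields a sorted array of the distinct labels. `df.index.unique()` is shown as a pandas-native alternative; that is my suggestion, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Made-up frame with three groups, indexed and sorted by group id.
df = pd.DataFrame({
    "group": [2, 0, 1, 0, 2, 1],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
df = df.set_index("group").sort_index()

# Sorted NumPy array of the distinct index labels.
groups = np.unique(df.index)

# pandas Index of distinct labels, in order of first appearance.
groups_pd = df.index.unique()

print(groups)
print(list(groups_pd))
```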
After that, you can iterate over the groups and use fillna() to impute the missing values:
%%timeit
for x in groups:
    df.loc[x,'value'] = df.loc[x,'value'].fillna(np.mean(df.loc[x,'value']))
Output: 1 loop, best of 3: 231 ms per loop
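The whole procedure can be sketched end to end as follows. The synthetic data (100 groups, about 10% NaNs, fixed seed) is an assumption, and the `.to_numpy()` on the assignment is my addition to sidestep index alignment on the duplicate labels. The sketch also assumes every group has more than one row, so that `df.loc[x, 'value']` returns a Series rather than a scalar. The result is checked against a groupby-based reference.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 100 groups, roughly 10% missing values.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "group": rng.integers(0, 100, size=n),
    "value": rng.normal(size=n),
})
df.loc[rng.random(n) < 0.1, "value"] = np.nan

# One-time setup: index by group, sort, collect the distinct group ids.
df = df.set_index("group").sort_index()
groups = np.unique(df.index)

# Reference result from groupby, built with plain arrays for comparison.
means = df.groupby(level=0)["value"].transform("mean").to_numpy()
values = df["value"].to_numpy()
expected = np.where(np.isnan(values), means, values)

# Impute group by group through the sorted index.
for x in groups:
    block = df.loc[x, "value"]
    df.loc[x, "value"] = block.fillna(block.mean()).to_numpy()

matches = np.allclose(df["value"].to_numpy(), expected)
print(matches)
```

Wrapping only the `for` loop in `%%timeit` measures the imputation step on its own, separately from the setup.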
Note: set_index, sort_index and np.unique are a one-time cost. To be fair, the total time (including these operations) was 2.26 s on my machine, but the imputation itself took only 231 ms.