Pandas: filling missing values with the per-group mean, faster than transform

I need to fill the missing values in a pandas DataFrame with the mean value of each group. According to this question, transform can achieve this.

However, transform is too slow for my purposes.

For example, take the following setup: a large DataFrame with 100 different groups and 70% NaN values:

 import pandas as pd
 import numpy as np

 size = 10000000                                       # DataFrame length
 ngroups = 100                                         # Number of groups
 randgroups = np.random.randint(ngroups, size=size)    # Create the groups
 randvals = np.random.rand(size) * randgroups * 2      # Random values with mean equal to the group number
 nan_indices = np.random.permutation(range(size))      # Candidate NaN indices
 nanfrac = 0.7                                         # Fraction of NaN values
 nan_indices = nan_indices[:int(nanfrac*size)]         # Take a fraction of the indices
 randvals[nan_indices] = np.NaN                        # Set NaN values
 df = pd.DataFrame({'value': randvals, 'group': randgroups})  # Create the DataFrame
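As a quick sanity check (not part of the question, just to confirm the setup behaves as described), the NaN fraction and the per-group means come out as expected:

 df['value'].isnull().mean()              # ~0.7
 df.groupby('group')['value'].mean()      # Roughly 0, 1, 2, ... per group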

Using transform via

 df.groupby("group").transform(lambda x: x.fillna(x.mean())) # Takes too long 

takes more than 3 seconds on my computer. I need something an order of magnitude faster (buying a bigger machine is not an option :-D).

So how can I fill the missing values faster?

3 answers

Here's a NumPy approach using np.bincount , which is quite efficient for such bin-based summing/averaging operations:

 ids = df.group.values                                 # Extract the two columns as arrays
 vals = df.value.values
 m = np.isnan(vals)                                    # Mask of NaNs
 grp_sums = np.bincount(ids, np.where(m, 0, vals))     # Group sums, with NaNs counted as 0
 avg_vals = grp_sums * (1.0 / np.bincount(ids, ~m))    # Group averages
 vals[m] = avg_vals[ids[m]]                            # Write the averages into the NaN positions

Note that this updates the value column in place.
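The timing below calls bincount_based(df) , which is not defined in the answer itself; a minimal sketch of such a wrapper around the snippet above (name and structure assumed from the %timeit call) would be:

 def bincount_based(df):
     # Fill NaNs in df['value'] with each group's mean, writing through the
     # underlying array (works because .values is a view here in classic pandas)
     ids = df.group.values
     vals = df.value.values
     m = np.isnan(vals)
     grp_sums = np.bincount(ids, np.where(m, 0, vals))
     avg_vals = grp_sums * (1.0 / np.bincount(ids, ~m))
     vals[m] = avg_vals[ids[m]]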

Runtime test

Data:

 size = 1000000      # DataFrame length
 ngroups = 10        # Number of groups

Timings:

 In [17]: %timeit df.groupby("group").transform(lambda x: x.fillna(x.mean()))
 1 loops, best of 3: 276 ms per loop

 In [18]: %timeit bincount_based(df)
 100 loops, best of 3: 13.6 ms per loop

 In [19]: 276.0/13.6   # Speedup
 Out[19]: 20.294117647058822

A 20x+ speedup right there!


You are doing it wrong; it's slow because you use a lambda:

 df[['value']].fillna(df.groupby('group').transform('mean')) 
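This returns a new object rather than modifying df ; to keep the result, assign it back (a minimal sketch of one way to do so):

 df['value'] = df['value'].fillna(df.groupby('group')['value'].transform('mean'))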

Using sorted index + fillna()

You are right: your code takes 3.18 s to run. The code provided by @piRSquared takes 2.78 s.

  • Original code:

     %%timeit
     df2 = df.groupby("group").transform(lambda x: x.fillna(x.mean()))

    Output: 1 loop, best of 3: 3.18 s per loop

  • piRSquared's improvement:

     %%timeit
     df[['value']].fillna(df.groupby('group').transform('mean'))

    Output: 1 loop, best of 3: 2.78 s per loop

  • A slightly more efficient way (using a sorted index and fillna() ):

You can set the group column as the index of the data frame and sort it.

 df = df.set_index('group').sort_index()

Now that you have a sorted index, it is very cheap to access a subset of the data frame by group number with df.loc[x,:]

Since you need to assign a mean value to each group, you need all the unique group identifiers. In this example you could use range (since the groups run from 0 to 99), but more generally you can use:

 groups = np.unique(df.index.values)

After that you can iterate over the groups and use fillna() on each of them:

 %%timeit
 for x in groups:
     df.loc[x, 'value'] = df.loc[x, 'value'].fillna(np.mean(df.loc[x, 'value']))

 Output: 1 loop, best of 3: 231 ms per loop

Note: set_index , sort_index and np.unique are a one-time cost. To be fair, the total time (including these operations) was 2.26 s on my machine, but the imputation part took only 231 ms.
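For completeness, the whole approach in one place (a sketch assembled from the steps above; reset_index() at the end turns group back into a regular column, though the rows stay sorted by group):

 df = df.set_index('group').sort_index()    # One-time cost
 groups = np.unique(df.index.values)        # All unique group labels
 for x in groups:
     df.loc[x, 'value'] = df.loc[x, 'value'].fillna(np.mean(df.loc[x, 'value']))
 df = df.reset_index()                      # 'group' becomes a column again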


Source: https://habr.com/ru/post/1012460/

