Fastest way to apply a function to DataFrame slices by column value (Python pandas)

I am trying to create a column in a DataFrame that contains, for each row, the minimum of column A (a column of values) over all rows where column B (a column of ids) has the same value. My code is very slow and I am looking for a faster way to do this. Here is my little function:

    def apply_by_id_value(df, id_col="id_col", val_col="val_col", offset_col="offset", f=min):
        # for each unique id, slice the frame and write f() of that slice's values
        for rid in set(df[id_col].values):
            df.loc[df[id_col] == rid, offset_col] = f(df[df[id_col] == rid][val_col])
        return df

And an example of use:

    import pandas as pd
    import numpy as np

    # create data frame
    df = pd.DataFrame({"id_col":[0, 0, 0, 1, 1, 1, 2, 2, 2],
                       "val_col":[0.1, 0.2, 0.3, 0.6, 0.4, 0.5, 0.2, 0.1, 0.0]})
    print df.head(10)

    # output
       id_col  val_col
    0       0      0.1
    1       0      0.2
    2       0      0.3
    3       1      0.6
    4       1      0.4
    5       1      0.5
    6       2      0.2
    7       2      0.1
    8       2      0.0

    df = apply_by_id_value(df)
    print df.head(10)

    # output
       id_col  val_col  offset
    0       0      0.1     0.1
    1       0      0.2     0.1
    2       0      0.3     0.1
    3       1      0.6     0.4
    4       1      0.4     0.4
    5       1      0.5     0.4
    6       2      0.2     0.0
    7       2      0.1     0.0
    8       2      0.0     0.0

For more context: in my real data, the id_col column has about 30,000 or more unique values, which means the DataFrame gets sliced 30,000 times. I suspect this is the bottleneck.
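For anyone who wants to reproduce the problem at scale, here is a minimal sketch that builds a synthetic frame of roughly that shape; the 80,000-row / 30,000-id sizes and the random values are assumptions chosen only to mimic the description above:

    import numpy as np
    import pandas as pd

    # hypothetical test data: ~80,000 rows spread over up to 30,000 distinct ids
    n_rows, n_ids = 80000, 30000
    rng = np.random.RandomState(0)
    big_df = pd.DataFrame({"id_col": rng.randint(0, n_ids, n_rows),
                           "val_col": rng.rand(n_rows)})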

1 answer

Call groupby on 'id_col' and then transform, passing the function 'min'. This returns a Series aligned to the original df, so you can assign it directly as a new column:

    In [13]:
    df = pd.DataFrame({"id_col":[0, 0, 0, 1, 1, 1, 2, 2, 2],
                       "val_col":[0.1, 0.2, 0.3, 0.6, 0.4, 0.5, 0.2, 0.1, 0.0]})
    df['offset'] = df.groupby('id_col').transform('min')
    df

    Out[13]:
       id_col  val_col  offset
    0       0      0.1     0.1
    1       0      0.2     0.1
    2       0      0.3     0.1
    3       1      0.6     0.4
    4       1      0.4     0.4
    5       1      0.5     0.4
    6       2      0.2     0.0
    7       2      0.1     0.0
    8       2      0.0     0.0
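Note that calling transform('min') on the whole GroupBy applies it to every non-grouping column; that works here only because val_col is the single value column. With more columns, it is safer to select the target column explicitly, as in this small variant:

    # equivalent, but explicit about which column feeds the minimum
    df['offset'] = df.groupby('id_col')['val_col'].transform('min')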

Timings

    In [15]:
    def apply_by_id_value(df, id_col="id_col", val_col="val_col", offset_col="offset", f=min):
        for rid in set(df[id_col].values):
            df.loc[df[id_col] == rid, offset_col] = f(df[df[id_col] == rid][val_col])
        return df

    %timeit apply_by_id_value(df)
    %timeit df.groupby('id_col').transform('min')

    100 loops, best of 3: 8.12 ms per loop
    100 loops, best of 3: 5.99 ms per loop

So groupby plus transform is already faster on this tiny dataset, and I expect the gap to be significantly larger on your real dataset, as it scales much better.

For an 80,000-row df, I get the following timings:

    1 loops, best of 3: 611 ms per loop
    1 loops, best of 3: 438 ms per loop
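If even that is too slow, another common pattern (not benchmarked here, so treat it as an untested sketch) is to compute one minimum per group and broadcast it back onto the id column with map:

    # compute each group's minimum once, then look it up per row via the group index
    group_min = df.groupby('id_col')['val_col'].min()
    df['offset'] = df['id_col'].map(group_min)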
