I have a df that needs to be grouped, filtered, modified, and run through a custom function. My "normal" approach works, but it is slow and not very elegant!
[name] [cnt] [num]    [place] [y]
AAAA   12    20182.0  5.0     1.75
BBBB   12    20182.0  7.0     2.00
AAAA   10    20381.0  10.0    9.25
BBBB   10    20381.0  12.0    18.75
EEEE   12    21335.0  1.0     0.00
RRRR   12    21335.0  8.0     3.00
CCCC   12    21335.0  9.0     3.50
I need to group the df by [num], i.e.:
[name] [cnt] [num]    [place] [y]
AAAA   12    20182.0  5.0     1.75
BBBB   12    20182.0  7.0     2.00
For each of these groups, I need to complete three tasks:
I. Filter out all rows within the same group that have the same [y] value. Groups can consist of up to 6 values.
II. Create all possible ordered pairs (length-two permutations) of [place]: (5, 7) and (7, 5)
III. Apply a custom function to each subset:
def func(p1, p2):
    diff_p = p2 - p1
    if diff_p > 0:
        return 2 / (diff_p * p2)
    else:
        return p1 / (diff_p * 12)
Where p1 is the first place of the tuple, p2 is the second place of the tuple, and 12 is the value from the [cnt] column. That gives, for the example group:
[name] [cnt] [num]    [place] [y]   [desired]
AAAA   12    20182.0  5.0     1.75  0.1428571429
BBBB   12    20182.0  7.0     2.00  -0.2916666667
The [desired] value for AAAA is the average of the custom function's results over all pairs in which AAAA's [place] value is the first element of the tuple. In this example there is only one such tuple.
(But, as already mentioned, groups can consist of up to 6 values, which creates several tuples where AAAA's place is the first element.)
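To make the three steps concrete, here is a minimal sketch of them on the 20182.0 group (my reconstruction; I pass the row's [cnt] to `func` as an explicit argument instead of hard-coding 12):

```python
import itertools

import pandas as pd

def func(p1, p2, cnt):
    diff_p = p2 - p1
    if diff_p > 0:
        return 2 / (diff_p * p2)
    return p1 / (diff_p * cnt)

# the num == 20182.0 group from the example above
grp = pd.DataFrame({
    "name": ["AAAA", "BBBB"],
    "cnt": [12, 12],
    "place": [5.0, 7.0],
    "y": [1.75, 2.00],
})

# I.  drop every row whose [y] value occurs more than once in the group
grp = grp.drop_duplicates("y", keep=False)

# II. all ordered pairs of [place]
pairs = list(itertools.permutations(grp["place"], 2))  # [(5.0, 7.0), (7.0, 5.0)]

# III. per row: average func() over the pairs that start with this row's place
desired = {}
for p1, cnt in zip(grp["place"], grp["cnt"]):
    vals = [func(a, b, cnt) for a, b in pairs if a == p1]
    desired[p1] = sum(vals) / len(vals)

# desired[5.0] ≈ 0.1428571429, desired[7.0] ≈ -0.2916666667
```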
My current approach is to do
df.groupby('num').apply(...)
Inside the apply:

1. .drop_duplicates('y', keep=False)
2. list(itertools.permutations(df_grp.place.values, 2))
3. apply the custom function
4. .mean()
This gets very slow after a while, since this df is itself the result of another .groupby().apply() call.
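For reference, this is a self-contained sketch of my current per-group approach on the full example data (again my reconstruction, with `func` taking [cnt] as a parameter; `group_keys=False` just keeps the original flat index):

```python
import itertools

import pandas as pd

def func(p1, p2, cnt):
    diff_p = p2 - p1
    if diff_p > 0:
        return 2 / (diff_p * p2)
    return p1 / (diff_p * cnt)

df = pd.DataFrame({
    "name":  ["AAAA", "BBBB", "AAAA", "BBBB", "EEEE", "RRRR", "CCCC"],
    "cnt":   [12, 12, 10, 10, 12, 12, 12],
    "num":   [20182.0, 20182.0, 20381.0, 20381.0, 21335.0, 21335.0, 21335.0],
    "place": [5.0, 7.0, 10.0, 12.0, 1.0, 8.0, 9.0],
    "y":     [1.75, 2.00, 9.25, 18.75, 0.00, 3.00, 3.50],
})

def per_group(df_grp):
    # I.  remove rows with a duplicated [y] value inside the group
    df_grp = df_grp.drop_duplicates("y", keep=False).copy()
    # II. all ordered pairs of [place]
    pairs = list(itertools.permutations(df_grp["place"].values, 2))
    # III. mean of func() over the pairs starting with each row's place
    df_grp["desired"] = [
        pd.Series([func(a, b, cnt) for a, b in pairs if a == p1]).mean()
        for p1, cnt in zip(df_grp["place"], df_grp["cnt"])
    ]
    return df_grp

out = df.groupby("num", group_keys=False).apply(per_group)
```

This reproduces the [desired] values shown above for the 20182.0 group, but scales badly once it runs inside another .groupby().apply().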