I have a large data frame (40M rows), and I want to keep only the rows whose value in a single column belongs to a group that satisfies a condition computed with groupby.
For example, here is some random data. In the real data, the "letter" column has thousands of unique values:
     x   y   z letter
0   47  86  30      e
1   58   9  28      b
2   96  59  42      a
3   79   6  45      e
4   77  80  37      d
5   66  91  35      d
6   96  31  52      d
7   56   8  26      e
8   78  96  14      a
9   22  60  13      e
10  75  82   9      d
11   5  54  29      c
12  83  31  40      e
13  37  70   2      c
14  53  67  66      a
15  76  33  78      d
16  64  67  81      b
17  23  94   1      d
18  10   1  31      e
19  52  11   3      d
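For anyone who wants to reproduce this, the sample data above can be rebuilt with a snippet along these lines. It is only a minimal sketch: the values are copied from the table, and the name df is assumed to match the calls used in the rest of the question.

```python
import pandas as pd

# Values copied row by row from the sample table (index 0..19).
df = pd.DataFrame({
    "x": [47, 58, 96, 79, 77, 66, 96, 56, 78, 22,
          75, 5, 83, 37, 53, 76, 64, 23, 10, 52],
    "y": [86, 9, 59, 6, 80, 91, 31, 8, 96, 60,
          82, 54, 31, 70, 67, 33, 67, 94, 1, 11],
    "z": [30, 28, 42, 45, 37, 35, 52, 26, 14, 13,
          9, 29, 40, 2, 66, 78, 81, 1, 31, 3],
    # One letter per row, in the same order as the table.
    "letter": list("ebaedddeaedcecadbded"),
})

# Per-letter sums of x, as used later in the question.
sums = df.groupby("letter")["x"].sum()
```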
I group by the letter column and get the sum of the x column for each letter:
df.groupby('letter').x.sum()
>>>
a    227
b    122
c     42
d    465
e    297
Then I sort to see which letters have the largest sums and pick a threshold manually. In this example, the threshold might be 200.
df.groupby('letter').x.sum().reset_index().sort_values('x', ascending=False)
>>>
  letter    x
3      d  465
4      e  297
0      a  227
1      b  122
2      c   42
Here is where I am stuck. In the original data frame, I want to keep the rows whose letter has a group sum of column 'x' greater than 200, and discard the remaining rows. So in this example, it should keep all rows with d, e, or a.
I tried something like this, but it does not work:
df.groupby('letter').x.sum().filter(lambda x: len(x) > 200)
And even if I filter the groupby object, how can I use it to filter the original frame?
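One possible way to do this is sketched below, under a few assumptions: the frame is called df, the threshold is 200, and only the x and letter columns are needed for the example. The names threshold, group_sum, keep, filtered and filtered_isin are illustrative. The idea is that groupby(...).transform('sum') returns a Series aligned with the original rows, so it can be used directly as a boolean mask. A second variant computes the sums once and filters with isin. As far as I can tell, the attempt above fails because .sum() already returns a plain Series, whose filter method selects by index label rather than by a predicate, and len(x) would measure size rather than the sum.

```python
import pandas as pd

# Trimmed copy of the sample data: only the two columns the filter needs.
df = pd.DataFrame({
    "x": [47, 58, 96, 79, 77, 66, 96, 56, 78, 22,
          75, 5, 83, 37, 53, 76, 64, 23, 10, 52],
    "letter": list("ebaedddeaedcecadbded"),
})

threshold = 200  # picked by hand, as in the example

# Variant 1: transform broadcasts each group's sum back onto its rows,
# giving a Series with the same index as df.
group_sum = df.groupby("letter")["x"].transform("sum")
filtered = df[group_sum > threshold]

# Variant 2: compute the per-letter sums once, keep the qualifying
# letters, then select the matching rows with isin.
sums = df.groupby("letter")["x"].sum()
keep = sums[sums > threshold].index
filtered_isin = df[df["letter"].isin(keep)]
```

Both variants should select the same rows. df.groupby('letter').filter(lambda g: g['x'].sum() > 200) is another option, but it calls a Python function once per group, which may be slow with thousands of letters and 40M rows.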