Pandas filter with groupby results

I have a large data frame (40M rows), and I want to keep only the rows whose group, from a groupby on a single column, satisfies a condition.

For example, here are some random data. The "letter" column will have thousands of unique values:

     x   y   z letter
 0  47  86  30      e
 1  58   9  28      b
 2  96  59  42      a
 3  79   6  45      e
 4  77  80  37      d
 5  66  91  35      d
 6  96  31  52      d
 7  56   8  26      e
 8  78  96  14      a
 9  22  60  13      e
 10 75  82   9      d
 11  5  54  29      c
 12 83  31  40      e
 13 37  70   2      c
 14 53  67  66      a
 15 76  33  78      d
 16 64  67  81      b
 17 23  94   1      d
 18 10   1  31      e
 19 52  11   3      d

Group on the letter column and get the sum of the x column for each letter:

 >>> df.groupby('letter').x.sum()
 letter
 a    227
 b    122
 c     42
 d    465
 e    297
 Name: x, dtype: int64

Then I sort to see which letters have the largest sums and manually pick a threshold. In this example, the threshold might be 200.

 >>> df.groupby('letter').x.sum().reset_index().sort_values('x', ascending=False)
   letter    x
 3      d  465
 4      e  297
 0      a  227
 1      b  122
 2      c   42

Here is where I am stuck. In the original data frame, I want to keep the rows whose letter's group sum of 'x' is > 200, and discard the rest. In this example, that means keeping all rows with letter d, e, or a.

I tried something like this, but it does not work:

 df.groupby('letter').x.sum().filter(lambda x: len(x) > 200) 

And even if I filter the groupby object, how can I use it to filter the original frame?

1 answer

You can use groupby transform to compute the group sum of x for each row, producing a boolean series from the condition with which you can subset the original frame:

 df1 = df[df.x.groupby(df.letter).transform('sum') > 200]
 df1.letter.unique()
 # array(['e', 'a', 'd'], dtype=object)
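As a side note on why this works (a minimal sketch with made-up values, not from the original answer): transform('sum') returns a series aligned to the original index, one group sum per row, so comparing it to the threshold yields a row-level boolean mask:

```python
import pandas as pd

# Toy frame: three 'a' rows summing to 227, two 'c' rows summing to 42
df = pd.DataFrame({'x': [96, 78, 53, 5, 37],
                   'letter': ['a', 'a', 'a', 'c', 'c']})

# transform('sum') broadcasts each group's sum back onto every row
group_sums = df.x.groupby(df.letter).transform('sum')
# group_sums is [227, 227, 227, 42, 42], indexed like df

mask = group_sums > 200   # row-level boolean mask
df1 = df[mask]            # keeps only the 'a' rows
```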

Another option is to use groupby.filter:

 df2 = df.groupby('letter').filter(lambda g: g.x.sum() > 200)
 df2.letter.unique()
 # array(['e', 'a', 'd'], dtype=object)
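A third option (my own sketch, not part of the original answer): since the per-letter sums were already computed when choosing the threshold, you can select the qualifying letters once and filter with isin, which avoids invoking a Python lambda per group and may be faster on a 40M-row frame:

```python
import pandas as pd

# The x and letter columns from the question's example data
df = pd.DataFrame({
    'x': [47, 58, 96, 79, 77, 66, 96, 56, 78, 22,
          75, 5, 83, 37, 53, 76, 64, 23, 10, 52],
    'letter': list('ebaedddeaedcecadbded'),
})

# Compute the per-letter sums once, keep the letters over the threshold
sums = df.groupby('letter').x.sum()
keep = sums.index[sums > 200]     # Index(['a', 'd', 'e'])
df3 = df[df.letter.isin(keep)]

sorted(df3.letter.unique())       # ['a', 'd', 'e']
```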

Source: https://habr.com/ru/post/1263608/

