Filter out groups that do not have enough rows that satisfy the condition

I have the following pandas DataFrame.

import pandas as pd

# Initialize dataframe
df1 = pd.DataFrame(columns=['bar', 'foo'])
df1['bar'] = ['001', '001', '001', '001', '002', '002', '003', '003', '003']
df1['foo'] = [-4, -3, 2, 3, -3, -2, 0, 1, 2]
>>> print(df1)
   bar  foo
0  001   -4
1  001   -3
2  001    2
3  001    3
4  002   -3
5  002   -2
6  003    0
7  003    1
8  003    2

Consider the following threshold and parameters.

# Provide threshold and number of entries above and below threshold
threshold = 0
n_below = 2
n_above = 2

I would like to create a DataFrame that filters out certain values of bar. A bar should be filtered out if it does not have at least n_below values of foo less than threshold and at least n_above values of foo greater than threshold.

In the above example:

  • The group bar = 001 will not be filtered out, since bar = 001 has at least n_below = 2 entries with foo less than threshold = 0 and at least n_above = 2 entries with foo greater than threshold = 0.
  • The group bar = 002 will be filtered out, because bar = 002 does not have at least n_above = 2 entries with foo greater than threshold = 0.
  • The group bar = 003 will be filtered out, because bar = 003 does not have at least n_below = 2 entries with foo less than threshold = 0.
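
The reasoning in the list above can be checked by counting, per group, how many foo values fall strictly below and strictly above the threshold. This is only a minimal sketch that rebuilds the sample data from the question; the names below, above and keep are introduced here purely for illustration.

```python
import pandas as pd

# Sample data and parameters from the question
df1 = pd.DataFrame({
    'bar': ['001', '001', '001', '001', '002', '002', '003', '003', '003'],
    'foo': [-4, -3, 2, 3, -3, -2, 0, 1, 2],
})
threshold = 0
n_below = 2
n_above = 2

# Summing a boolean Series counts its True values, so these are the
# per-group counts of rows strictly below / strictly above the threshold.
below = (df1['foo'] < threshold).groupby(df1['bar']).sum()
above = (df1['foo'] > threshold).groupby(df1['bar']).sum()

# Illustrative name: a group is kept only if it meets both conditions.
keep = (below >= n_below) & (above >= n_above)

print(pd.DataFrame({'below': below, 'above': above, 'keep': keep}))
```

For the sample data only bar = 001 meets both conditions. Note that foo = 0 in group 003 counts as neither below nor above, because both comparisons are strict.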

The desired output is:

# Desired output
   bar  foo
0  001   -4
1  001   -3
2  001    2
3  001    3

I believe this can be done with GroupBy and .count(), perhaps in two steps: 1) filter on the n_below condition; 2) then filter on the n_above condition.
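
The two-step idea with GroupBy and .count() mentioned above (first the n_below condition, then the n_above condition) could look roughly like the sketch below. It assumes n_below and n_above are both at least 1, since groups with no matching rows never appear in the counts. The intermediate names (below_counts, bars_below, above_counts, bars_above, result) are hypothetical.

```python
import pandas as pd

# Sample data and parameters from the question
df1 = pd.DataFrame({
    'bar': ['001', '001', '001', '001', '002', '002', '003', '003', '003'],
    'foo': [-4, -3, 2, 3, -3, -2, 0, 1, 2],
})
threshold = 0
n_below = 2
n_above = 2

# Step 1: bars with at least n_below values of foo below the threshold.
# Assumption: n_below >= 1 (a bar with zero matching rows is absent here).
below_counts = df1[df1['foo'] < threshold].groupby('bar')['foo'].count()
bars_below = below_counts[below_counts >= n_below].index

# Step 2: bars with at least n_above values of foo above the threshold.
# Assumption: n_above >= 1, for the same reason.
above_counts = df1[df1['foo'] > threshold].groupby('bar')['foo'].count()
bars_above = above_counts[above_counts >= n_above].index

# Keep only the rows whose bar passed both steps.
result = df1[df1['bar'].isin(bars_below) & df1['bar'].isin(bars_above)]
print(result)
```

This keeps the original row index of df1, so no set_index / reset_index round trip is needed.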


You can use groupby together with filter.

threshold = 0
n_below = 2
n_above = 2
def filter_function(g):
    '''Called by filter, g is the grouped dataframe'''
    l = g['foo']
    return  (sum([x < threshold for x in l]) >= n_below 
            and sum([x > threshold for x in l]) >= n_above)

df1.groupby('bar').filter(filter_function)

# gives
   bar  foo
0  001   -4
1  001   -3
2  001    2
3  001    3

See the pandas documentation on GroupBy filtration for details.


Another approach, using apply to build a boolean mask per group:

threshold = 0
n_below = 2
n_above = 2

# Build a boolean mask indexed by bar, then select the matching rows
(df1.set_index('bar')
    .loc[df1.groupby('bar')
            .apply(lambda df_sub:
                       (df_sub['foo'] < threshold).sum() >= n_below
                   and (df_sub['foo'] > threshold).sum() >= n_above)]
    .reset_index('bar'))

    bar foo
0   001 -4
1   001 -3
2   001 2
3   001 3
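
A similar option is to compute the boolean mask first and then index with it:

# True for each bar whose group meets both conditions
idx = df1.groupby('bar').apply(lambda x: (sum(x['foo'] < threshold) >= n_below) & (sum(x['foo'] > threshold) >= n_above))

# .loc aligns the per-group mask with the (repeated) bar index
print(df1.set_index('bar').loc[idx].reset_index())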

   bar  foo
0  001   -4
1  001   -3
2  001    2
3  001    3

Source: https://habr.com/ru/post/1669776/

