Filter strings after groupby pandas

Question

Filter strings after groupby pandas

I have a table in pandas

import pandas as pd
df = pd.DataFrame({
    'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
    'pidx':[10,10,300,10,30,40,20,10,30,45,20],
    'pidy':[20,20,400,20,15,20,12,43,54,112,23],
    'count':[10,20,30,40,80,10,20,50,30,10,70],
    'score':[10,10,10,22,22,3,4,5,9,0,1]
})

LeafID count pidx pidy score 0 1 10 10 20 10 1 1 20 10 20 10 2 2 30 300 400 10 3 1 40 10 20 22 4 3 80 30 15 22 5 3 10 40 20 3 6 1 20 20 12 4 7 6 50 10 43 5 8 3 30 20 54 9 9 5 10 45 112 0 10 1 70 20 23 1

I want to make a group and then filter the lines where the appearance of pidx is greater than 2.

That is, rows of filters where pidx is 10 and 20.

I tried using df.groupby ('pidx'). count ()

but it didn’t help me.

Also for these lines I have to do 0.4 * count + 0.6 * score

Required conclusion

LeafID count pidx pidy final_score 1 10 10 20 1 20 10 20 1 40 10 20 6 50 10 43 1 20 20 12 3 30 20 54 1 70 20 23

+6

python pandas

Shubham Jan 24 '17 at 6:27

source share

4 answers

groupby. , , 20 pidx , .

df.groupby('pidx').filter(lambda x: len(x) > 2)

   LeafID  count  pidx  pidy
0       1     10    10    20
1       1     20    10    20
3       1     40    10    20
7       6     50    10    43

+6

Ted Petrou 24 . '17 6:34

pandas

df[df.groupby('pidx').pidx.transform('count') > 2]


   LeafID  count  pidx  pidy  score
0       1     10    10    20     10
1       1     20    10    20     10
3       1     40    10    20     22
7       6     50    10    43      5

+1

piRSquared Jan 24 '17 at 6:33

source share

First of all, your conclusion shows that you do not want to do groupby. Read what the group is doing. What you need:

df2 = df[df['pidx']<=20]
df2.sort_index(by = 'pidx')

This will give you an accurate result. Read the indexing and pandas features. Actually go and read the whole introduction on pandas. It does not take a lot of time.

String operations are also simple using indexing:

df2['final_score']= 0.4*df2['count'] + 0.6*df2['score']

0

V shreyas Jan 24 '17 at 6:48

source share

jezrael · Accepted Answer · 2017-01-24T06:33:50+0000

You can use value_countswith boolean indexingand isin:

df = pd.DataFrame({
    'LeafID':[1,1,2,1,3,3,1,6,3,5,1],
    'pidx':[10,10,300,10,30,40,20,10,30,45,20],
    'pidy':[20,20,400,20,15,20,12,43,54,112,23],
    'count':[10,20,30,40,80,10,20,50,30,10,70],
    'score':[10,10,10,22,22,3,4,5,9,0,1]
})
print (df)
    LeafID  count  pidx  pidy  score
0        1     10    10    20     10
1        1     20    10    20     10
2        2     30   300   400     10
3        1     40    10    20     22
4        3     80    30    15     22
5        3     10    40    20      3
6        1     20    20    12      4
7        6     50    10    43      5
8        3     30    30    54      9
9        5     10    45   112      0
10       1     70    20    23      1

s = df.pidx.value_counts()
idx = s[s>2].index
print (df[df.pidx.isin(idx)])
   LeafID  count  pidx  pidy  score
0       1     10    10    20     10
1       1     20    10    20     10
3       1     40    10    20     22
7       6     50    10    43      5

Delay

np.random.seed(123)
N = 1000000


L1 = list('abcdefghijklmnopqrstu')
L2 = list('efghijklmnopqrstuvwxyz')
df = pd.DataFrame({'LeafId':np.random.randint(1000, size=N),
                   'pidx': np.random.randint(10000, size=N),
                   'pidy': np.random.choice(L2, N),
                   'count':np.random.randint(1000, size=N)})
print (df)


print (df.groupby('pidx').filter(lambda x: len(x) > 120))

def jez(df):
    s = df.pidx.value_counts()
    return df[df.pidx.isin(s[s>120].index)]

print (jez(df))

In [55]: %timeit (df.groupby('pidx').filter(lambda x: len(x) > 120))
1 loop, best of 3: 1.17 s per loop

In [56]: %timeit (jez(df))
10 loops, best of 3: 141 ms per loop

In [62]: %timeit (df[df.groupby('pidx').pidx.transform('size') > 120])
10 loops, best of 3: 102 ms per loop

In [63]: %timeit (df[df.groupby('pidx').pidx.transform(len) > 120])
1 loop, best of 3: 685 ms per loop

In [64]: %timeit (df[df.groupby('pidx').pidx.transform('count') > 120])
10 loops, best of 3: 104 ms per loop

For final_scoreyou can use:

df['final_score'] = df['count'].mul(.4).add(df.score.mul(.6))

Filter strings after groupby pandas

More articles: