Pandas data filter calculation

I have the following data file:

   student_id  gender      major  admitted
0       35377  female  Chemistry     False
1       56105    male    Physics      True
2       31441  female  Chemistry     False
3       51765    male    Physics      True
4       53714  female    Physics      True
5       50693  female  Chemistry     False
6       25946    male    Physics      True
7       27648  female  Chemistry      True
8       55247    male    Physics     False
9       35838    male    Physics      True

How can I calculate the admission rate for female Physics applicants?

+4
4 answers
import numpy as np
np.average(dat['admitted'][(dat['gender']=='female') & (dat['major']=='Physics')].values)

How it works: (dat['gender']=='female') & (dat['major']=='Physics') creates a boolean pandas Series that can be used to select the matching entries from the Series dat['admitted']. The .values attribute extracts these entries into a NumPy array. Finally, we take the average of these elements, which gives us the admission rate.
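The masking step described above can be sketched on the ten rows from the question. The DataFrame below is rebuilt by hand from the posted table, and the names mask, selected and rate are only illustrative:

```python
import pandas as pd

# The ten rows from the question, typed in by hand.
dat = pd.DataFrame({
    "student_id": [35377, 56105, 31441, 51765, 53714,
                   50693, 25946, 27648, 55247, 35838],
    "gender": ["female", "male", "female", "male", "female",
               "female", "male", "female", "male", "male"],
    "major": ["Chemistry", "Physics", "Chemistry", "Physics", "Physics",
              "Chemistry", "Physics", "Chemistry", "Physics", "Physics"],
    "admitted": [False, True, False, True, True,
                 False, True, True, False, True],
})

# Boolean Series: True where the row is a female Physics applicant.
mask = (dat["gender"] == "female") & (dat["major"] == "Physics")

# Keep only the 'admitted' values of the matching rows.
selected = dat.loc[mask, "admitted"]

# Fraction of the selected applicants who were admitted.
rate = selected.mean()
print(mask.sum(), rate)
```

In this sample only one row (index 4) matches, so the rate rests on a single applicant.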

+1

I think this will do it:

df_f = df[(df['gender']=='female') & (df['major']=='Physics')]
df_f['admitted'].mean()

The first line filters for female and Physics. Then we calculate the mean.

mean works here because Python booleans behave as 0 and 1, so the average (mean) is the share of female Physics applicants who were admitted.
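Why mean gives a rate on a boolean column can be checked with a tiny made-up Series (the values are invented for illustration and are not taken from the question):

```python
import pandas as pd

# Hypothetical admission flags for four applicants.
admitted = pd.Series([True, False, True, True])

# True counts as 1 and False as 0, so sum() counts the admitted applicants.
n_admitted = admitted.sum()

# mean() is therefore admitted / total.
rate = admitted.mean()
print(n_admitted, len(admitted), rate)
```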

+1
import numpy as np
import pandas as pd
df = pd.DataFrame({"gender":np.random.choice(["male","female"],[20]),
                   "admitted":np.random.choice([True,False],[20]),
                   "major":np.random.choice(["Chemistry","Physics"],[20])})

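# Female Physics applicants who were admitted
phy_female_admitted = df.loc[(df["major"]=="Physics") & df["admitted"] & (df["gender"]=="female")]
# All female Physics applicants
phy_female_applied = df.loc[(df["major"]=="Physics") & (df["gender"]=="female")]

# Share of applicants admitted (raises ZeroDivisionError if nobody applied)
acceptance_rate = phy_female_admitted.shape[0]/phy_female_applied.shape[0]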

A somewhat longer answer, but it works essentially the same way as DZurico's.

Ignore the lines where I create the DataFrame and use your own data instead.
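With randomly generated data, the group of female Physics applicants can be empty, in which case the division above fails. A minimal sketch of a guarded version follows; the helper name acceptance_rate_for and the choice to return NaN for an empty group are assumptions, not part of the answer:

```python
import numpy as np
import pandas as pd

def acceptance_rate_for(df, major, gender):
    """Hypothetical helper: admitted / applied for one (major, gender) group."""
    applied = df[(df["major"] == major) & (df["gender"] == gender)]
    if applied.empty:
        # No applicants in this group: return NaN instead of dividing by zero.
        return np.nan
    return applied["admitted"].sum() / len(applied)

# Small made-up sample to exercise both branches.
sample = pd.DataFrame({
    "gender": ["female", "female", "male"],
    "major": ["Physics", "Physics", "Physics"],
    "admitted": [True, False, True],
})

print(acceptance_rate_for(sample, "Physics", "female"))
print(acceptance_rate_for(sample, "Chemistry", "female"))
```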

0

A solution that computes all the rates at once, using groupby with GroupBy.size and GroupBy.transform with sum:

a = df.groupby(['gender' ,'admitted', 'major']).size()
print (a)
gender  admitted  major    
female  False     Chemistry    3
        True      Chemistry    1
                  Physics      1
male    False     Physics      1
        True      Physics      4
dtype: int64

b = a.groupby(['gender' ,'major']).transform('sum')
print (b)
gender  admitted  major    
female  False     Chemistry    4
        True      Chemistry    4
                  Physics      1
male    False     Physics      5
        True      Physics      5
dtype: int64

c = a.div(b)
print (c)
gender  admitted  major    
female  False     Chemistry    0.75
        True      Chemistry    0.25
                  Physics      1.00
male    False     Physics      0.20
        True      Physics      0.80
dtype: float64

Select the row you want from c by its index tuple:

print (c.loc[('female',True,'Physics')])
1.0

If you want all the values in a DataFrame:

d = a.div(b).reset_index(name='rates')
print (d)
   gender  admitted      major  rates
0  female     False  Chemistry   0.75
1  female      True  Chemistry   0.25
2  female      True    Physics   1.00
3    male     False    Physics   0.20
4    male      True    Physics   0.80
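If only the admission rates are needed (the admitted == True rows above), a shorter route that should give the same numbers is to group by gender and major and take the mean of the boolean column. This is a sketch of an alternative, not the method used in this answer; the DataFrame is rebuilt by hand from the question's table:

```python
import pandas as pd

# The question's data, limited to the columns that matter here.
df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female",
               "female", "male", "female", "male", "male"],
    "major": ["Chemistry", "Physics", "Chemistry", "Physics", "Physics",
              "Chemistry", "Physics", "Chemistry", "Physics", "Physics"],
    "admitted": [False, True, False, True, True,
                 False, True, True, False, True],
})

# Mean of a boolean column per group = share of admitted applicants per group.
rates = df.groupby(["gender", "major"])["admitted"].mean()
print(rates)

# Look up a single group by its (gender, major) index tuple.
female_physics = rates.loc[("female", "Physics")]
print(female_physics)
```

Groups with no rows (here, male Chemistry) do not appear in the result at all.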
0

Source: https://habr.com/ru/post/1695635/

