I have a “reference population” (say, v = np.random.rand(100)), and I want to calculate the percentile rank of each value in a given set (for example, np.array([0.3, 0.5, 0.7])).
Easy to calculate one by one:
def percentile_rank(x):
    return (v < x).sum() / len(v)
percentile_rank(0.4)
=> 0.4
(Actually, there is the out-of-the-box scipy.stats.percentileofscore, but it does not work on vectors.)
np.vectorize(percentile_rank)(np.array([0.3, 0.5, 0.7]))
=> [ 0.33 0.48 0.71]
This gives the expected results, but I feel that there must be a built-in for this.
I can also cheat with pandas:
pd.concat([pd.Series([0.3, 0.5, 0.7]),pd.Series(v)],ignore_index=True).rank(pct=True).loc[0:2]
0 0.330097
1 0.485437
2 0.718447
This is bad for two reasons:
- I do not want the test data [0.3, 0.5, 0.7] to be part of the ranking.
- I do not want to waste time calculating the ranks for a reference population.
So what is the idiomatic way to achieve this?
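For concreteness, here is a minimal sketch of one approach that would satisfy both requirements, assuming the strict "less than" definition used in percentile_rank above: sort the reference population once, then look up every test value with np.searchsorted. The two-argument percentile_rank signature, the v_sorted name, and the seeded generator are illustrative assumptions, not part of the original code.

```python
import numpy as np

# Seeded generator used only so the sketch is reproducible (an assumption;
# the question uses np.random.rand(100)).
rng = np.random.default_rng(0)
v = rng.random(100)                # reference population
x = np.array([0.3, 0.5, 0.7])      # test values to rank against v


def percentile_rank(x, v_sorted):
    """Fraction of reference values strictly less than each entry of x."""
    # side="left" returns the number of elements strictly less than x,
    # which matches (v < x).sum() from the one-by-one version.
    return np.searchsorted(v_sorted, x, side="left") / len(v_sorted)


v_sorted = np.sort(v)              # sort the reference once, reuse for all queries
ranks = percentile_rank(x, v_sorted)

# Cross-check against the one-by-one definition from the question.
expected = np.array([(v < xi).sum() / len(v) for xi in x])
assert np.allclose(ranks, expected)
```

The test values never enter the reference array, and the reference is only sorted, not ranked, so each lookup is a binary search.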