Is pandas computing the wrong percentile?

I am working with this WNBA dataset. I am analyzing the variable Height, and below is a table showing the frequency, cumulative percentage, and cumulative frequency for each recorded height value:

[image: frequency table for Height]

From the table, I can easily conclude that the first quartile (25th percentile) cannot be greater than 175.

However, when I use Series.describe(), it reports that the 25th percentile is 176.5. Why is this so?

wnba.Height.describe()
count    143.000000
mean     184.566434
std        8.685068
min      165.000000
25%      176.500000
50%      185.000000
75%      191.000000
max      206.000000
Name: Height, dtype: float64
+4
3 answers

There are various ways to estimate quantiles.
The difference between 175.0 and 176.5 comes from two different methods:

  • Includes Q1 (this gives 176.5), and
  • Excludes Q1 (this gives 175.0).

The estimates differ as follows:

#1
h = (N - 1)*p + 1 # p being 0.25 in your case
Est_Quantile = x[⌊h⌋] + (h - ⌊h⌋)*(x[⌊h⌋ + 1] - x[⌊h⌋])

#2
h = (N + 1)*p
Est_Quantile = x[⌊h⌋] + (h - ⌊h⌋)*(x[⌊h⌋ + 1] - x[⌊h⌋])
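As a rough illustration, the two position formulas above can be sketched in plain Python. The sample heights are made up for the example and are not the WNBA data; the helper names are hypothetical. The standard library's statistics.quantiles (Python 3.8+) offers 'inclusive' and 'exclusive' methods that follow the same two position rules, so it is used here as a cross-check.

```python
import math
import statistics


def quantile_by_position(sorted_values, h):
    """Linearly interpolate at the 1-based fractional position h."""
    n = len(sorted_values)
    h = min(max(h, 1), n)            # keep the position inside the data
    lo = math.floor(h)               # index of the lower neighbour (1-based)
    hi = min(lo + 1, n)              # index of the upper neighbour (1-based)
    x_lo = sorted_values[lo - 1]
    x_hi = sorted_values[hi - 1]
    return x_lo + (h - lo) * (x_hi - x_lo)


def quantile_inclusive(values, p):
    """Method #1: h = (N - 1) * p + 1."""
    xs = sorted(values)
    return quantile_by_position(xs, (len(xs) - 1) * p + 1)


def quantile_exclusive(values, p):
    """Method #2: h = (N + 1) * p."""
    xs = sorted(values)
    return quantile_by_position(xs, (len(xs) + 1) * p)


# Made-up sample, NOT the WNBA heights from the question.
heights = [165, 170, 175, 178, 180, 185, 190]

q1_inclusive = quantile_inclusive(heights, 0.25)
q1_exclusive = quantile_exclusive(heights, 0.25)

# Cross-check against the standard library (first cut point is Q1).
q1_std_inclusive = statistics.quantiles(heights, n=4, method="inclusive")[0]
q1_std_exclusive = statistics.quantiles(heights, n=4, method="exclusive")[0]

print(q1_inclusive, q1_exclusive)
```

On this sample, method #1 lands halfway between the second and third values, while method #2 lands exactly on the second value, which mirrors the kind of gap seen between 176.5 and 175.0.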
+4

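There are many definitions of a percentile. Here is one explanation for adding 1 when computing the index of the 25th percentile:

One intuitive answer is that the average of the numbers 1 through n is not n/2 but (n + 1)/2. This suggests that simply using p * n as the position would give values that are slightly too small.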

+1

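By default, describe() uses linear interpolation.

So pandas is not showing you a wrong percentile
(it is just not the one you expected).

To get the value you expect, use .quantile() on Height with the interpolation set to 'lower':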

import pandas as pd
df = pd.read_csv('../input/WNBA Stats.csv')
df.Height.quantile(0.25, interpolation='lower')  # 'lower' returns the lower neighbouring data point, which is the value you expect
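To see how the interpolation argument changes the result, here is a small sketch on a made-up Series (not the WNBA data). It assumes a reasonably recent pandas in which Series.quantile accepts interpolation values such as 'linear', 'lower' and 'higher', and in which describe() reports quartiles using the default linear interpolation.

```python
import pandas as pd

# Made-up sample, NOT the WNBA heights from the question.
heights = pd.Series([165, 170, 175, 178, 180, 185, 190], name="Height")

# Default behaviour: linear interpolation between the two neighbours.
linear_q1 = heights.quantile(0.25)

# Pick one of the two neighbouring data points instead of interpolating.
lower_q1 = heights.quantile(0.25, interpolation="lower")
higher_q1 = heights.quantile(0.25, interpolation="higher")

# describe() reports the same value as the default quantile() call.
described_q1 = heights.describe()["25%"]

print(linear_q1, lower_q1, higher_q1, described_q1)
```

With 'lower', the result is always a value that actually occurs in the data, which is why it matches a reading taken straight from a cumulative frequency table.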



Note that, as @jpp said:

There are many percentile definitions.

See also this answer, which discusses the differences between how numpy and pandas calculate percentiles.

0

Source: https://habr.com/ru/post/1694218/

