Lower and upper quartiles in a boxplot in R

I have

    X <- c(20, 18, 34, 45, 30, 51, 63, 52, 29, 36, 27, 24)

With boxplot, I tried to draw quantile(X, 0.25) and quantile(X, 0.75), but these are not really the same as the lower and upper quartiles of the box that boxplot draws in R:

    boxplot(X)
    abline(h = quantile(X, 0.25), col = "red", lty = 2)
    abline(h = quantile(X, 0.75), col = "red", lty = 2)

Do you know why?

2 answers

The edges of the box are called hinges. They may coincide with the quartiles (as calculated by quantile(x, c(0.25, 0.75))), but they are calculated differently.

From ?boxplot.stats :

The two "hinges" are versions of the first and third quartile, i.e., close to quantile(x, c(1,3)/4). The hinges equal the quartiles for odd n (where n <- length(x)) and differ for even n. Whereas the quartiles only equal observations for n %% 4 == 1 (n = 1 mod 4), the hinges do so additionally for n %% 4 == 2 (n = 2 mod 4), and are in the middle of two observations otherwise.
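As a rough illustration of the hinge rule quoted above, the hinges can be computed by hand. This is a minimal sketch: hinges() is a hypothetical helper written for this answer, not a base R function. It follows the depth rule used by fivenum(), and X is the vector from the question.

```r
# Hypothetical helper (not part of base R): lower and upper hinges,
# computed with the same depth rule that fivenum() uses.
hinges <- function(x) {
  xs <- sort(x)
  n <- length(xs)
  depth <- floor((n + 3) / 2) / 2        # position of the lower hinge
  pos <- c(depth, n + 1 - depth)         # lower and upper hinge positions
  # a fractional position means "average the two neighbouring observations"
  0.5 * (xs[floor(pos)] + xs[ceiling(pos)])
}

X <- c(20, 18, 34, 45, 30, 51, 63, 52, 29, 36, 27, 24)
hinges(X)              # 25.5 48.0
fivenum(X)[c(2, 4)]    # same values, taken from fivenum() itself
```

Here n = 12, so n %% 4 == 0 and both hinges fall halfway between two observations, which is why they differ from the default quantile() result.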

To see that the values coincide for an odd number of observations, try the following code:

    set.seed(1234)
    x <- rnorm(9)
    boxplot(x)
    abline(h = quantile(x, c(0.25, 0.75)), col = "red")


The discrepancy arises from ambiguity in the definition of quantiles. No single method is strictly correct or incorrect; there are simply different ways of estimating quantiles in situations (such as an even number of data points) where they do not coincide exactly with a particular data point and must be interpolated. Somewhat confusingly, boxplot and quantile (and other functions that provide summary statistics) use different default methods to calculate quantiles, although these defaults can be overridden with the type = argument of quantile.

We can see these differences more clearly in action by looking at some of the different ways of generating quantile statistics in R.

Both boxplot and fivenum give the same values:

    boxplot.stats(X)$stats
    # [1] 18.0 25.5 32.0 48.0 63.0
    fivenum(X)
    # [1] 18.0 25.5 32.0 48.0 63.0

In boxplot and fivenum, the lower (upper) quartile is equivalent to the median of the lower (upper) half of the data, where each half includes the median of the full data:

    c(median(X[X <= median(X)]), median(X[X >= median(X)]))
    # [1] 25.5 48.0

But quantile and summary do things differently:

    summary(X)
    #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    #   18.00   26.25   32.00   35.75   46.50   63.00
    quantile(X, c(0.25, 0.5, 0.75))
    #   25%   50%   75%
    # 26.25 32.00 46.50

The difference between these results and those from boxplot and fivenum comes down to how the functions interpolate between data points. quantile attempts to interpolate by estimating the shape of the cumulative distribution function. According to ?quantile:

quantile returns estimates of underlying distribution quantiles based on one or two order statistics from the supplied elements in x at probabilities in probs. One of the nine quantile algorithms discussed in Hyndman and Fan (1996), selected by type, is employed.
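To make "based on one or two order statistics" concrete, the default type 7 estimate of the lower quartile can be reproduced by hand. This is a minimal sketch for the question's X only. The variable names are illustrative, and the simple indexing assumes the fractional position does not land on the last observation.

```r
X <- c(20, 18, 34, 45, 30, 51, 63, 52, 29, 36, 27, 24)
xs <- sort(X)
n <- length(xs)
p <- 0.25

h <- (n - 1) * p + 1     # fractional position among the order statistics: 3.75
lo <- floor(h)           # lower neighbour: the 3rd smallest value (24)
# linear interpolation between the 3rd and 4th order statistics (24 and 27)
q_manual <- xs[lo] + (h - lo) * (xs[lo + 1] - xs[lo])

q_manual                                   # 26.25
quantile(X, p, type = 7, names = FALSE)    # 26.25, the default method
```

The hinge used by boxplot sits at position 3.5 instead of 3.75, which gives 25.5 rather than 26.25.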

The details of the nine different methods that quantile uses to estimate the distribution function of the data can be found in ?quantile, and are too lengthy to reproduce in full here. The important thing to note is that the nine methods are taken from Hyndman and Fan (1996), who recommended type 8. The default method used by quantile is type 7, for historical reasons of compatibility with S. We can see the quartile estimates provided by the different methods in quantile using:

    quantile_methods = data.frame(
      q25 = sapply(1:9, function(method) quantile(X, 0.25, type = method)),
      q50 = sapply(1:9, function(method) quantile(X, 0.50, type = method)),
      q75 = sapply(1:9, function(method) quantile(X, 0.75, type = method)))
    quantile_methods  # print the table; one row per type
    #       q25 q50    q75
    # 1 24.0000  30 45.000
    # 2 25.5000  32 48.000
    # 3 24.0000  30 45.000
    # 4 24.0000  30 45.000
    # 5 25.5000  32 48.000
    # 6 24.7500  32 49.500
    # 7 26.2500  32 46.500
    # 8 25.2500  32 48.500
    # 9 25.3125  32 48.375

Here, type = 5 gives the same quartile estimates as boxplot. However, when there is an odd number of data points, it is type = 7 that coincides with the boxplot statistics.
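Applied to the data in the question, which has an even number of points, this suggests drawing the reference lines with type = 5. A minimal sketch, assuming the same X as in the question:

```r
X <- c(20, 18, 34, 45, 30, 51, 63, 52, 29, 36, 27, 24)

# length(X) is 12 (even), so type = 5 reproduces the hinges of the box
q <- quantile(X, c(0.25, 0.75), type = 5, names = FALSE)
q    # 25.5 48.0

boxplot(X)
abline(h = q, col = "red", lty = 2)   # dashed lines now sit on the box edges
```

An alternative that avoids choosing a type at all is to read the hinges directly from boxplot.stats(X)$stats[c(2, 4)].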

We can demonstrate this odd/even rule by automatically choosing type 5 or 7 depending on whether there is an even or odd number of data points. The boxplots in the image below show the quantiles for data sets of 1 to 30 values, with boxplot and quantile giving the same values for both odd and even N:

    layout(matrix(1:30, 5, 6, byrow = TRUE), respect = TRUE)
    par(mar = c(0.2, 0.2, 0.2, 0.2), bty = "n", yaxt = "n", xaxt = "n")
    for (N in 1:30) {
      X = sample(100, N)
      boxplot(X)
      # even N picks type 5, odd N picks type 7
      abline(h = quantile(X, c(0.25, 0.5, 0.75), type = c(5, 7)[(N %% 2) + 1]),
             col = "red", lty = 2)
    }
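The same comparison can also be made numerically rather than by eye. The sketch below repeats the experiment without plotting and compares quantile against fivenum for each N. The seed value is arbitrary and only there for reproducibility, and the variable names are illustrative.

```r
set.seed(42)  # arbitrary seed, only so the run is reproducible

matches <- sapply(1:30, function(N) {
  x <- sample(100, N)
  type <- if (N %% 2 == 0) 5 else 7          # even N: type 5, odd N: type 7
  q <- quantile(x, c(0.25, 0.5, 0.75), type = type, names = FALSE)
  # fivenum() returns min, lower hinge, median, upper hinge, max
  isTRUE(all.equal(q, fivenum(x)[2:4]))
})

all(matches)   # TRUE when the rule holds for every N from 1 to 30
```

The check uses fivenum because it returns the same hinges that boxplot draws.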



Hyndman, R. J. and Fan, Y. (1996) Sample quantiles in statistical packages, American Statistician 50, 361-365.


Source: https://habr.com/ru/post/1259843/

