How to simulate geom_boxplot () with outliers using geom_boxplot (stat = "identity")

I would like to pre-compute brief summaries of the data (using plyr and passing the quantile function) and then build using geom_boxplot(stat = "identity") . This works great, except that (a) it does not display outliers as points and (b) expands the whiskers to the maximum and minimum number of data that is displayed.

Example:

 library(plyr) library(ggplot2) set.seed(4) df <- data.frame(fact = sample(letters[1:2], 12, replace = TRUE), val = c(1:10, 100, 101)) df # fact val # 1 b 1 # 2 a 2 # 3 a 3 # 4 a 4 # 5 b 5 # 6 a 6 # 7 b 7 # 8 b 8 # 9 b 9 # 10 a 10 # 11 b 100 # 12 a 101 by.fact.df <- ddply(df, c("fact"), function(x) quantile(x$val)) by.fact.df # fact 0% 25% 50% 75% 100% # 1 a 2 3.25 5.0 9.00 101 # 2 b 1 5.50 7.5 8.75 100 # What I can do...with faults (a) and (b) above ggplot(by.fact.df, aes(x = fact, ymin = `0%`, lower = `25%`, middle = `50%`, upper = `75%`, ymax = `100%`)) + geom_boxplot(stat = "identity") # What I want... ggplot(df, aes(x = fact, y = val)) + geom_boxplot() 

What can I do ... with the errors (a) and (b) mentioned above:

Plot 01

What I would like to receive, but still I use preliminary calculations through plyr (or another method):

Plot 02

Initial Thoughts: Perhaps there is some way to pre-calculate the true mustache endpoints without outliers? Then multiply the data for outliers and pass them as geom_point() ?

Motivation:. When working with larger datasets, I found it faster and more practical to use plyr , dplyr and / or data.table to pre-calculate statistics, and then build them, rather than calculate ggplot2 .

UPDATE

I can extract everything I need from the following combination of dplyr and plyr , but I'm not sure if this is the most efficient way:

 df %>% group_by(fact) %>% do(ldply(boxplot.stats(.$val), data.frame)) Source: local data frame [6 x 3] Groups: fact fact .id X..i.. 1 a stats 2 2 a stats 4 3 a stats 10 4 a stats 13 5 a stats 16 6 an 9 
+6
source share
2 answers

Here is my answer using the built-in functions quantile and boxplot.stats .

geom_boxplot calculations for boxplot are slightly different from boxplot.stats . Read ?geom_boxplot and ?boxplot.stats to understand my implementation below

 #Function to calculate boxplot stats to match ggplot implemention as in geom_boxplot. my_boxplot.stats <-function(x){ quantiles <-quantile(x, c(0, 0.25, 0.5, 0.75, 1)) labels <-names(quantile(x)) #replacing the upper whisker to geom_boxplot quantiles[5] <-boxplot.stats(x)$stats[5] res <-data.frame(rbind(quantiles)) names(res) <-labels res$out <-boxplot.stats(x)$out return(res) } 

Code for calculating statistics and plotting

 library(dplyr) df %>% group_by(fact) %>% do(my_boxplot.stats(.$val)) %>% ggplot(aes(x=fact, y=out, ymin = `0%`, lower = `25%`, middle = `50%`, upper = `75%`, ymax = `100%`)) + geom_boxplot(stat = "identity") + geom_point() 
+6
source

To get the right statistics, you need to do a few more calculations than just finding the quantiles. The geom_boxplot function with stat = "identity" does not output outliers. Therefore, you need to calculate statistics without outliers, and then use geom_point to separate outliers separately. The following function (basically a simplified version of stat_boxplot ) is probably not the most efficient, but gives the desired result:

 box.df <- df %>% group_by(fact) %>% do({ stats <- as.numeric(quantile(.$val, c(0, 0.25, 0.5, 0.75, 1))) iqr <- diff(stats[c(2, 4)]) coef <- 1.5 outliers <- .$val < (stats[2] - coef * iqr) | .$val > (stats[4] + coef * iqr) if (any(outliers)) { stats[c(1, 5)] <- range(c(stats[2:4], .$val[!outliers]), na.rm=TRUE) } outlier_values = .$val[outliers] if (length(outlier_values) == 0) outlier_values <- NA_real_ res <- as.list(t(stats)) names(res) <- c("lower.whisker", "lower.hinge", "median", "upper.hinge", "upper.whisker") res$out <- outlier_values as.data.frame(res) }) box.df ## Source: local data frame [2 x 7] ## Groups: fact ## ## fact lower.whisker lower.hinge median upper.hinge upper.whisker out ## 1 a 2 3.25 5.0 9.00 10 101 ## 2 b 1 5.50 7.5 8.75 9 100 ggplot(box.df, aes(x = fact, y = out, middle = median, ymin = lower.whisker, ymax = upper.whisker, lower = lower.hinge, upper = upper.hinge)) + geom_boxplot(stat = "identity") + geom_point() 
+2
source

Source: https://habr.com/ru/post/986665/


All Articles