Find the points above and below the confidence interval when using stat_smooth / geom_smooth in ggplot2

I have a scatter plot and want to know how I can find the genes (points) that lie above and below the confidence interval lines.


EDIT: Reproducible example:

    library(ggplot2)

    # dummy data
    df <- mtcars[, c("mpg", "cyl")]

    # plot
    ggplot(df, aes(mpg, cyl)) +
      geom_point() +
      geom_smooth()

3 answers

I had to dive deep into the GitHub repo, but I finally got it. To do this, you need to know how stat_smooth works. In this particular case, the loess function is called to do the smoothing (other smoothing functions can be handled with the same process as below).

So, using loess in this case, we do:

    # data (plain data.frame subsetting; with = FALSE is data.table syntax
    # and fails on a data.frame)
    df <- mtcars[, c("mpg", "cyl")]

    # run loess model
    cars.lo <- loess(cyl ~ mpg, df)

Then I had to read the source to see how the predictions are made inside stat_smooth. Hadley apparently uses the predictdf function (which is not exported from the namespace), shown below for our case:

    predictdf.loess <- function(model, xseq, se, level) {
      pred <- stats::predict(model, newdata = data.frame(x = xseq), se = se)

      if (se) {
        y <- pred$fit
        ci <- pred$se.fit * stats::qt(level / 2 + .5, pred$df)
        ymin <- y - ci
        ymax <- y + ci
        data.frame(x = xseq, y, ymin, ymax, se = pred$se.fit)
      } else {
        data.frame(x = xseq, y = as.vector(pred))
      }
    }

After reading the above, I was able to create my own data.frame of predictions using:

    # get the predictions, i.e. the fit and se.fit vectors
    pred <- predict(cars.lo, se = TRUE)

    # create a data.frame from those
    # (note: the se.fit column holds the CI half-width, i.e. se.fit * t-quantile)
    df2 <- data.frame(mpg = df$mpg,
                      fit = pred$fit,
                      se.fit = pred$se.fit * qt(0.95 / 2 + .5, pred$df))

Looking at predictdf.loess, we see that the upper bound of the confidence interval is created as pred$fit + pred$se.fit * qt(0.95 / 2 + .5, pred$df), and the lower bound as pred$fit - pred$se.fit * qt(0.95 / 2 + .5, pred$df).
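As a minimal sketch of that formula using base R only, the bounds can be wrapped in a small helper. The name loess_band is hypothetical (it is not part of ggplot2 or stats), and the 0.95 default is an assumption taken from the level used in this answer:

```r
# Sketch: pointwise confidence band of a loess fit at the observed x values.
df <- mtcars[, c("mpg", "cyl")]
fit <- loess(cyl ~ mpg, data = df)

# Hypothetical helper, mirroring the arithmetic in predictdf.loess
loess_band <- function(model, level = 0.95) {
  pred <- predict(model, se = TRUE)                         # fit, se.fit, df
  half_width <- pred$se.fit * qt(level / 2 + 0.5, pred$df)  # CI half-width
  data.frame(fit   = pred$fit,
             lower = pred$fit - half_width,
             upper = pred$fit + half_width)
}

band <- loess_band(fit)
head(band)
```

Each row of band lines up with the corresponding row of df, so the bounds can be compared directly with df$cyl.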

Using them, we can create a flag for the points above or below these bounds:

    # make the flag
    outerpoints <- +(df$cyl > df2$fit + df2$se.fit | df$cyl < df2$fit - df2$se.fit)

    # add flag to original data frame
    df$outer <- outerpoints

The df$outer column is probably what the OP is looking for (it takes the value 1 if the point is outside the bounds and 0 otherwise), but just for the sake of it I plot it below.

Note that the unary + operator above is used only to convert the logical flag to a numeric value.
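A tiny illustration of that conversion, with a made-up logical vector (the name flag is just for this example):

```r
flag <- c(TRUE, FALSE, TRUE)

num_flag <- +flag             # unary plus coerces logical to numeric 1/0
int_flag <- as.integer(flag)  # a more explicit way to get the same values

num_flag
int_flag
```

as.integer() is arguably easier to read, while +() is simply shorter to type.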

Now, if we plot it like this:

    ggplot(df, aes(mpg, cyl)) +
      geom_point(aes(colour = factor(outer))) +
      geom_smooth()

We can see the points inside and outside the confidence interval.


P.S. For those interested in the upper and lower bounds, they can be drawn as follows (my guess is that the shaded area in geom_smooth is drawn with geom_ribbon or something similar, which makes it look rounder and nicer):

    # upper boundary
    ggplot(df, aes(mpg, cyl)) +
      geom_point(aes(colour = factor(outer))) +
      geom_smooth() +
      geom_line(data = df2, aes(mpg, fit + se.fit, group = 1), colour = 'red')

    # lower boundary
    ggplot(df, aes(mpg, cyl)) +
      geom_point(aes(colour = factor(outer))) +
      geom_smooth() +
      geom_line(data = df2, aes(mpg, fit - se.fit, group = 1), colour = 'red')
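The same process should carry over to other smoothers. As a hedged sketch, assuming a linear fit as in geom_smooth(method = "lm"), predict() on an lm object can return the confidence bounds directly, so no manual t-quantile step is needed. The names fit_lm and ci are just for this example:

```r
# Sketch: flag points outside the 95% confidence band of a linear fit.
df <- mtcars[, c("mpg", "cyl")]
fit_lm <- lm(cyl ~ mpg, data = df)

# matrix with columns fit, lwr, upr, one row per observation
ci <- predict(fit_lm, interval = "confidence", level = 0.95)

# 1 if the point is outside the band, 0 otherwise
df$outer <- +(df$cyl > ci[, "upr"] | df$cyl < ci[, "lwr"])
table(df$outer)
```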

This solution uses the hard work ggplot2 does for you:

    library(sp)

    # we have to build the plot first so ggplot can do the calculations
    ggplot(df, aes(mpg, cyl)) +
      geom_point() +
      geom_smooth() -> gg

    # do the calculations
    gb <- ggplot_build(gg)

    # get the CI data
    p <- gb$data[[2]]

    # make a polygon out of it
    poly <- data.frame(
      x = c(p$x[1], p$x, p$x[length(p$x)], rev(p$x)),
      y = c(p$ymax[1], p$ymin, p$ymax[length(p$x)], rev(p$ymax))
    )

    # test for original values in said polygon and add that to orig data
    # so we can color by it
    df$in_ci <- point.in.polygon(df$mpg, df$cyl, poly$x, poly$y)

    # re-do the plot with the new data
    ggplot(df, aes(mpg, cyl)) +
      geom_point(aes(color = factor(in_ci))) +
      geom_smooth()

This needs a bit of tweaking (e.g. the last point gets a value of 2), but I'm short on time. Note that the return values of point.in.polygon are:

  • 0: the point is strictly exterior to pol
  • 1: the point is strictly interior to pol
  • 2: the point lies on the relative interior of an edge of pol
  • 3: the point is a vertex of pol

so you just need to convert the result to TRUE / FALSE depending on whether the value is 0 or not.
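A minimal sketch of that conversion, using a made-up unit square in place of the confidence band polygon (all names and coordinates here are illustrative, and the sp package is assumed to be installed):

```r
library(sp)

# hypothetical polygon: the unit square
sq_x <- c(0, 1, 1, 0)
sq_y <- c(0, 0, 1, 1)

# two test points: one clearly inside, one clearly outside
px <- c(0.5, 2)
py <- c(0.5, 2)

code <- point.in.polygon(px, py, sq_x, sq_y)  # integer codes 0 to 3
inside <- code != 0                           # TRUE for interior, edge or vertex

inside
```

Whether edge and vertex hits (codes 2 and 3) should count as inside is a design choice; use code == 1 instead if only strictly interior points should qualify.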


Using ggplot_build as in @hrbrmstr's nice solution, you can actually do this by simply passing a sequence of x values to geom_smooth that specifies where the error bounds should be calculated, and making it equal to the x values of your points. Then you just check whether the y values are within the range.

    library(ggplot2)

    ## dummy data
    df <- mtcars[, c("mpg", "cyl")]

    ## evaluate the smoother at the observed x values
    ## (older ggplot2 versions took this as params = list(xseq = df$mpg))
    ggplot(df, aes(mpg, cyl)) +
      geom_smooth(xseq = df$mpg) -> gg

    ## Find the points within bounds
    bounds <- ggplot_build(gg)$data[[1]]
    df$inside <- with(df, bounds$ymin < cyl & bounds$ymax > cyl)

    ## Add the points
    gg + geom_point(data = df, aes(color = inside)) + theme_bw()


Source: https://habr.com/ru/post/1233528/

