Delete values outside the Loess curve

I want to remove outliers before fitting a model. I use a loess curve to mark the trend line, with upper and lower limits around it, and I would like to delete the rows that fall outside those limits. I could do this with a custom function that takes each point in turn and checks the local slope of the loess fit, etc., but is there an easier way?

(figure: Loess trend line with limits (1.2))

    # Code generating the image above
    scatter.smooth(idam$T_d, idam$T_x10d)
    loessline <- loess.smooth(idam$T_d, idam$T_x10d)
    lines(loessline$x, loessline$y, lwd = 3)
    lines(loessline$x, loessline$y * 1.2, lwd = 3, col = 'red')
    lines(loessline$x, loessline$y / 1.2, lwd = 3, col = 'red')
3 answers

Outlier detection can be performed with the DBSCAN R package, which implements a well-known density-based clustering algorithm (see Wikipedia for more information).

The dbscan(...) function has three important inputs:

  • x: your data (numeric values only)
  • eps: the radius of the epsilon neighbourhood, i.e. the maximum distance between two points for them to be considered neighbours
  • minPts: the minimum number of points required to form a cluster

eps can be estimated with the functions kNNdist(...) and kNNdistplot(...); a short sketch follows the list below.

  • kNNdistplot plots the sorted k-nearest-neighbour distances of your dataset for a given k (i.e. minPts) ==> you can pick a suitable eps visually (usually at the knee of the curve)
  • kNNdist computes the kNN distances themselves and can return them as a matrix with the distances to the 1st..kth neighbours, which you can use to determine suitable values of eps and k programmatically
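A minimal sketch of the visual approach, assuming the dbscan package and the mtcars data used later in this answer (the k = 4 and the 3.0 cut-off are illustrative guesses, not recommendations):

    library(dbscan)
    data(mtcars)

    # plot the sorted distances to the 4th nearest neighbour;
    # the knee of this curve suggests a candidate eps
    kNNdistplot(as.matrix(mtcars[, c("wt", "mpg")]), k = 4)
    abline(h = 3, lty = 2)  # eyeballed knee => candidate eps ~ 3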

Next, call dbscan(yourdata, eps, minPts) to get a dbscan object with the following components (a short removal sketch follows the list):

  • eps: the eps used for the computation
  • minPts: the minimum number of points used for cluster identification
  • cluster: an integer vector assigning each point either to a cluster (>= 1) or to noise (= 0). The noise points are the outliers you want to eliminate.
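As a hedged sketch (eps = 3 and minPts = 4 are only illustrative values here), removing the flagged rows looks like this:

    library(dbscan)
    res <- dbscan(as.matrix(mtcars[, c("wt", "mpg")]), eps = 3, minPts = 4)

    # cluster == 0 marks noise points, i.e. the outliers to drop
    mtcars_clean <- mtcars[res$cluster != 0, ]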

Note the following limitations of dbscan (both workarounds are sketched after this list):

  • dbscan uses the Euclidean distance and therefore suffers from the curse of dimensionality. This can be mitigated by running PCA first.
  • dbscan collapses exactly overlapping points, which can leave some input rows unassigned. This can be handled by joining the results back to your data with a left outer join, or by using jitter(...), which adds a little noise to your data. From what you have shown, I suspect this could be the case with your data.
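A sketch of both workarounds, under the assumption that the first two principal components capture most of the variance (the eps used here is again just a placeholder):

    library(dbscan)

    # 1) PCA to mitigate the curse of dimensionality before dbscan
    num_cols <- c("mpg", "disp", "hp", "drat", "wt", "qsec")
    pc <- prcomp(mtcars[, num_cols], scale. = TRUE)
    res_pc <- dbscan(pc$x[, 1:2], eps = 0.8, minPts = 4)  # cluster on the first 2 PCs

    # 2) jitter() to break up exactly overlapping points
    x_jit <- apply(as.matrix(mtcars[, c("wt", "mpg")]), 2, jitter)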

Knowing these limitations, the dbscan package also offers two alternative methods: LOF and OPTICS (an extension of DBSCAN).
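For example, a minimal LOF sketch (note that the argument is named minPts in recent versions of the package, and the > 1.5 cut-off is a common rule of thumb, not a fixed rule):

    library(dbscan)

    # local outlier factor: scores well above 1 suggest outliers
    lof_scores <- lof(as.matrix(mtcars[, c("wt", "mpg")]), minPts = 5)
    which(lof_scores > 1.5)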

Edit 25 January 2016

Following @rawr's answer, I give an example based on the mtcars dataset to show how dbscan can be used to identify outliers. Note that my example uses the data.table package instead of the classic data.frame.

First, I replicate rawr's method to illustrate the use of data.table:

    require(data.table)
    require(ggplot2)
    require(dbscan)
    data(mtcars)
    dt_mtcars <- as.data.table(mtcars)

    # based on rawr's approach
    plot(wt ~ mpg, data = dt_mtcars)
    lo <- loess.smooth(dt_mtcars[, mpg], dt_mtcars[, wt])
    lines(lo$x, lo$y, lwd = 3)
    lines(lo$x, lo$y * 1.2, lwd = 3, col = 2)
    lines(lo$x, lo$y / 1.2, lwd = 3, col = 2)

(figure: wt ~ mpg scatterplot with the loess line and 1.2 limits, reproduced with data.table)

This shows that we get the same results whatever the underlying data structure.

Second, the following code illustrates the DBSCAN approach, starting with the determination of eps and k, the number of points required to identify a cluster:

    # note: recent versions of dbscan return a vector by default here;
    # all = TRUE requests the full matrix of 1st..10th neighbour distances
    res_knn = kNNdist(dt_mtcars[, .(wt, mpg)], k = 10, all = TRUE)
    dim_knn = dim(res_knn)
    x_knn = seq(1, dim_knn[1])

    ggplot() +
      geom_line(aes(x = x_knn, y = sort(res_knn[, 1]), col = 1)) +
      geom_line(aes(x = x_knn, y = sort(res_knn[, 2]), col = 2)) +
      geom_line(aes(x = x_knn, y = sort(res_knn[, 3]), col = 3)) +
      geom_line(aes(x = x_knn, y = sort(res_knn[, 4]), col = 4)) +
      geom_line(aes(x = x_knn, y = sort(res_knn[, 5]), col = 5)) +
      geom_line(aes(x = x_knn, y = sort(res_knn[, 6]), col = 6)) +
      geom_line(aes(x = x_knn, y = sort(res_knn[, 7]), col = 7)) +
      geom_line(aes(x = x_knn, y = sort(res_knn[, 8]), col = 8)) +
      geom_line(aes(x = x_knn, y = sort(res_knn[, 9]), col = 9)) +
      geom_line(aes(x = x_knn, y = sort(res_knn[, 10]), col = 10)) +
      xlab('sorted results') +
      ylab('kNN distance')

The results are displayed in the following graph:

(figure: sorted kNN distance curves for k = 1..10)

This shows that the estimated kNN distances are sensitive to k; nevertheless, the eps value that separates the outliers lies in the knee of the curves ==> a suitable eps is between 2 and 4. This is a visual assessment, which can be automated with an appropriate search algorithm (see this link, and the sketch below). As for k, a compromise has to be struck, bearing in mind that the lower k is, the less stringent the results.
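One simple way to automate the knee choice, sketched here under the assumption that the sorted-distance curve rises slowly and then sharply (k = 4 is illustrative; dt_mtcars is reused from the code above): take the point of the sorted curve that sags furthest below the straight chord joining its endpoints.

    # sorted distances to the 4th nearest neighbour
    d <- sort(kNNdist(dt_mtcars[, .(wt, mpg)], k = 4))
    n <- length(d)

    # straight chord from the first to the last sorted distance
    chord <- d[1] + (d[n] - d[1]) * (seq_len(n) - 1) / (n - 1)

    # the knee is roughly where the curve deviates most below the chord
    eps_auto <- d[which.max(chord - d)]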

In the next part, we parameterise dbscan with eps = 3 (based on the visual assessment) and k = 4 for less stringent results, and plot the outcome using rawr's code:

    eps = 3
    k = 4
    res_dbscan = dbscan(dt_mtcars[, .(wt, mpg)], eps, k)

    # shift colours by 1 so that noise points (cluster == 0) are not drawn
    # in col = 0, the background colour, i.e. invisible
    plot(wt ~ mpg, data = dt_mtcars, col = res_dbscan$cluster + 1L)
    lo <- loess.smooth(dt_mtcars[res_dbscan$cluster > 0, mpg],
                       dt_mtcars[res_dbscan$cluster > 0, wt])
    lines(lo$x, lo$y, lwd = 3)
    lines(lo$x, lo$y * 1.2, lwd = 3, col = 2)
    lines(lo$x, lo$y / 1.2, lwd = 3, col = 2)

(figure: dbscan clusters on wt ~ mpg, with the loess line and limits refit on the non-noise points)

We get the figure above, where we can see that the results differ from the rawr approach: the points located at mpg = [10, 13] are considered outliers.
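To make the comparison with rawr's indices concrete, we can list the rows that dbscan flags as noise (a small sketch reusing res_dbscan and dt_mtcars from above):

    # rows flagged as noise (cluster == 0) by dbscan
    dt_mtcars[res_dbscan$cluster == 0, .(mpg, wt)]
    rownames(mtcars)[res_dbscan$cluster == 0]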

These results may look odd compared with the original solution, which works under a two-dimensional assumption (Y ~ X). However, mtcars is a multidimensional dataset in which the relationships between variables may or may not be linear... To assess this, we can draw a scatterplot matrix of the numeric variables, for example

 pairs(dt_mtcars[, .(mpg, disp, hp, drat, wt, qsec)]) 

(figure: pairs() scatterplot matrix of the numeric mtcars variables)

If we focus only on wt ~ mpg, the relationship looks inverse at first glance. But taken together with the other pairwise relationships this may not hold, and outlier detection in an N-dimensional setting is a bit more complicated: a point can look like an outlier in a particular 2D projection yet stop looking like one once we add another dimension. Indeed, collinearity between variables may strengthen (or weaken) a point's membership of a cluster.

I agree that this is a lot of ifs, so to illustrate the situation let us extend the dbscan analysis to the numeric variables of mtcars.

So I reproduce the process presented earlier, starting with the kNN distance analysis:

    # as above, all = TRUE requests the full 1st..10th neighbour distance matrix
    res_knn = kNNdist(dt_mtcars[, .(mpg, disp, hp, drat, wt, qsec)], k = 10, all = TRUE)
    dim_knn = dim(res_knn)
    x_knn = seq(1, dim_knn[1])

    ggplot() +
      geom_line(aes(x = x_knn, y = sort(res_knn[, 1]), col = 1)) +
      geom_line(aes(x = x_knn, y = sort(res_knn[, 2]), col = 2)) +
      geom_line(aes(x = x_knn, y = sort(res_knn[, 3]), col = 3)) +
      geom_line(aes(x = x_knn, y = sort(res_knn[, 4]), col = 4)) +
      geom_line(aes(x = x_knn, y = sort(res_knn[, 5]), col = 5)) +
      geom_line(aes(x = x_knn, y = sort(res_knn[, 6]), col = 6)) +
      geom_line(aes(x = x_knn, y = sort(res_knn[, 7]), col = 7)) +
      geom_line(aes(x = x_knn, y = sort(res_knn[, 8]), col = 8)) +
      geom_line(aes(x = x_knn, y = sort(res_knn[, 9]), col = 9)) +
      geom_line(aes(x = x_knn, y = sort(res_knn[, 10]), col = 10)) +
      xlab('sorted results') +
      ylab('kNN distance')

(figure: sorted kNN distances for the six numeric variables, k = 1..10)

Compared with the wt ~ mpg analysis, kNNdist(...) now produces much larger kNN distances (up to about 200 with k = 10). However, there is still a knee that helps us choose a suitable eps value.

In the next part, we use eps = 75 and k = 5:

    # the optimal eps value is between 40 (k = 1) and 130 (k = 10)
    eps = 75
    k = 5
    res_dbscan = dbscan(dt_mtcars[, .(mpg, disp, hp, drat, wt, qsec)], eps, k)
    pairs(dt_mtcars[, .(mpg, disp, hp, drat, wt, qsec)],
          col = res_dbscan$cluster + 2L)

(figure: scatterplot matrix coloured by dbscan cluster assignment)

The scatterplot matrix of this analysis emphasises that outlier detection can be difficult in an N-dimensional setting because of the complex relationships between variables. Keep in mind, though, that in most cases the outliers sit in a corner of the 2D projections, which reinforces the results obtained with wt ~ mpg.


You can use approxfun

Here is an example with "outliers"

    plot(wt ~ mpg, data = mtcars)
    lo <- loess.smooth(mtcars$mpg, mtcars$wt)
    lines(lo$x, lo$y, lwd = 3)
    lines(lo$x, lo$y * 1.2, lwd = 3, col = 2)
    lines(lo$x, lo$y / 1.2, lwd = 3, col = 2)

(figure: wt ~ mpg scatterplot with the loess line and 1.2 limits)

approxfun returns a function built from the observed x and y values, which we can use to interpolate new y values at arbitrary x.
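As a minimal illustration of what approxfun does (toy numbers, not from the question):

    # linear interpolator through the points (1, 10), (2, 20), (3, 15)
    f <- approxfun(c(1, 2, 3), c(10, 20, 15))
    f(2.5)  # -> 17.5, halfway between 20 and 15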

Then you can set a threshold beyond which a point is called an outlier; here I use 1.2 * y, as in the original question, to identify extreme observations.

    f1 <- approxfun(lo$x, lo$y * 1.2)
    (wh1 <- which(mtcars$wt > f1(mtcars$mpg)))
    # [1]  8 17 18

    f2 <- approxfun(lo$x, lo$y / 1.2)
    (wh2 <- which(mtcars$wt < f2(mtcars$mpg)))
    # [1] 28

    ## identify points to exclude
    mt <- mtcars[c(wh1, wh2), ]
    points(mt$mpg, mt$wt, pch = 4, col = 2, cex = 2)

(figure: the same plot with the flagged outliers marked by red crosses)

    ## plot without the excluded points
    plot(wt ~ mpg, data = mt2 <- mtcars[-c(wh1, wh2), ])
    lo <- loess.smooth(mt2$mpg, mt2$wt)
    lines(lo$x, lo$y, lwd = 3)
    lines(lo$x, lo$y * 1.2, lwd = 3, col = 2)
    lines(lo$x, lo$y / 1.2, lwd = 3, col = 2)

(figure: the loess line and limits refit after removing the flagged points)

Since there are a few steps here, you can compose this into a function to make things a little easier:

    par(mfrow = c(2, 2))
    with(mtcars, {
      plot_lo(mpg, wt)
      plot_lo(mpg, wt, limits = c(1 / 1.5, 1.5))
      dd <<- plot_lo(mpg, wt, limits = c(1 / 1.2, 1.2))
      plot_lo(mpg, wt, pch = 16, las = 1, tcl = .5, bty = 'l')
    })
    str(dd)
    # List of 2
    #  $ x: num [1:28] 21 21 22.8 21.4 18.7 18.1 14.3 22.8 19.2 17.8 ...
    #  $ y: num [1:28] 2.62 2.88 2.32 3.21 3.44 ...

(figure: 2 x 2 panel of plot_lo() results with different limits)

    plot_lo <- function(x, y, limits = c(-Inf, Inf), ...) {
      lo <- loess.smooth(x, y)
      fx <- approxfun(lo$x, lo$y * limits[1L])
      fy <- approxfun(lo$x, lo$y * limits[2L])
      idx <- which(y < fx(x) | y > fy(x))

      if (length(idx)) {
        x <- x[-idx]
        y <- y[-idx]
        lo <- loess.smooth(x, y)
      }

      op <- par(..., no.readonly = TRUE)
      on.exit(par(op))

      plot(x, y)
      lines(lo$x, lo$y, lwd = 3)
      lines(lo$x, lo$y * limits[1L], lwd = 3, col = 2L)
      lines(lo$x, lo$y * limits[2L], lwd = 3, col = 2L)
      invisible(list(x = x, y = y))
    }

My suggestion is to take a look at the outliers package, which lets you identify outliers before the analysis. Here is a very simple example:

    library(outliers)
    series <- c(runif(100, 1, 2), 1000)
    round(scores(series, prob = 1, type = "chisq"), 3)

This function can perform several different tests, and you can choose the probability level beyond which a point counts as an outlier.

    # keep only the observations whose score probability is below the chosen
    # 0.95 level (the original line indexed the raw values instead of the
    # scores, and used a matrix-style comma on a plain vector)
    sc <- scores(series, prob = 1, type = "chisq")
    series <- series[which(sc < 0.95)]
