Outlier detection can be performed with DBSCAN, a well-known clustering algorithm available in the dbscan R package (see Wikipedia for more information).
The dbscan() function has three important inputs:
- x: your data (numeric values only)
- eps: the maximum distance between two points for them to be considered neighbors
- minPts: the minimum number of points required to form a cluster
Estimation of eps can be done using the functions kNNdist(...) and kNNdistplot(...):
- kNNdistplot will plot the sorted kNN distances of your dataset for a given k (i.e. minPts) ==> you can visually select an effective eps value (usually in the knee of the curve; see the sketch after this list)
- kNNdist will compute the kNN distances and return them in matrix form; an input k generates estimates for neighbors 1:k, and you can use the result to determine suitable values of eps and k programmatically
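For instance, a minimal sketch of the visual step (here x stands for your numeric data, and the 0.5 threshold is only a placeholder you would read off the knee yourself):

library(dbscan)
kNNdistplot(x, k = 4)      # sorted distances to the 4th nearest neighbor
abline(h = 0.5, lty = 2)   # candidate eps taken at the knee of the curve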
Next, you should call dbscan(yourdata, eps, minPts) to get a dbscan object with the following components:
- eps: eps used for calculations
- minPts: minimum number of points for cluster identification
- cluster: an integer vector identifying the cluster each point belongs to (>= 1) or marking it as noise (= 0). The noise points correspond to the outliers that you want to eliminate (see the sketch after this list).
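For example, a minimal sketch of that elimination step (eps and minPts values are placeholders):

res <- dbscan(yourdata, eps = 0.5, minPts = 4)  # placeholder parameters
clean <- yourdata[res$cluster > 0, ]    # points assigned to a cluster
noise <- yourdata[res$cluster == 0, ]   # the outliers to eliminate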
Note the following limitations of dbscan:
- dbscan uses the Euclidean distance and therefore suffers from the "curse of dimensionality". This can be mitigated by reducing the dimensionality first, e.g. with PCA (see the sketches after this list).
- dbscan discards duplicated (superimposed) points, which can leave some points unidentified. This can be solved by joining the results back to your data with a left outer join, or by using the jitter(...) function, which adds a small amount of noise to your data. Judging by the data you provided, I think this could be the case for your dataset.
Given these limitations, note that the dbscan package also offers two alternative methods: LOF and OPTICS (an extension of DBSCAN).
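As a rough illustration of these workarounds and alternatives (a sketch only; the eps/minPts values are placeholders, and the lof()/optics() calls assume a recent version of the dbscan package):

# workaround for high dimensionality: cluster on the first principal components
pca <- prcomp(x, scale. = TRUE)                      # x = your numeric data
res <- dbscan(pca$x[, 1:2], eps = 0.5, minPts = 4)

# workaround for duplicated points: add a small amount of noise
x_jittered <- as.data.frame(lapply(as.data.frame(x), jitter))
res <- dbscan(x_jittered, eps = 0.5, minPts = 4)

# alternative methods offered by the dbscan package
lof_scores <- lof(x, minPts = 4)   # local outlier factor: scores >> 1 suggest outliers
opt <- optics(x, minPts = 4)       # OPTICS reachability ordering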
Edit 25 January 2016
Following @rawr's answer, I give an example based on mtcars to show how to use dbscan for outlier detection. Note that my example uses the great data.table package instead of the classic data.frame.
First, I replicate rawr's method to illustrate the use of data.table:
require(data.table)
require(ggplot2)
require(dbscan)
data(mtcars)
dt_mtcars <- as.data.table(mtcars)
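The replication itself is not shown in the snippet above; a sketch of rawr's loess-band plot adapted to data.table (mirroring the loess code reused later in this answer) could look like this:

plot(wt ~ mpg, data = dt_mtcars)
lo <- loess.smooth(dt_mtcars[, mpg], dt_mtcars[, wt])
lines(lo$x, lo$y, lwd = 3)                  # loess trend
lines(lo$x, lo$y * 1.2, lwd = 3, col = 2)   # upper band, as in rawr's answer
lines(lo$x, lo$y / 1.2, lwd = 3, col = 2)   # lower band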

This way, we can verify that we get the same results regardless of the underlying data structure.
Second, the following code illustrates the DBSCAN approach, which begins by determining eps and k, the number of points required to identify a cluster:
res_knn = kNNdist(dt_mtcars[, .(wt, mpg)], k = 10, all = TRUE)  # all = TRUE is needed in recent dbscan versions to get the full 1..k distance matrix
dim_knn = dim(res_knn)
x_knn = seq(1, dim_knn[1])
ggplot() +
  geom_line(aes(x = x_knn, y = sort(res_knn[, 1]), col = 1)) +
  geom_line(aes(x = x_knn, y = sort(res_knn[, 2]), col = 2)) +
  geom_line(aes(x = x_knn, y = sort(res_knn[, 3]), col = 3)) +
  geom_line(aes(x = x_knn, y = sort(res_knn[, 4]), col = 4)) +
  geom_line(aes(x = x_knn, y = sort(res_knn[, 5]), col = 5)) +
  geom_line(aes(x = x_knn, y = sort(res_knn[, 6]), col = 6)) +
  geom_line(aes(x = x_knn, y = sort(res_knn[, 7]), col = 7)) +
  geom_line(aes(x = x_knn, y = sort(res_knn[, 8]), col = 8)) +
  geom_line(aes(x = x_knn, y = sort(res_knn[, 9]), col = 9)) +
  geom_line(aes(x = x_knn, y = sort(res_knn[, 10]), col = 10)) +
  xlab('sorted results') +
  ylab('kNN distance')
The results are displayed in the following graph:

This shows that the estimated kNN distance is sensitive to the choice of k; however, the eps value that separates the outliers lies in the knee portion of the curves ==> a suitable eps is between 2 and 4. This is a visual assessment that can be automated using appropriate search algorithms (e.g. see this link). As for k, a compromise must be found, knowing that the lower k is, the less stringent the results.
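For illustration, a hypothetical knee-finding helper (not the algorithm from the link; just a simple maximum-distance-to-chord heuristic) could look like this:

# hypothetical helper: pick the sorted-distance point farthest from the
# straight line (chord) joining the curve's two endpoints
find_knee <- function(d) {
  d <- sort(d)
  n <- length(d)
  x <- seq_len(n)
  a <- d[n] - d[1]                 # line a*x + b*y + c = 0 through the endpoints
  b <- -(n - 1)
  c <- (n - 1) * d[1] - (d[n] - d[1])
  dist_to_chord <- abs(a * x + b * d + c) / sqrt(a^2 + b^2)
  d[which.max(dist_to_chord)]      # candidate eps at the knee
}
eps_candidate <- find_knee(kNNdist(dt_mtcars[, .(wt, mpg)], k = 4))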
Next, we parameterize dbscan with eps = 3 (based on the visual assessment) and k = 4 for less stringent results, and plot these results using rawr's code:
eps = 3
k = 4
res_dbscan = dbscan(dt_mtcars[, .(wt, mpg)], eps, k)
plot(wt ~ mpg, data = dt_mtcars, col = res_dbscan$cluster)
lo <- loess.smooth(dt_mtcars[res_dbscan$cluster > 0, mpg],
                   dt_mtcars[res_dbscan$cluster > 0, wt])
lines(lo$x, lo$y, lwd = 3)
lines(lo$x, lo$y * 1.2, lwd = 3, col = 2)
lines(lo$x, lo$y / 1.2, lwd = 3, col = 2)

We obtain this figure, where we can see that we get results different from rawr's approach: the points located at mpg = [10, 13] are considered outliers.
These results may seem odd compared to the original solution, which works under the assumption of two-dimensional data (Y ~ X). However, mtcars is a multidimensional dataset in which the relationships between variables may (or may not) be linear... To evaluate this point, we can draw a scatter-plot matrix of this dataset, filtered to its numeric variables, for example:
pairs(dt_mtcars[, .(mpg, disp, hp, drat, wt, qsec)])

If we focus only on the wt ~ mpg panel, we might think at first glance that the relationship is inversely linear. But for the other pairwise relationships this may not be the case, and outlier detection in an N-dimensional setting is a bit more complicated. In fact, a point may look like an outlier when projected onto a particular 2D comparison... but no longer once we add a new comparison dimension. Indeed, there may be collinearity between variables that, once identified, strengthens (or not) a point's membership in a cluster.
I agree, that is a lot of ifs. To illustrate this situation, let us move on to a dbscan analysis over the numeric variables of mtcars.
So, I reproduce the process presented earlier, starting with the kNN distance analysis:
res_knn = kNNdist(dt_mtcars[, .(mpg, disp, hp, drat, wt, qsec)], k = 10, all = TRUE)  # all = TRUE is needed in recent dbscan versions
dim_knn = dim(res_knn)
x_knn = seq(1, dim_knn[1])
ggplot() +
  geom_line(aes(x = x_knn, y = sort(res_knn[, 1]), col = 1)) +
  geom_line(aes(x = x_knn, y = sort(res_knn[, 2]), col = 2)) +
  geom_line(aes(x = x_knn, y = sort(res_knn[, 3]), col = 3)) +
  geom_line(aes(x = x_knn, y = sort(res_knn[, 4]), col = 4)) +
  geom_line(aes(x = x_knn, y = sort(res_knn[, 5]), col = 5)) +
  geom_line(aes(x = x_knn, y = sort(res_knn[, 6]), col = 6)) +
  geom_line(aes(x = x_knn, y = sort(res_knn[, 7]), col = 7)) +
  geom_line(aes(x = x_knn, y = sort(res_knn[, 8]), col = 8)) +
  geom_line(aes(x = x_knn, y = sort(res_knn[, 9]), col = 9)) +
  geom_line(aes(x = x_knn, y = sort(res_knn[, 10]), col = 10)) +
  xlab('sorted results') +
  ylab('kNN distance')

Compared with the wt ~ mpg analysis, we can see that kNNdist(...) produces much larger kNN distances (e.g. up to 200 with k = 10). However, we still have a knee portion that helps us estimate an appropriate eps value.
In the next part, we will use eps = 75 and k = 5:
# optimal eps value is between 40 (k=1) and 130 (k=10)
eps = 75
k = 5
res_dbscan = dbscan(dt_mtcars[, .(mpg, disp, hp, drat, wt, qsec)], eps, k)
pairs(dt_mtcars[, .(mpg, disp, hp, drat, wt, qsec)],
      col = res_dbscan$cluster + 2L)

Thus, the scatter-plot matrix of this analysis highlights that outlier detection can be difficult in an N-dimensional setting due to the complex relationships between variables. But keep in mind that in most cases the outliers lie in the corners of the 2D projections, which reinforces the results obtained with wt ~ mpg.
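As a final check, one can list which cars were flagged as noise by this multivariate run (row order is preserved by as.data.table, so the mtcars row names still apply):

rownames(mtcars)[res_dbscan$cluster == 0]   # cars considered outliers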