How to remove records from a data frame that fall outside per-variable ranges? [R]

Question


I have a data frame and a predictive model that I want to apply to the data. However, I want to filter out records to which the model may not apply well. To do this, I have another data frame that contains, for each variable, the minimum and maximum observed in the training data. I want to remove from my new data those records for which one or more values are outside the specified range.

To make my question clear, here is how my data might look:

    id   x    y
    ---- ---- ---------
    1    2    30521
    2    -1   1835
    3    5    25939
    4    4    1000000

This is what my second table, with the minimum and maximum values, might look like:

    var   min   max
    ----- ----- -------
    x     1     5
    y     0     99999

In this example, I would like to flag the following entries in my data: 2 (below the minimum for x) and 4 (above the maximum for y).

How could I easily do this in R? I have a hunch that there is some smart dplyr code that will perform this task, but I don't know what it would look like.

+5

r outliers

A. Stam Oct 11 '16 at 12:00


5 answers

Not very elegant, but anyway:

    df <- read.table(header = TRUE, text = "
    id x y
    1 2 30521
    2 -1 1835
    3 5 25939
    4 4 1000000
    ")
    df
    ranges <- read.table(header = TRUE, text = "
    var min max
    x 1 5
    y 0 99999")
    # put the rows of ranges in the same order as the columns of df, if necessary
    ranges <- ranges[match(names(df)[-1], ranges[, 1]), ]
    # TRUE for rows with at least one value outside its range
    # (rowAnys comes from the matrixStats package)
    df$flag <- matrixStats::rowAnys(
      !sapply(seq_along(df)[-1], function(x) {
        df[, x] >= ranges[x - 1, 2] & df[, x] <= ranges[x - 1, 3]
      })
    )
    df$flag
    # [1] FALSE  TRUE FALSE  TRUE
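If matrixStats is not installed, the same flag can probably be built in base R. The following is a minimal sketch, not part of the original answer; it looks each range up by variable name, so it assumes only that every name in `ranges$var` is a column of `df`.

```r
# Example data from the question
df <- data.frame(id = 1:4,
                 x = c(2, -1, 5, 4),
                 y = c(30521, 1835, 25939, 1000000))
ranges <- data.frame(var = c("x", "y"),
                     min = c(1, 0),
                     max = c(5, 99999),
                     stringsAsFactors = FALSE)

# One logical column per variable: TRUE where the value is outside [min, max].
# Ranges are looked up by name, so the row order of `ranges` does not matter.
out_of_range <- sapply(ranges$var, function(v) {
  r <- ranges[ranges$var == v, ]
  df[[v]] < r$min | df[[v]] > r$max
})

# A row is flagged when at least one of its variables is out of range
flag <- rowSums(out_of_range) > 0
flag
```

Rows 2 and 4 are the ones flagged here, matching the entries named in the question.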

0

lukeA Oct 11 '16 at 12:12


Something similar with dplyr:

    library(dplyr)
    df <- read.table(text = "
    id x y
    1 2 30521
    2 -1 1835
    3 5 25939
    4 4 1000000
    ", header = TRUE)
    dfilte <- read.table(text = "
    var min max
    x 1 5
    y 0 99999
    ", header = TRUE)
    # flag values that fall outside the [min, max] range of each variable
    df %>% mutate(flag_x = x < dfilte$min[1] | x > dfilte$max[1],
                  flag_y = y < dfilte$min[2] | y > dfilte$max[2])

which produces this output:

      id  x       y flag_x flag_y
    1  1  2   30521  FALSE  FALSE
    2  2 -1    1835   TRUE  FALSE
    3  3  5   25939  FALSE  FALSE
    4  4  4 1000000  FALSE   TRUE

0

Sabdem Oct 11 '16 at 12:14


I think your problem is well suited to the cut function in base R:

    # include.lowest = TRUE keeps values equal to the minimum inside the range
    df$to.remove <- is.na(cut(df$x, breaks = unlist(ranges[1, -1]), include.lowest = TRUE)) |
                    is.na(cut(df$y, breaks = unlist(ranges[2, -1]), include.lowest = TRUE))
    #  id  x       y to.remove
    #1  1  2   30521     FALSE
    #2  2 -1    1835      TRUE
    #3  3  5   25939     FALSE
    #4  4  4 1000000      TRUE

cut(...) returns NA for values that fall outside the given breaks, so is.na(...) gives you a logical vector that is TRUE for values outside the specified range. Finally, you combine the vectors with | (logical or) to decide which rows to remove.
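One detail of cut() worth checking on your own data: by default its intervals are open on the left, so a value exactly equal to the minimum is treated as out of range. A small sketch with made-up values (the vector `x` below is illustrative, not from the question):

```r
x <- c(2, -1, 5, 4, 1)

# Default: the interval is (1, 5], so the lower bound 1 itself yields NA
default_flags <- is.na(cut(x, breaks = c(1, 5)))

# include.lowest = TRUE closes the interval to [1, 5]
closed_flags <- is.na(cut(x, breaks = c(1, 5), include.lowest = TRUE))

default_flags
closed_flags
```

Since the ranges in the question are the observed minimum and maximum of the training data, the closed interval is most likely the behaviour you want.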

To clear the data, you just need to do this:

 df <- df[!df$to.remove,]

EDIT

I just noticed (from your comment) that your data frame contains more variables than just x and y. In this case, you can define a function, here named f, and extend it to however many variables you have in your data frame.

    f <- function(x, xrange, y, yrange) {
      # TRUE when x or y falls outside its range
      is.na(cut(x, breaks = unlist(xrange), include.lowest = TRUE)) |
        is.na(cut(y, breaks = unlist(yrange), include.lowest = TRUE))
    }
    res <- f(df$x, ranges[1, -1], df$y, ranges[2, -1])
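Writing one argument pair per variable gets unwieldy as the number of variables grows. A possible generalization, sketched here with a hypothetical helper name `outside_any` (not from the original answer), loops over the rows of the range table instead. It assumes the columns named in `ranges$var` exist in the data and contain no missing values.

```r
# Example data from the question
df <- data.frame(id = 1:4,
                 x = c(2, -1, 5, 4),
                 y = c(30521, 1835, 25939, 1000000))
ranges <- data.frame(var = c("x", "y"),
                     min = c(1, 0),
                     max = c(5, 99999),
                     stringsAsFactors = FALSE)

# Hypothetical helper: TRUE for rows where any listed variable is outside its range
outside_any <- function(data, ranges) {
  flags <- Map(function(v, lo, hi) data[[v]] < lo | data[[v]] > hi,
               ranges$var, ranges$min, ranges$max)
  # Combine the per-variable flags with a row-wise "or"
  Reduce(`|`, flags)
}

to_remove <- outside_any(df, ranges)
clean <- df[!to_remove, ]
clean
```

Adding a variable then only requires adding a row to `ranges`; the function itself does not change.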

Data

    df <- structure(list(id = 1:4,
                         x = c(2L, -1L, 5L, 4L),
                         y = c(30521L, 1835L, 25939L, 1000000L)),
                    .Names = c("id", "x", "y"),
                    class = "data.frame",
                    row.names = c(NA, -4L))
    ranges <- structure(list(var = structure(1:2, .Label = c("x", "y"), class = "factor"),
                             min = c(1L, 0L),
                             max = c(5L, 99999L)),
                        .Names = c("var", "min", "max"),
                        class = "data.frame",
                        row.names = c(NA, -2L))

0

989 Oct 11 '16 at 12:40


I don't fully understand your desired result, but this will work with any range and any amount of data:

    > df
      id  x       y
    1  1  2   30521
    2  2 -1    1835
    3  3  5   25939
    4  4  4 1000000
    # I transposed your filter data frame so it's easier to work with.
    > dfFilter
        x     y
    min 1     0
    max 5 99999

And then you can apply your range-based filter using dfFilter:

    # Flag rows of the original data frame whose x lies between the minimum and maximum x (inclusive)
    df$flag_x = ifelse(df$x >= min(dfFilter$x) & df$x <= max(dfFilter$x), "yes", "no")
    # Flag rows of the original data frame whose y lies between the minimum and maximum y (inclusive)
    df$flag_y = ifelse(df$y >= min(dfFilter$y) & df$y <= max(dfFilter$y), "yes", "no")

So, the output is as follows:

      id  x       y flag_x flag_y
    1  1  2   30521    yes    yes
    2  2 -1    1835     no    yes
    3  3  5   25939    yes    yes
    4  4  4 1000000    yes     no

Of course, you can change these filters or apply any arithmetic to the bounds to get the result you want (for example, lowering the minimum for x by 2: min(dfFilter$x) - 2).
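As an illustration of adjusting the bounds, here is a minimal sketch that widens each range by a per-variable tolerance before flagging. The `tol` vector and its values are hypothetical, chosen only to show the mechanics.

```r
# Example data and transposed filter table, as in this answer
df <- data.frame(id = 1:4,
                 x = c(2, -1, 5, 4),
                 y = c(30521, 1835, 25939, 1000000))
dfFilter <- data.frame(x = c(1, 5), y = c(0, 99999),
                       row.names = c("min", "max"))

# Hypothetical tolerance: accept x up to 2 units beyond its range, y unchanged
tol <- c(x = 2, y = 0)

# Widened bounds, one named entry per variable
lower <- unlist(dfFilter["min", ]) - tol[names(dfFilter)]
upper <- unlist(dfFilter["max", ]) + tol[names(dfFilter)]

df$flag_x <- ifelse(df$x >= lower[["x"]] & df$x <= upper[["x"]], "yes", "no")
df$flag_y <- ifelse(df$y >= lower[["y"]] & df$y <= upper[["y"]], "yes", "no")
df
```

With this tolerance the accepted range for x becomes -1 to 7, so row 2 (x = -1) is no longer flagged on x, while row 4 still fails on y.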

Hope this works.

0

Cris Oct 11 '16 at 12:48


agenis · Accepted Answer · 2016-10-11T12:22:54+0000

You have data:

    df = data.frame(x = c(2, -1, 5, 4, 7, 8),
                    y = c(30521, 1800, 25000, 1000000, -5, 10))
    limits = data.frame("var" = c("x", "y"), min = c(1, 0), max = c(5, 99999))

You can use the sweep function with the '>=' and '<=' operators quite simply (the rows of limits must be in the same order as the columns of df):

    res <- sweep(df, 2, limits[, 2], FUN = '>=') & sweep(df, 2, limits[, 3], FUN = '<=')
    res
    ####          x     y
    #### [1,]  TRUE  TRUE
    #### [2,] FALSE  TRUE
    #### [3,]  TRUE  TRUE
    #### [4,]  TRUE FALSE
    #### [5,] FALSE FALSE
    #### [6,] FALSE  TRUE

The TRUE entries mark the values that are within range for each variable, i.e. the observations to keep. This should work for any number of variables.

After that, if you need a single flag per row (a row is kept only if every column is within range), you can run this simple line (res is the previous output):

    apply(res, 1, all)
    #### [1]  TRUE FALSE  TRUE FALSE FALSE FALSE
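To go from the per-row flag to the filtered data, something like the following sketch should do. It converts the data to a matrix before calling sweep and reorders `limits` by column name first, since sweep matches the statistics to columns purely by position; the deliberately shuffled `limits` below is an assumption made to show why that step matters.

```r
df <- data.frame(x = c(2, -1, 5, 4, 7, 8),
                 y = c(30521, 1800, 25000, 1000000, -5, 10))
# Limits listed in a different order than the columns of df (for illustration)
limits <- data.frame(var = c("y", "x"),
                     min = c(0, 1),
                     max = c(99999, 5),
                     stringsAsFactors = FALSE)

# Reorder limits so that row i describes column i of df
limits <- limits[match(names(df), limits$var), ]

m <- as.matrix(df)
in_range <- sweep(m, 2, limits$min, FUN = ">=") &
            sweep(m, 2, limits$max, FUN = "<=")

# Keep only the rows where every variable is within range
keep <- apply(in_range, 1, all)
df_clean <- df[keep, ]
df_clean
```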
