How to remove records from a data frame whose values fall outside per-variable ranges? [R]

I have a data frame and a predictive model that I want to apply to the data. However, I want to filter out entries to which the model may not apply well. To do this, I have another data frame that contains, for each variable, the minimum and maximum observed in the training data. I want to remove from my new data those records for which one or more values are outside the specified range.

To make my question clear, here is how my data might look:

 id   x    y
 ---- ---- ---------
 1    2    30521
 2    -1   1835
 3    5    25939
 4    4    1000000

This is what my second table, with the minimum and maximum values, might look like:

 var   min   max
 ----- ----- -------
 x     1     5
 y     0     99999

In this example, I would like to flag the following entries in my data: 2 (below the minimum for x) and 4 (above the maximum for y).

How could I easily do this in R? I have a hunch that some smart dplyr code can perform this task, but I don't know what it would look like.

5 answers

You have data:

 df = data.frame(x=c(2,-1,5,4,7,8), y=c(30521, 1800, 25000,1000000, -5, 10))
 limits = data.frame("var"=c("x", "y"), min=c(1,0), max=c(5,99999))

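You can use the sweep function with the '>' and '<' operators. Note that these comparisons are strict, so a value equal to a limit counts as out of range; use '>=' and '<=' if the limits should be inclusive.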

 res = sweep(df, 2, limits[, 2], FUN='>') & sweep(df, 2, limits[, 3], FUN='<')
 res
 ####          x     y
 #### [1,]  TRUE  TRUE
 #### [2,] FALSE  TRUE
 #### [3,] FALSE  TRUE
 #### [4,]  TRUE FALSE
 #### [5,] FALSE FALSE
 #### [6,] FALSE  TRUE

The TRUE entries tell you, for each variable, which observations are within range and should be kept. This works for any number of variables, provided the rows of limits are in the same order as the columns of df.

After that, if you need a single flag per row (TRUE only when every variable is within range), you can run this simple line (res is the previous output):

 apply(res, 1, all)
 #### [1]  TRUE FALSE FALSE FALSE FALSE FALSE
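The sweep approach above pairs limits with columns by position. As a hedged alternative, here is a minimal base R sketch that looks up each variable by name and treats the limits as inclusive, which is what the question's expected result (only rows 2 and 4 flagged) implies. The helper name in_range is hypothetical, not part of any package, and the data is the example from the question.

```r
# Example data from the question
df <- data.frame(id = 1:4,
                 x  = c(2, -1, 5, 4),
                 y  = c(30521, 1835, 25939, 1000000))
ranges <- data.frame(var = c("x", "y"),
                     min = c(1, 0),
                     max = c(5, 99999),
                     stringsAsFactors = FALSE)

# in_range() is a hypothetical helper: it returns TRUE for rows where every
# variable listed in `ranges` lies within [min, max] (limits inclusive).
in_range <- function(data, ranges) {
  vars <- as.character(ranges$var)
  # One logical vector per variable, looked up by column name
  checks <- Map(function(v, lo, hi) data[[v]] >= lo & data[[v]] <= hi,
                vars, ranges$min, ranges$max)
  # Combine the per-variable checks row by row
  Reduce(`&`, checks)
}

keep <- in_range(df, ranges)
keep          # rows 1 and 3 are TRUE, rows 2 and 4 are FALSE
df[keep, ]    # only the in-range records remain
```

Missing values produce NA in keep, so they would need separate handling (for example with which(keep)) before subsetting.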

Not very elegant, but anyway:

 df <- read.table(header=TRUE, text="
 id x y
 1 2 30521
 2 -1 1835
 3 5 25939
 4 4 1000000
 ")
 df
 ranges <- read.table(header=TRUE, text="
 var min max
 x 1 5
 y 0 99999")
 # put the rows of ranges in the same order as the data columns of df
 ranges <- ranges[match(names(df)[-1], ranges[,1]), ]
 # TRUE if any variable in the row lies outside its [min, max] range
 df$flag <- matrixStats::rowAnys(
   !sapply(seq_along(df)[-1], function(x) {
     df[,x]>=ranges[x-1,2] & df[,x]<=ranges[x-1,3]
   })
 )
 df$flag
 # [1] FALSE  TRUE FALSE  TRUE

Something similar with dplyr:

 library(dplyr)
 df <- read.table(text = "
 id x y
 1 2 30521
 2 -1 1835
 3 5 25939
 4 4 1000000
 ", header = TRUE)
 dfilte <- read.table(text = "
 var min max
 x 1 5
 y 0 99999
 ", header = TRUE)
 # between() is inclusive on both ends; TRUE means the value is within range
 # (%in% would only test equality with the min or max, not range membership)
 df %>% mutate(flag_x = between(x, dfilte$min[1], dfilte$max[1]),
               flag_y = between(y, dfilte$min[2], dfilte$max[2]))

which produces this output:

   id  x       y flag_x flag_y
 1  1  2   30521   TRUE   TRUE
 2  2 -1    1835  FALSE   TRUE
 3  3  5   25939   TRUE   TRUE
 4  4  4 1000000   TRUE  FALSE
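The question asks to remove the out-of-range records, not only to flag them. A minimal dplyr sketch of that step, assuming the same example data, could use filter() with between(), which is inclusive on both ends. The lookup vectors lo and hi are names introduced here for illustration.

```r
library(dplyr)

# Example data from the question
df <- data.frame(id = 1:4,
                 x  = c(2, -1, 5, 4),
                 y  = c(30521, 1835, 25939, 1000000))
dfilte <- data.frame(var = c("x", "y"),
                     min = c(1, 0),
                     max = c(5, 99999))

# Named lookup vectors (hypothetical names) so that limits are fetched
# by variable name instead of by row position
lo <- setNames(dfilte$min, as.character(dfilte$var))
hi <- setNames(dfilte$max, as.character(dfilte$var))

# Keep only rows where both x and y are within their ranges
cleaned <- df %>%
  filter(between(x, lo[["x"]], hi[["x"]]),
         between(y, lo[["y"]], hi[["y"]]))

cleaned   # rows with id 1 and 3 remain
```

This sketch spells out each column, so it suits a handful of variables; with many variables a name-driven loop is likely more practical.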

I think your problem is well suited to the cut function in base R:

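 # include.lowest = TRUE keeps a value equal to the minimum inside the range
 df$to.remove <- is.na(cut(df$x, breaks = unlist(ranges[1, -1]), include.lowest = TRUE)) |
                 is.na(cut(df$y, breaks = unlist(ranges[2, -1]), include.lowest = TRUE))
 #  id  x       y to.remove
 #1  1  2   30521     FALSE
 #2  2 -1    1835      TRUE
 #3  3  5   25939     FALSE
 #4  4  4 1000000      TRUE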

is.na(...) gives you a logical vector in which values outside the specified range are TRUE, because cut returns NA for anything that falls outside its breaks. Finally, you use | (logical or) to decide which rows to remove.

To clean the data, you just need to do this:

 df <- df[!df$to.remove,] 

EDIT

I just noticed (from your comment) that your data frame contains more variables than just x and y. In this case, you can define a function named f and extend it in the same way to cover the variables that you have in your data frame.

 f <- function(x, xrange, y, yrange) {
   is.na(cut(x, breaks = xrange)) | is.na(cut(y, breaks = yrange))
 }
 res <- f(df$x, ranges[1,][-1], df$y, ranges[2,][-1])
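The function f above still takes exactly two variables. One way to extend the same cut idea to any number of variables is to loop over the rows of the ranges table, as in the sketch below. The helper name outside is hypothetical, and include.lowest = TRUE is an added assumption so that a value equal to the minimum counts as in range.

```r
# Example data from the question
df <- data.frame(id = 1:4,
                 x  = c(2, -1, 5, 4),
                 y  = c(30521, 1835, 25939, 1000000))
ranges <- data.frame(var = c("x", "y"),
                     min = c(1, 0),
                     max = c(5, 99999),
                     stringsAsFactors = FALSE)

# outside() is a hypothetical helper: cut() returns NA for values that fall
# outside [lo, hi], so is.na() marks the out-of-range entries.
outside <- function(values, lo, hi) {
  is.na(cut(values, breaks = c(lo, hi), include.lowest = TRUE))
}

# One logical column per variable listed in `ranges`
flags <- mapply(function(v, lo, hi) outside(df[[v]], lo, hi),
                as.character(ranges$var), ranges$min, ranges$max)

# A row is removed if any of its variables is out of range
df$to.remove <- rowSums(flags) > 0
df
```

Because cut also returns NA for missing input, rows with NA in a checked variable are flagged for removal too; whether that is desirable depends on the model.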

Data

 df <- structure(list(id = 1:4, x = c(2L, -1L, 5L, 4L),
                      y = c(30521L, 1835L, 25939L, 1000000L)),
                 .Names = c("id", "x", "y"), class = "data.frame",
                 row.names = c(NA, -4L))
 ranges <- structure(list(var = structure(1:2, .Label = c("x", "y"), class = "factor"),
                          min = c(1L, 0L), max = c(5L, 99999L)),
                     .Names = c("var", "min", "max"), class = "data.frame",
                     row.names = c(NA, -2L))

I don't fully understand your desired result, but this will work with any range and any amount of data:

 > df
   id  x       y
 1  1  2   30521
 2  2 -1    1835
 3  3  5   25939
 4  4  4 1000000

 # I transposed your filter data frame so it's easier to work with.
 > dfFilter
     x     y
 min 1     0
 max 5 99999

And then you can apply the range-based filter stored in dfFilter:

 # Note: > and < are strict, so a value equal to a limit is flagged "no";
 # use >= and <= if the limits should count as in range.
 # Flag original data frame with values between the minimum x and maximum x
 df$flag_x=ifelse(df$x > min(dfFilter$x) & df$x < max(dfFilter$x), "yes","no")
 # Flag original data frame with values between the minimum y and maximum y
 df$flag_y=ifelse(df$y > min(dfFilter$y) & df$y < max(dfFilter$y), "yes","no")

So, the output is as follows:

   id  x       y flag_x flag_y
 1  1  2   30521    yes    yes
 2  2 -1    1835     no    yes
 3  3  5   25939     no    yes
 4  4  4 1000000    yes     no

Of course, you can change these filters or apply any arithmetic to the limits to get the result you want (for example, lowering the minimum for x by 2: min(dfFilter$x) - 2).
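The answer does not show how the filter table was transposed, and it writes one line per variable. The sketch below is one possible way to do both steps for any number of variables. It assumes the original ranges table from the question and uses inclusive comparisons (>= and <=), which differs from the strict ones above.

```r
# Example data from the question
df <- data.frame(id = 1:4,
                 x  = c(2, -1, 5, 4),
                 y  = c(30521, 1835, 25939, 1000000))
ranges <- data.frame(var = c("x", "y"),
                     min = c(1, 0),
                     max = c(5, 99999),
                     stringsAsFactors = FALSE)

# Transpose: rows become "min" and "max", columns become the variable names
dfFilter <- as.data.frame(t(ranges[, c("min", "max")]))
names(dfFilter) <- as.character(ranges$var)

# Add one flag_<variable> column per variable in dfFilter
for (v in names(dfFilter)) {
  in_range <- df[[v]] >= dfFilter["min", v] & df[[v]] <= dfFilter["max", v]
  df[[paste0("flag_", v)]] <- ifelse(in_range, "yes", "no")
}

df
```

With inclusive limits, row 3 (x equal to the maximum of 5) gets "yes" for flag_x, so only rows 2 and 4 carry a "no", matching the entries the question wants flagged.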

Hope this works.


Source: https://habr.com/ru/post/1258021/

