subset() of a vector in R

Question


I wrote the following function, based on subset(), which I find convenient:

 ss <- function (x, subset, ...) {
   r <- eval(substitute(subset), data.frame(.=x), parent.frame())
   if (!is.logical(r)) stop("'subset' must be logical")
   x[r & !is.na(r)]
 }

So, I can write:

 ss(myDataFrame$MyVariableName, 500 < . & . < 1500)

instead of:

 myDataFrame$MyVariableName[ 500 < myDataFrame$MyVariableName & myDataFrame$MyVariableName < 1500]
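One difference between the two forms is worth spelling out: because ss() indexes with r & !is.na(r), missing values are dropped, whereas plain logical indexing keeps them as NA. A minimal sketch, assuming a made-up vector v (not from my real data) and repeating the definition so it runs on its own:

```r
# Same definition as above, repeated so the example is self-contained.
ss <- function(x, subset, ...) {
  r <- eval(substitute(subset), data.frame(. = x), parent.frame())
  if (!is.logical(r)) stop("'subset' must be logical")
  x[r & !is.na(r)]
}

# Made-up values for illustration, including one NA.
v <- c(100, 700, NA, 2000, 1200)

plain <- v[500 < v & v < 1500]      # an NA in the index yields NA in the result
viaSs <- ss(v, 500 < . & . < 1500)  # NA positions are treated as FALSE

print(plain)  # 700  NA 1200
print(viaSs)  # 700 1200
```

The long form therefore only matches ss() exactly when the variable has no missing values.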

It seems like something other people may already have developed solutions for, though, possibly including something in core R that I have missed. Is anything like this already out there?


r subset

Ken Williams Jan 19 '12 at 21:19


2 answers

42- · Answer 1 · 2012-01-19T22:29:23+0000

I realize that the solution Ken proposes is more general than just selecting elements within ranges (since it should work on any logical expression), but it reminded me that Greg Snow has comparison infix operators in his TeachingDemos package:

 library(TeachingDemos)
 x0 <- rnorm(100)
 x0[ 0 %<% x0 %<% 1.5 ]
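For readers without the package installed, the general idea of a user-defined infix operator can be sketched in base R. The operator name %inside% below is hypothetical and is not how TeachingDemos implements %<%; it only illustrates that any function whose name is wrapped in percent signs can be used between its two arguments:

```r
# Hypothetical operator (not part of TeachingDemos):
# TRUE where limits[1] < x < limits[2], both bounds exclusive.
`%inside%` <- function(x, limits) {
  stopifnot(length(limits) == 2)
  limits[1] < x & x < limits[2]
}

set.seed(1)  # fixed seed only so the example is repeatable
x0 <- rnorm(100)

picked <- x0[x0 %inside% c(0, 1.5)]

# Same elements as the spelled-out comparison.
stopifnot(identical(picked, x0[0 < x0 & x0 < 1.5]))
```

Unlike the chained form above, this sketch takes both limits on the right-hand side, which keeps the definition to a single ordinary function.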

Tyler Rinker · Answer 2 · 2012-01-19T21:49:49+0000

Thanks for sharing, Ken.

You can use:

 x <- myDataFrame$MyVariableName; x[x > 100 & x < 180]

Yours may require less typing, but the code is less portable if you are sharing it with others. I have a few time-saving functions like this myself, but I use them sparingly, because they can slow your code down (extra steps) and because you have to include the function's definition whenever you share a file with someone else.

Compare the length of what has to be typed. The two are almost the same:

 ss(mtcars$hp, 100 < . & . < 180)
 x <- mtcars$hp; x[x > 100 & x < 180]

Compare the time for 1000 replications.

 library(rbenchmark)
 benchmark(
     tyler = x[x > 100 & x < 180],
     ken = ss(mtcars$hp, 100 <. & . < 180),
     replications=1000)

    test replications elapsed relative user.self sys.self user.child sys.child
 2   ken         1000    0.56 18.66667      0.36     0.03         NA        NA
 1 tyler         1000    0.03  1.00000      0.03     0.00         NA        NA

So I think it depends on whether you need speed and/or ease of sharing more than convenience. If this is only for your own use on a small data set, I would say it is worthwhile.

EDIT: NEW BENCHMARKING

 > benchmark(
 +     tyler = {x <- mtcars$hp; x[x > 100 & x < 180]},
 +     ken = ss(mtcars$hp, 100 <. & . < 180),
 +     ken2 = ss2(mtcars$hp, 100 <. & . < 180),
 +     joran = with(mtcars,hp[hp>100 & hp< 180 ]),
 +     replications=10000)
    test replications elapsed  relative user.self sys.self user.child sys.child
 4 joran        10000    0.83  2.677419      0.69     0.00         NA        NA
 2   ken        10000    3.79 12.225806      3.45     0.02         NA        NA
 3  ken2        10000    0.67  2.161290      0.35     0.00         NA        NA
 1 tyler        10000    0.31  1.000000      0.20     0.00         NA        NA
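If you want to check this on your own machine without installing rbenchmark, a rough sketch using only base R follows. It first confirms that the tyler, ken and joran expressions return the same values, then times two of them with system.time(). ss2 is left out because its definition does not appear in this thread, and the timings will differ from the figures above depending on hardware and R version:

```r
# Ken's function from the question, repeated so the script runs on its own.
ss <- function(x, subset, ...) {
  r <- eval(substitute(subset), data.frame(. = x), parent.frame())
  if (!is.logical(r)) stop("'subset' must be logical")
  x[r & !is.na(r)]
}

x <- mtcars$hp

tyler <- x[x > 100 & x < 180]
ken   <- ss(mtcars$hp, 100 < . & . < 180)
joran <- with(mtcars, hp[hp > 100 & hp < 180])

# All three approaches select exactly the same elements.
stopifnot(identical(tyler, ken), identical(tyler, joran))

# Rough timing with base R only; n matches the first benchmark above.
n <- 1000
print(system.time(for (i in seq_len(n)) x[x > 100 & x < 180]))
print(system.time(for (i in seq_len(n)) ss(mtcars$hp, 100 < . & . < 180)))
```

A plain loop around system.time() is cruder than benchmark(), but it is enough to see the order-of-magnitude difference.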
