The most efficient way to subset data

Question


Can anyone suggest a more efficient way to subset data without using SQL, indexing, or data.table?

I looked for similar questions, and this one suggests an indexing option.

Here are the subset methods I tried, with timings.

    # Dummy data
    dat <- data.frame(x = runif(1000000, 1, 1000), y = runif(1000000, 1, 1000))

    # Subset and time
    system.time(x <- dat[dat$x > 500, ])
    #    user  system elapsed
    #   0.092   0.000   0.090

    system.time(x <- dat[which(dat$x > 500), ])
    #    user  system elapsed
    #   0.040   0.032   0.070

    system.time(x <- subset(dat, x > 500))
    #    user  system elapsed
    #   0.108   0.004   0.109

EDIT: As Roland suggested, I used microbenchmark. which() seems to perform the best.

    library("ggplot2")
    library("microbenchmark")

    # Dummy data
    dat <- data.frame(x = runif(1000000, 1, 1000), y = runif(1000000, 1, 1000))

    # Benchmark the three subsetting approaches
    res <- microbenchmark(
      dat[dat$x > 500, ],
      dat[which(dat$x > 500), ],
      subset(dat, x > 500))

    # Plot; autoplot() dispatches to the method for microbenchmark objects
    autoplot(res)


performance r dataframe subset

zx8754 Jun 27 '13 at 10:19


1 answer

zx8754 · Accepted Answer · 2013-09-12T09:25:09+0000

As Roland suggested, I used microbenchmark. which() seems to perform the best.

    library("ggplot2")
    library("microbenchmark")

    # Dummy data
    dat <- data.frame(x = runif(1000000, 1, 1000), y = runif(1000000, 1, 1000))

    # Benchmark the three subsetting approaches
    res <- microbenchmark(
      dat[dat$x > 500, ],
      dat[which(dat$x > 500), ],
      subset(dat, x > 500))

    # Plot; autoplot() dispatches to the method for microbenchmark objects
    autoplot(res)
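One caveat worth checking before choosing on speed alone: the three approaches are not strictly equivalent when the condition can be NA. The sketch below is base R only. It uses a small dummy data frame with one missing value added purely for illustration (the benchmark data above has no NAs), so treat it as a quick check rather than part of the benchmark.

```r
# Small dummy data; the single NA in x is an assumption added for illustration
set.seed(1)
dat <- data.frame(x = runif(10, 1, 1000), y = runif(10, 1, 1000))
dat$x[3] <- NA

# Plain logical indexing: an NA in the condition produces an all-NA row
by_logical <- dat[dat$x > 500, ]

# which() returns only the positions that are TRUE, so the NA is dropped
by_which <- dat[which(dat$x > 500), ]

# subset() also treats an NA condition as FALSE
by_subset <- subset(dat, x > 500)

# Logical indexing returns one extra row here: the row for the NA condition
nrow(by_logical) - nrow(by_which)

# Compare the which() and subset() results
all.equal(by_which, by_subset)
```

If the column can contain NAs, which() or subset() are probably closer to what is wanted, while plain logical indexing needs an explicit !is.na() check.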
