The most efficient way to subset data

Can anyone suggest a more efficient way to subset data without using SQL, indexing, or data.table?

I looked through similar questions, and this one offers an indexing option.

Here are the subsetting methods I tried, with timings.

    # Dummy data
    dat <- data.frame(x = runif(1000000, 1, 1000), y = runif(1000000, 1, 1000))

    # Subset and time
    system.time(x <- dat[dat$x > 500, ])
    #    user  system elapsed
    #   0.092   0.000   0.090

    system.time(x <- dat[which(dat$x > 500), ])
    #    user  system elapsed
    #   0.040   0.032   0.070

    system.time(x <- subset(dat, x > 500))
    #    user  system elapsed
    #   0.108   0.004   0.109
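A single `system.time()` call can vary noticeably from run to run, so one way to get steadier numbers in base R is to repeat each call and compare medians. The following is only a minimal sketch: `median_elapsed` is a hypothetical helper introduced for illustration, and the data set is smaller than the one in the question so that it runs quickly.

```r
set.seed(1)  # reproducible dummy data (smaller than in the question)
dat <- data.frame(x = runif(100000, 1, 1000), y = runif(100000, 1, 1000))

# Hypothetical helper: median elapsed time over `reps` runs of a function
median_elapsed <- function(f, reps = 20) {
  median(replicate(reps, system.time(f())[["elapsed"]]))
}

timings <- c(
  logical = median_elapsed(function() dat[dat$x > 500, ]),
  which   = median_elapsed(function() dat[which(dat$x > 500), ]),
  subset  = median_elapsed(function() subset(dat, x > 500))
)

# Fastest method first; the actual values depend on the machine
sort(timings)
```

This is coarser than a dedicated benchmarking package, but it needs nothing beyond base R.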

EDIT: As Roland suggested, I used microbenchmark. It seems that which() performs best.

    library("ggplot2")
    library("microbenchmark")

    # Dummy data
    dat <- data.frame(x = runif(1000000, 1, 1000), y = runif(1000000, 1, 1000))

    # Benchmark the three subsetting methods
    res <- microbenchmark(
      dat[dat$x > 500, ],
      dat[which(dat$x > 500), ],
      subset(dat, x > 500))

    # Plot; autoplot() dispatches to the microbenchmark method
    autoplot(res)

[Image: microbenchmark plot of the three subsetting methods]

1 answer

As Roland suggested, I used microbenchmark. It seems that which() performs best.

    library("ggplot2")
    library("microbenchmark")

    # Dummy data
    dat <- data.frame(x = runif(1000000, 1, 1000), y = runif(1000000, 1, 1000))

    # Benchmark the three subsetting methods
    res <- microbenchmark(
      dat[dat$x > 500, ],
      dat[which(dat$x > 500), ],
      subset(dat, x > 500))

    # Plot; autoplot() dispatches to the microbenchmark method
    autoplot(res)
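One thing to keep in mind when choosing between these methods is that they are not strictly interchangeable once the column contains missing values. The sketch below uses a small made-up data frame (`small` is a hypothetical name, not part of the benchmark) to show the difference: logical indexing keeps a row of NAs for every NA in the condition, while `which()` and `subset()` drop those rows.

```r
# Small data frame with one missing value in x (made-up example data)
small <- data.frame(x = c(100, 600, NA, 900), y = 1:4)

by_logical <- small[small$x > 500, ]         # NA in the condition yields an all-NA row
by_which   <- small[which(small$x > 500), ]  # which() keeps only the TRUE positions
by_subset  <- subset(small, x > 500)         # subset() treats NA as FALSE

nrow(by_logical)  # 3: two matching rows plus one NA row
nrow(by_which)    # 2
nrow(by_subset)   # 2
```

On the dummy data in the benchmark there are no NAs, so all three methods return the same rows there.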


Source: https://habr.com/ru/post/1488464/

