I need to drop one column from a data.table containing several hundred columns.
With data.frame I would use subset for this:
> dat <- data.table( data.frame(x=runif(10),y=rep(letters[1:5],2),z=runif(10)),key='y' )
> subset(dat,select=c(-z))
            x y
 1: 0.1969049 a
 2: 0.7916696 a
 3: 0.9095970 b
 4: 0.3529506 b
 5: 0.4923602 c
 6: 0.5993034 c
 7: 0.1559861 d
 8: 0.9929333 d
 9: 0.3980169 e
10: 0.1921226 e
Obviously this still works, but it doesn't seem like a very data.table-like idiom. I could manually create a list of the column names I want to keep, which seems a bit more data.table-like:
> dat[,list(x,y)]
            x y
 1: 0.1969049 a
 2: 0.7916696 a
 3: 0.9095970 b
 4: 0.3529506 b
 5: 0.4923602 c
 6: 0.5993034 c
 7: 0.1559861 d
 8: 0.9929333 d
 9: 0.3980169 e
10: 0.1921226 e
But then I need to build such a list, which is awkward with several hundred columns.
Is subset the right way to conveniently remove a column or two, or does it incur a performance penalty? If so, what is the better way?
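For what it's worth, the keep-list does not have to be typed out by hand. A minimal sketch, assuming the column to drop is known by name (the variable names `keep` and `res` are only illustrative), builds it with `setdiff()` and passes it with `with = FALSE`:

```r
library(data.table)

# Same shape of data as the example above.
dat <- data.table(x = runif(10), y = rep(letters[1:5], 2), z = runif(10), key = "y")

# Derive the columns to keep from the existing names instead of listing them.
keep <- setdiff(names(dat), "z")

# with = FALSE tells data.table to treat `keep` as a character vector of column names.
res <- dat[, keep, with = FALSE]
names(res)
```

This still returns a new data.table rather than modifying `dat`, so it does not by itself settle the performance question.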
Edit
Benchmarks:
> dat <- data.table( data.frame(x=runif(10^7),y=rep(letters[1:10],10^6),z=runif(10^7)),key='y' )
> microbenchmark( subset(dat,select=c(-z)), dat[,list(x,y)] )
Unit: milliseconds
                         expr       min        lq    median        uq      max
1           dat[, list(x, y)] 102.62826 167.86793 170.72847 199.89789 792.0207
2 subset(dat, select = c(-z))  33.26356  52.55311  53.53934  55.00347 180.8740
But where it might really matter is memory, if subset copies the whole data.table.
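One rough way to check for copying, sketched here as an assumption about how one might test it rather than a definitive method, is to compare the memory address of a retained column before and after the call using `address()` from data.table (`sub` and `same_memory` are illustrative names):

```r
library(data.table)

dat <- data.table(x = runif(10), y = rep(letters[1:5], 2), z = runif(10), key = "y")

# Drop z the subset() way and keep the result.
sub <- subset(dat, select = -z)

# If the retained column x lives at a different address in the result,
# its data was copied rather than shared with the original table.
same_memory <- address(dat$x) == address(sub$x)
same_memory
```

A `FALSE` here would indicate that the retained columns were duplicated, which is where the memory cost would come from on a table with several hundred columns.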